Multi-Objective Evolutionary Rule-Based Classification with Categorical Data

Fernando Jiménez; Carlos Martínez; Luis Miralles-Pechuán; Gracia Sánchez; Guido Sciavicco

2018 Sep 7;20(9):684. doi: 10.3390/e20090684

PMCID: PMC7513209 PMID: 33265773

Abstract

The ease of interpretation of a classification model is essential for the task of validating it. Sometimes it is required to clearly explain the classification process of a model’s predictions. Models which are inherently easier to interpret can be effortlessly related to the context of the problem, and their predictions can be, if necessary, ethically and legally evaluated. In this paper, we propose a novel method to generate rule-based classifiers from categorical data that can be readily interpreted. Classifiers are generated using a multi-objective optimization approach focusing on two main objectives: maximizing the performance of the learned classifier and minimizing its number of rules. The multi-objective evolutionary algorithms ENORA and NSGA-II have been adapted to optimize the performance of the classifier based on three different machine learning metrics: accuracy, area under the ROC curve, and root mean square error. We have extensively compared the generated classifiers using our proposed method with classifiers generated using classical methods such as PART, JRip, OneR and ZeroR. The experiments have been conducted in full training mode, in 10-fold cross-validation mode, and in train/test splitting mode. To make results reproducible, we have used the well-known and publicly available datasets Breast Cancer, Monk’s Problem 2, Tic-Tac-Toe-Endgame, Car, kr-vs-kp and Nursery. After performing an exhaustive statistical test on our results, we conclude that the proposed method is able to generate highly accurate and easy to interpret classification models.

Keywords: multi-objective evolutionary algorithms, rule-based classifiers, interpretable machine learning, categorical data

1. Introduction

Supervised Learning is the branch of Machine Learning (ML) [1] focused on modeling the behavior of systems that can be found in the environment. Supervised models are created from a set of past records, each of which usually consists of an input vector labeled with an output. A supervised model is an algorithm that simulates the function that maps inputs to outputs [2]. The best models are those that predict the output of new inputs most accurately. Thanks to modern computing capabilities, and to the digitization of ever-increasing quantities of data, supervised learning techniques nowadays play a leading role in many applications. The first classification systems date back to the 1990s; in those days, researchers were focused on both precision and interpretability, and the systems to be modeled were relatively simple. Years later, when it became necessary to model more difficult behaviors, researchers focused on developing more and more precise models, leaving interpretability aside. Artificial Neural Networks (ANN) [3], and, more recently, Deep Learning Neural Networks (DLNN) [4], as well as Support Vector Machines (SVM) [5], and Instance-based Learning (IBL) [6] are archetypical examples of this approach. A DLNN, for example, is a large mesh of ordered nodes arranged in a hierarchical manner and composed of a huge number of variables. DLNNs are capable of modeling very complex behaviors, but it is extremely difficult to understand the logic behind their predictions, and similar considerations can be drawn for SVMs and IBLs, although the underlying principles are different. These models are known as black-box methods. While there are applications in which knowing the rationale behind a prediction is not necessarily relevant (e.g., predicting a currency’s future value, whether or not a user clicks on an advert, or the amount of rain in a certain area), there are other situations where the interpretability of a model plays a key role.

The interpretability of classification systems refers to their ability to explain their behavior in a way that is easily understandable by a user [7]. In other words, a model is considered interpretable when a human is able to understand the logic behind its prediction. In this way, interpretable classification models allow external validation by an expert. Additionally, there are certain disciplines such as medicine, where it is essential to provide information about decision making for ethical and human reasons. Likewise, when a public institution asks an authority for permission to investigate an alleged offender, or when the CEO of a certain company wants to take a difficult decision which can seriously change the direction of the company, some kind of explanation to justify these decisions may be required. In these situations, using transparent (also called grey-box) models is recommended. While there is a general consensus on how the performance of a classification system is measured (popular metrics include accuracy, area under the ROC curve, and root mean square error), there is no universally accepted metric to measure the interpretability of the models. Nor is there an ideal balance between the interpretability and performance of classification systems; the right trade-off depends on the specific application domain. However, the rule of thumb says that the simpler a classification system is, the easier it is to interpret. Rule-based Classifiers (RBC) [8,9] are among the most popular interpretable models, and some authors define the degree of interpretability of an RBC as the number of its rules or as the number of axioms that the rules have. These metrics tend to reward models with fewer and simpler rules [10,11]. In general, RBCs are classification learning systems that achieve a high level of interpretability because they are based on a human-like logic. Rules follow a very simple schema:

IF (Condition 1) and (Condition 2) and … (Condition N) THEN (Statement)

and the fewer rules the models have and the fewer conditions and attributes the rules have, the easier it will be for a human to understand the logic behind each classification. In fact, RBCs are so natural in some applications that they are used to interpret other classification models such as Decision Trees (DT) [12]. RBCs constitute the basis of more complex classification systems based on fuzzy logic [13] such as LogitBoost or AdaBoost [14].
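As a minimal sketch of the IF ... THEN schema above, the following Python fragment stores each rule as a set of attribute/category conditions plus a consequent class. The names `Rule` and `classify`, the first-match policy, and the `default` class are illustrative assumptions; they are not the classifier defined later in the paper.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """IF every (attribute == category) condition holds THEN predict `consequent`."""
    conditions: dict  # attribute name -> required category
    consequent: str

    def matches(self, example: dict) -> bool:
        # A rule fires only when all of its conditions hold (logical AND).
        return all(example.get(attr) == cat for attr, cat in self.conditions.items())


def classify(rules, example, default):
    """Return the consequent of the first matching rule, else `default` (assumed policy)."""
    for rule in rules:
        if rule.matches(example):
            return rule.consequent
    return default


# Two hypothetical rules over Breast Cancer attribute names.
rules = [
    Rule({"node-caps": "no", "deg-malig": "1"}, "no-recurrence-events"),
    Rule({"node-caps": "yes", "deg-malig": "3"}, "recurrence-events"),
]
example = {"age": "50-59", "node-caps": "yes", "deg-malig": "3"}
print(classify(rules, example, default="no-recurrence-events"))
```

Fewer rules and fewer conditions per rule mean fewer comparisons for a human reader to trace, which is the intuition behind using the number of rules as an interpretability measure.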

Our approach investigates the conflict between accuracy and interpretability as a multi-objective optimization problem. We define a solution as a set of rules (that is, a classifier), and establish two objectives to be maximized: interpretability and accuracy. We decided to solve this problem by applying multi-objective evolutionary algorithms (MOEA) [15,16] as meta-heuristics, and, in particular, two known algorithms: NSGA-II [15] and ENORA [17]. They are both state-of-the-art evolutionary algorithms which have been applied, and compared, on several occasions [18,19,20]. NSGA-II is very well-known and has the advantage of being available in many implementations, while ENORA generally has a higher performance. In the current literature, MOEAs are mainly used for learning RBCs based on fuzzy logic [18,21,22,23,24,25,26]. However, Fuzzy RBCs are designed for numerical data, from which fuzzy sets are constructed and represented by linguistic labels. In this paper, on the contrary, we are interested in RBCs for categorical data, for which a novel approach is necessary.

This paper is organized as follows. In Section 2, we introduce multi-objective constrained optimization, the evolutionary algorithms ENORA and NSGA-II, and the well-known rule-based classifier learning systems PART, JRip, OneR and ZeroR. In Section 3, we describe the structure of an RBC for categorical data, and we propose the use of multi-objective optimization for the task of learning a classifier. In Section 4, we show the results of our experiments, performed on the well-known publicly accessible datasets Breast Cancer, Monk’s Problem 2, Tic-Tac-Toe-Endgame, Car, kr-vs-kp and Nursery. The experiments allow a comparison of the performance of the classifiers learned by our technique against that of classifiers learned by PART, JRip, OneR and ZeroR, as well as a comparison between ENORA and NSGA-II for the purposes of this task. In Section 5, the results are analyzed and discussed, before concluding in Section 6. Appendix A and Appendix B show the tables of the statistical test results. Appendix C shows the symbols and the nomenclature used in the paper.

2. Background

2.1. Multi-Objective Constrained Optimization

The term optimization [27] refers to the selection of the best element, with regard to some criteria, from a set of alternative elements. Mathematical programming [28] deals with the theory, algorithms, methods and techniques to represent and solve optimization problems. In this paper, we are interested in a class of mathematical programming problems called multi-objective constrained optimization problems [29], which can be formally defined, for l objectives and m constraints, as follows:

\begin{matrix} \text{Min./Max.} & f_{i} (x), & i = 1, \dots, l \\ \text{subject to} & g_{j} (x) \leq 0, & j = 1, \dots, m \end{matrix} \qquad (1)

where $f_{i} (x)$ (usually called objectives) and $g_{j} (x)$ are arbitrary functions. Optimization problems can be naturally separated into two categories: those with discrete variables, which we call combinatorial, and those with continuous variables. In combinatorial problems, we are looking for objects from a finite, or countably infinite, set $X$, where objects are typically integers, sets, permutations, or graphs. In problems with continuous variables, instead, we look for real parameters belonging to some continuous domain. In Equation (1), $x = \{x_{1}, x_{2}, \dots, x_{w}\} \in X^{w}$ represents the set of decision variables, where $X$ is the domain for each variable $x_{k}$, $k = 1, \dots, w$.

Now, let $F = \{x \in X^{w} \mid g_{j} (x) \leq 0, j = 1, \dots, m\}$ be the set of all feasible solutions to Equation (1). We want to find a subset of solutions $S \subseteq F$ called the non-dominated set (or Pareto optimal set). A solution $x \in F$ is non-dominated if there is no other solution $x^{'} \in F$ that dominates $x$, and a solution $x^{'}$ dominates $x$ if and only if there exists i ($1 \leq i \leq l$) such that $f_{i} (x^{'})$ improves $f_{i} (x)$, and for every i ($1 \leq i \leq l$), $f_{i} (x)$ does not improve $f_{i} (x^{'})$. In other words, $x^{'}$ dominates $x$ if and only if $x^{'}$ is better than $x$ for at least one objective, and not worse than $x$ for any other objective. The set $S$ of non-dominated solutions of Equation (1) can be formally defined as:

S = \{x \in F \mid \nexists x^{'} (x^{'} \in F \land D (x^{'}, x))\}

where:

D (x^{'}, x) = \exists i (1 \leq i \leq l, f_{i} (x^{'}) < f_{i} (x)) \land \forall i (1 \leq i \leq l, f_{i} (x^{'}) \leq f_{i} (x)) .
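The domination predicate $D$ and the set $S$ translate almost literally into code. The sketch below assumes that every objective is minimized, as in the definition of $D$ above, and that each solution is represented only by its tuple of objective values; the function names and the sample points (error rate, number of rules) are hypothetical.

```python
def dominates(fa, fb):
    """True if objective vector `fa` Pareto-dominates `fb` (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(fa, fb))
    strictly_better = any(a < b for a, b in zip(fa, fb))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the points that no other point dominates (the set S)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors: (error rate, number of rules).
points = [(0.21, 2), (0.25, 2), (0.06, 47), (0.30, 1), (0.22, 7)]
print(non_dominated(points))
```

In this example, (0.25, 2) and (0.22, 7) are both dominated by (0.21, 2), while the remaining three points trade error against size and are mutually non-dominated.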

Once the set of optimal solutions is available, the most satisfactory one can be chosen by applying a preference criterion. When all the functions $f_{i}$ are linear, then the problem is a linear programming problem [30], which is the classical mathematical programming problem and for which extremely efficient algorithms to obtain the optimal solution exist (e.g., the simplex method [31]). When any of the functions $f_{i}$ is non-linear then we have a non-linear programming problem [32]. A non-linear programming problem in which the objectives are arbitrary functions is, in general, intractable. In principle, any search algorithm can be used to solve combinatorial optimization problems, although it is not guaranteed that it will find an optimal solution. Metaheuristic methods such as evolutionary algorithms [33] are typically used to find approximate solutions for complex multi-objective optimization problems, including feature selection and fuzzy classification.

2.2. The Multi-Objective Evolutionary Algorithms ENORA and NSGA-II

The MOEAs ENORA [17] and NSGA-II [15] use a $(μ + λ)$ strategy (Algorithm 1) with $μ = λ = popsize$, where $μ$ is the number of parents, $λ$ is the number of children, and $popsize$ is the population size, with binary tournament selection (Algorithm 2) and a rank function based on Pareto fronts and crowding (Algorithms 3 and 4). The difference between NSGA-II and ENORA lies in how the rank of the individuals in the population is calculated. In ENORA, each individual belongs to a slot (as established in [34]) of the objective search space, and the rank of an individual in a population is the non-domination level of the individual in its slot. In NSGA-II, on the other hand, the rank of an individual in a population is the non-domination level of the individual in the whole population. Both ENORA and NSGA-II use the same non-dominated sorting algorithm, the fast non-dominated sorting [35]. It compares each solution with the rest of the solutions and stores the results so as to avoid duplicate comparisons between every pair of solutions. For a problem with l objectives and a population with N solutions, this method needs to conduct $l \cdot N \cdot (N - 1)$ objective comparisons, which means that it has a time complexity of $O (l \cdot N^{2})$ [36]. However, ENORA distributes the population in N slots (in the best case); therefore, the time complexity of ENORA is $O (l \cdot N^{2})$ in the worst case and $O (l \cdot N)$ in the best case.
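To make the quadratic cost concrete, here is a minimal sketch of non-dominated sorting over a whole population, which corresponds to the NSGA-II notion of rank described above. It assumes minimized objectives and works on plain tuples of objective values; the function names and sample data are illustrative, and the slot-based ranking of ENORA is not reproduced here.

```python
def dominates(fa, fb):
    """True if `fa` Pareto-dominates `fb` (all objectives minimized)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))


def fast_non_dominated_sort(objs):
    """Return a list of fronts, each a list of indices into `objs`."""
    n = len(objs)
    dominated = [[] for _ in range(n)]  # dominated[i]: indices that i dominates
    counts = [0] * n                    # counts[i]: how many solutions dominate i
    for i in range(n):
        for j in range(i + 1, n):       # each pair is visited once and the result stored
            if dominates(objs[i], objs[j]):
                dominated[i].append(j)
                counts[j] += 1
            elif dominates(objs[j], objs[i]):
                dominated[j].append(i)
                counts[i] += 1
    fronts = [[i for i in range(n) if counts[i] == 0]]
    while fronts[-1]:
        nxt = []
        for i in fronts[-1]:
            for j in dominated[i]:
                counts[j] -= 1          # front members no longer count as dominators
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
    return fronts[:-1]                  # drop the trailing empty front


# Hypothetical objective vectors: (error rate, number of rules).
objs = [(0.21, 2), (0.25, 2), (0.06, 47), (0.30, 1), (0.22, 7), (0.26, 8)]
print(fast_non_dominated_sort(objs))
```

The rank of an individual is the index of the front it falls in. The double loop over pairs is where the $O (l \cdot N^{2})$ bound comes from: partitioning the population into slots and sorting within each slot shrinks the number of pairs that have to be compared.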

Algorithm 1. $(μ + λ)$ strategy for multi-objective optimization.

Require: $T > 1$ {Number of generations}
Require: $N > 1$ {Number of individuals in the population}

Initialize P with N individuals
Evaluate all individuals of P
$t \leftarrow 0$
while $t < T$ do
    $Q \leftarrow \emptyset$
    $i \leftarrow 0$
    while $i < N$ do
        Parent1 ← Binary tournament selection from P
        Parent2 ← Binary tournament selection from P
        Child1, Child2 ← Crossover(Parent1, Parent2)
        Offspring1 ← Mutation(Child1)
        Offspring2 ← Mutation(Child2)
        Evaluate Offspring1
        Evaluate Offspring2
        $Q \leftarrow Q \cup \{\text{Offspring1}, \text{Offspring2}\}$
        $i \leftarrow i + 2$
    end while
    $R \leftarrow P \cup Q$
    $P \leftarrow$ N best individuals from R according to the rank-crowding function in population R
    $t \leftarrow t + 1$
end while
return Non-dominated individuals from P
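The loop of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation. The rank-crowding function is replaced by a stand-in (the number of individuals that dominate a given one), all objectives are minimized, and the toy bit-string problem together with the helper names `init`, `evaluate`, `crossover` and `mutate` is hypothetical.

```python
import random


def dominates(fa, fb):
    """True if `fa` Pareto-dominates `fb` (all objectives minimized)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))


def ranks_of(fits):
    """Stand-in rank (assumption): how many individuals dominate each one."""
    return [sum(dominates(g, f) for g in fits) for f in fits]


def tournament(pop, ranks, rng):
    """Binary tournament: draw two individuals, keep the better-ranked one."""
    a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
    return pop[a] if ranks[a] <= ranks[b] else pop[b]


def mu_plus_lambda(init, evaluate, crossover, mutate, T, N, seed=0):
    rng = random.Random(seed)
    P = [init(rng) for _ in range(N)]
    for _ in range(T):
        ranks = ranks_of([evaluate(x) for x in P])
        Q = []
        while len(Q) < N:  # two offspring per iteration, as in Algorithm 1
            c1, c2 = crossover(tournament(P, ranks, rng), tournament(P, ranks, rng), rng)
            Q += [mutate(c1, rng), mutate(c2, rng)]
        R = P + Q          # parents and children compete together
        r_ranks = ranks_of([evaluate(x) for x in R])
        best = sorted(range(len(R)), key=lambda i: r_ranks[i])[:N]
        P = [R[i] for i in best]
    fits = [evaluate(x) for x in P]
    return [x for x, f in zip(P, fits) if not any(dominates(g, f) for g in fits)]


# Toy problem (hypothetical): 8-bit tuples, minimize (ones, zeros in first half).
def init(rng):
    return tuple(rng.randint(0, 1) for _ in range(8))

def evaluate(x):
    return (sum(x), 4 - sum(x[:4]))

def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))  # one-point crossover
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(x, rng):
    i = rng.randrange(len(x))       # flip one randomly chosen bit
    return x[:i] + (1 - x[i],) + x[i + 1:]

front = mu_plus_lambda(init, evaluate, crossover, mutate, T=30, N=20)
print(sorted({evaluate(x) for x in front}))
```

Because survivors are drawn from $R = P \cup Q$, a good parent is never lost to a worse child, which is the defining property of the $(μ + λ)$ scheme. For brevity the sketch re-evaluates individuals instead of caching their fitness.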

Codification for Rule Set					Codification for Adaptive Crossing and Mutation
Antecedents				Consequent	Associated Crossing	Associated Mutation
$b_{11}^{I}$	$b_{21}^{I}$	…	$b_{q 1}^{I}$	$c_{1}^{I}$
⋮	⋮	⋮	⋮	⋮	$d_{I}$	$e_{I}$
$b_{1 M_{I}}^{I}$	$b_{2 M_{I}}^{I}$	…	$b_{q M_{I}}^{I}$	$c_{M_{I}}^{I}$

#	Attribute Name	Type	Possible Values
1	age	categorical	10–19, 20–29, 30–39, 40–49, 50–59, 60–69, 70–79, 80–89, 90–99
2	menopause	categorical	lt40, ge40, premeno
3	tumour-size	categorical	0–4, 5–9, 10–14, 15–19, 20–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59
4	inv-nodes	categorical	0–2, 3–5, 6–8, 9–11, 12–14, 15–17, 18–20, 21–23, 24–26, 27–29, 30–32, 33–35, 36–39
5	node-caps	categorical	yes, no
6	deg-malig	categorical	1, 2, 3
7	breast	categorical	left, right
8	breast-quad	categorical	left-up, left-low, right-up, right-low, central
9	irradiat	categorical	yes, no
10	class	categorical	no-recurrence-events, recurrence-events

#	Attribute Name	Type	Possible Values
1	head_shape	categorical	round, square, octagon
2	body_shape	categorical	round, square, octagon
3	is_smiling	categorical	yes, no
4	holding	categorical	sword, balloon, flag
5	jacket_color	categorical	red, yellow, green, blue
6	has_tie	categorical	yes, no
7	class	categorical	yes, no

Method	Breast Cancer	Monk’s Problem 2
ENORA-ACC	244.92 s.	428.14 s.
ENORA-AUC	294.75 s.	553.11 s.
ENORA-RMSE	243.30 s.	414.42 s.
NSGA-II-ACC	127.13 s.	260.83 s.
NSGA-II-AUC	197.07 s.	424.83 s.
NSGA-II-RMSE	134.87 s.	278.19 s.

Learning Model	Number of Rules	Percent Correct	TP Rate	FP Rate	Precision	Recall	F-Measure	MCC	ROC Area	PRC Area	RMSE
ENORA-ACC	2	79.02	0.790	0.449	0.796	0.790	0.762	0.455	0.671	0.697	0.458
ENORA-AUC	2	75.87	0.759	0.374	0.751	0.759	0.754	0.402	0.693	0.696	0.491
ENORA-RMSE	2	77.62	0.776	0.475	0.778	0.776	0.744	0.410	0.651	0.680	0.473
NSGA-II-ACC	2	77.97	0.780	0.501	0.805	0.780	0.738	0.429	0.640	0.679	0.469
NSGA-II-AUC	2	75.52	0.755	0.368	0.749	0.755	0.752	0.399	0.693	0.696	0.495
NSGA-II-RMSE	2	79.37	0.794	0.447	0.803	0.794	0.765	0.467	0.673	0.700	0.454
PART	15	78.32	0.783	0.397	0.773	0.783	0.769	0.442	0.777	0.793	0.398
JRip	3	76.92	0.769	0.471	0.762	0.769	0.740	0.389	0.650	0.680	0.421
OneR	1	72.72	0.727	0.563	0.703	0.727	0.680	0.241	0.582	0.629	0.522
ZeroR	-	70.27	0.703	0.703	0.494	0.703	0.580	0.000	0.500	0.582	0.457

Rule	Antecedents	Consequent
$R_{1}$:	IF age = 50–59 AND inv-nodes = 0–2 AND node-caps = no AND deg-malig = 1 AND breast = right AND breast-quad = left-low	THEN class = no-recurrence-events
$R_{2}$:	IF age = 60–69 AND inv-nodes = 18–20 AND node-caps = yes AND deg-malig = 3 AND breast = left AND breast-quad = right-up	THEN class = recurrence-events

Rule	Antecedents	Consequent
$R_{1}$:	IF head_shape = round AND body_shape = round AND is_smiling = no AND holding = sword AND jacket_color = red AND has_tie = yes	THEN class = yes
$R_{2}$:	IF head_shape = octagon AND body_shape = round AND is_smiling = no AND holding = sword AND jacket_color = red AND has_tie = no	THEN class = yes
$R_{3}$:	IF head_shape = round AND body_shape = round AND is_smiling = no AND holding = sword AND jacket_color = yellow AND has_tie = yes	THEN class = yes
$R_{4}$:	IF head_shape = round AND body_shape = round AND is_smiling = no AND holding = sword AND jacket_color = red AND has_tie = no	THEN class = yes
$R_{5}$:	IF head_shape = square AND body_shape = square AND is_smiling = yes AND holding = flag AND jacket_color = yellow AND has_tie = no	THEN class = no
$R_{6}$:	IF head_shape = octagon AND body_shape = round AND is_smiling = yes AND holding = balloon AND jacket_color = blue AND has_tie = no	THEN class = no
$R_{7}$:	IF head_shape = octagon AND body_shape = octagon AND is_smiling = yes AND holding = sword AND jacket_color = green AND has_tie = no	THEN class = no

Learning Model	Percent Correct	ROC Area	Serialized Model Size
ENORA-ACC	73.45	0.61	9554.80
ENORA-AUC	70.16	0.62	9554.63
ENORA-RMSE	72.39	0.60	9557.77
NSGA-II-ACC	72.50	0.60	9556.20
NSGA-II-AUC	70.03	0.61	9555.70
NSGA-II-RMSE	73.34	0.60	9558.60
PART	68.92	0.61	55,298.13
JRip	71.82	0.61	7664.07
OneR	67.15	0.55	1524.00
ZeroR	70.30	0.50	915.00

Learning Model	Percent Correct	ROC Area	Serialized Model Size
ENORA-ACC	76.69	0.70	9586.50
ENORA-AUC	72.82	0.79	9589.30
ENORA-RMSE	75.66	0.68	9585.30
NSGA-II-ACC	70.07	0.59	9590.60
NSGA-II-AUC	67.08	0.70	9619.70
NSGA-II-RMSE	67.63	0.54	9565.90
PART	73.51	0.79	73,115.90
JRip	64.05	0.50	5956.90
OneR	65.72	0.50	1313.00
ZeroR	65.72	0.50	888.00

#	Attribute Name	Type	Possible Values
1	top-left-square	categorical	x, o, b
2	top-middle-square	categorical	x, o, b
3	top-right-square	categorical	x, o, b
4	middle-left-square	categorical	x, o, b
5	middle-middle-square	categorical	x, o, b
6	middle-right-square	categorical	x, o, b
7	bottom-left-square	categorical	x, o, b
8	bottom-middle-square	categorical	x, o, b
9	bottom-right-square	categorical	x, o, b
10	class	categorical	positive, negative

Learning Model	Number of Rules	Percent Correct	TP Rate	FP Rate	Precision	Recall	F-Measure	MCC	ROC Area	PRC Area	RMSE
ENORA-ACC	7	75.87	0.759	0.370	0.753	0.759	0.745	0.436	0.695	0.680	0.491
ENORA-AUC	7	68.71	0.687	0.163	0.836	0.687	0.687	0.523	0.762	0.729	0.559
ENORA-RMSE	7	77.70	0.777	0.360	0.777	0.777	0.762	0.481	0.708	0.695	0.472
NSGA-II-ACC	7	68.38	0.684	0.588	0.704	0.684	0.597	0.203	0.548	0.580	0.562
NSGA-II-AUC	7	66.38	0.664	0.175	0.830	0.664	0.661	0.497	0.744	0.715	0.580
NSGA-II-RMSE	7	68.71	0.687	0.591	0.737	0.687	0.595	0.226	0.548	0.583	0.559
PART	47	94.01	0.940	0.087	0.940	0.940	0.940	0.866	0.980	0.979	0.218
JRip	1	65.72	0.657	0.657	0.432	0.657	0.521	0.000	0.500	0.549	0.475
OneR	1	65.72	0.657	0.657	0.432	0.657	0.521	0.000	0.500	0.549	0.585
ZeroR	-	65.72	0.657	0.657	0.432	0.657	0.521	0.000	0.500	0.549	0.475

#	Attribute Name	Type	Possible Values
1	buying	categorical	vhigh, high, med, low
2	maint	categorical	vhigh, high, med, low
3	doors	categorical	2, 3, 4, 5-more
4	persons	categorical	2, 4, more
5	lug_boot	categorical	small, med, big
6	safety	categorical	low, med, high
7	class	categorical	unacc, acc, good, vgood

#	Attribute Name	Type	Possible Values
1	bkblk	categorical	t, f
2	bknwy	categorical	t, f
3	bkon8	categorical	t, f
4	bkona	categorical	t, f
5	bkspr	categorical	t, f
6	bkxbq	categorical	t, f
7	bkxcr	categorical	t, f
8	bkxwp	categorical	t, f
9	blxwp	categorical	t, f
10	bxqsq	categorical	t, f
11	cntxt	categorical	t, f
12	dsopp	categorical	t, f
13	dwipd	categorical	g, l
14	hdchk	categorical	t, f
15	katri	categorical	b, n, w
16	mulch	categorical	t, f
17	qxmsq	categorical	t, f
18	r2ar8	categorical	t, f
19	reskd	categorical	t, f
20	reskr	categorical	t, f
21	rimmx	categorical	t, f
22	rkxwp	categorical	t, f
23	rxmsq	categorical	t, f
24	simpl	categorical	t, f
25	skach	categorical	t, f
26	skewr	categorical	t, f
27	skrxp	categorical	t, f
28	spcop	categorical	t, f
29	stlmt	categorical	t, f
30	thrsk	categorical	t, f
31	wkcti	categorical	t, f
32	wkna8	categorical	t, f
33	wknck	categorical	t, f
34	wkovl	categorical	t, f
35	wkpos	categorical	t, f
36	wtoeg	categorical	n, t, f
37	class	categorical	won, nowin

#	Attribute Name	Type	Possible Values
1	parents	categorical	usual, pretentious, great_pret
2	has_nurs	categorical	proper, less_proper, improper, critical, very_crit
3	form	categorical	complete, completed, incomplete, foster
4	children	categorical	1, 2, 3, more
5	housing	categorical	convenient, less_conv, critical
6	finance	categorical	convenient, inconv
7	social	categorical	nonprob, slightly_prob, problematic
8	health	categorical	recommended, priority, not_recom
9	class	categorical	not_recom, recommend, very_recom, priority, spec_prior

Learning Model	Number of Rules	Percent Correct	TP Rate	FP Rate	Precision	Recall	F-Measure	MCC	ROC Area	PRC Area	RMSE
Monk’s problem 2
ENORA-ACC	7	77.70	0.777	0.360	0.777	0.777	0.762	0.481	0.708	0.695	0.472
PART	47	79.53	0.795	0.253	0.795	0.795	0.795	0.544	0.884	0.893	0.380
JRip	1	62.90	0.629	0.646	0.526	0.629	0.535	−0.034	0.478	0.537	0.482
OneR	1	65.72	0.657	0.657	0.432	0.657	0.521	0.000	0.500	0.549	0.586
ZeroR	-	65.72	0.657	0.657	0.432	0.657	0.521	0.000	0.491	0.545	0.457
Tic-Tac-Toe-Endgame
ENORA-ACC/RMSE	2	98.33	0.983	0.031	0.984	0.983	0.983	0.963	0.976	0.973	0.129
PART	49	94.26	0.943	0.076	0.942	0.943	0.942	0.873	0.974	0.969	0.220
JRip	9	97.81	0.978	0.031	0.978	0.978	0.978	0.951	0.977	0.977	0.138
OneR	1	69.94	0.699	0.357	0.701	0.699	0.700	0.340	0.671	0.651	0.548
ZeroR	-	65.35	0.653	0.653	0.427	0.653	0.516	0.000	0.496	0.545	0.476
Car
ENORA-RMSE	14	86.57	0.866	0.089	0.866	0.866	0.846	0.766	0.889	0.805	0.259
PART	68	95.78	0.958	0.016	0.959	0.958	0.958	0.929	0.990	0.979	0.1276
JRip	49	86.46	0.865	0.064	0.881	0.865	0.870	0.761	0.947	0.899	0.224
OneR	1	70.02	0.700	0.700	0.490	0.700	0.577	0.000	0.500	0.543	0.387
ZeroR	-	70.02	0.700	0.700	0.490	0.700	0.577	0.000	0.497	0.542	0.338
kr-vs-kp
ENORA-RMSE	10	94.87	0.949	0.050	0.950	0.949	0.949	0.898	0.950	0.927	0.227
PART	23	99.06	0.991	0.010	0.991	0.991	0.991	0.981	0.997	0.996	0.088
JRip	16	99.19	0.992	0.008	0.992	0.992	0.992	0.984	0.995	0.993	0.088
OneR	1	66.46	0.665	0.350	0.675	0.665	0.655	0.334	0.657	0.607	0.579
ZeroR	-	52.22	0.522	0.522	0.273	0.522	0.358	0.000	0.499	0.500	0.500
Nursery
ENORA-ACC	15	88.41	0.884	0.055	0.870	0.884	0.873	0.824	0.915	0.818	0.2153
PART	220	99.21	0.992	0.003	0.992	0.992	0.992	0.989	0.999	0.997	0.053
JRip	131	96.84	0.968	0.012	0.968	0.968	0.968	0.957	0.993	0.974	0.103
OneR	1	70.97	0.710	0.137	0.695	0.710	0.702	0.570	0.786	0.632	0.341
ZeroR	-	33.33	0.333	0.333	0.111	0.333	0.167	0.000	0.500	0.317	0.370

ML	Machine learning
ANN	Artificial neural networks
DLNN	Deep learning neural networks
CEO	Chief executive officer
SVM	Support vector machines
IBL	Instance-based learning
DT	Decision trees
RBC	Rule-based classifiers
ROC	Receiver operating characteristic
RMSE	Root mean square error performance metric
FL	Fuzzy logic
MOEA	Multi-objective evolutionary algorithms
NSGA-II	Non-dominated sorting genetic algorithm, 2nd version
ENORA	Evolutionary non-dominated radial slots based algorithm
PART	Partial decision tree classifier
JRip	RIPPER classifier of Weka
RIPPER	Repeated incremental pruning to produce error reduction
OneR	One rule classifier
ZeroR	Zero rule classifier
ENORA-ACC	ENORA with objective function defined as accuracy
ENORA-AUC	ENORA with objective function defined as area under the ROC curve
ENORA-RMSE	ENORA with RMSE objective function
NSGA-II-ACC	NSGA-II with objective function defined as accuracy
NSGA-II-AUC	NSGA-II with objective function defined as area under the ROC curve
NSGA-II-RMSE	NSGA-II with RMSE objective function
HVR	Hypervolume ratio
TP	True positive
FP	False positive
MCC	Matthews correlation coefficient
PRC	Precision-recall curve

Algorithm	p-Value	Null Hypothesis
ENORA-ACC	0.5316	Not Rejected
ENORA-AUC	0.3035	Not Rejected
ENORA-RMSE	0.7609	Not Rejected
NSGA-II-ACC	0.1734	Not Rejected
NSGA-II-AUC	0.3802	Not Rejected
NSGA-II-RMSE	0.6013	Not Rejected
PART	0.0711	Not Rejected
JRip	0.5477	Not Rejected
OneR	0.316	Not Rejected
ZeroR	$3.818 \times 10^{-6}$	Rejected

	ENORA-ACC	ENORA-AUC	ENORA-RMSE	NSGA-II-ACC	NSGA-II-AUC	NSGA-II-RMSE	PART	JRip	OneR
ENORA-AUC	0.2597	-	-	-	-	-	-	-	-
ENORA-RMSE	0.9627	0.9627	-	-	-	-	-	-	-
NSGA-II-ACC	0.9981	0.8047	1.0000	-	-	-	-	-	-
NSGA-II-AUC	0.2951	1.0000	0.9735	0.8386	-	-	-	-	-
NSGA-II-RMSE	1.0000	0.2169	0.9436	0.9960	0.2486	-	-	-	-
PART	0.1790	1.0000	0.9186	0.6997	1.0000	0.1461	-	-	-
JRip	0.9909	0.8956	1.0000	1.0000	0.9186	0.9840	0.8164	-	-
OneR	0.0004	0.6414	0.0451	0.0108	0.5961	0.0002	0.7546	0.0212	-
ZeroR	0.2377	1.0000	0.9538	0.7803	1.0000	0.1973	1.0000	0.8783	0.6709

Algorithm	p-Value	Null Hypothesis
ENORA-ACC	0.6807	Not Rejected
ENORA-AUC	0.3171	Not Rejected
ENORA-RMSE	0.6125	Not Rejected
NSGA-II-ACC	0.0871	Not Rejected
NSGA-II-AUC	0.5478	Not Rejected
NSGA-II-RMSE	0.6008	Not Rejected
PART	0.6066	Not Rejected
JRip	0.2978	Not Rejected
OneR	0.4531	Not Rejected
ZeroR	0.0000	Rejected

Algorithm	p-Value	Null Hypothesis
ENORA-ACC	$5.042 \times 10^{-5}$	Rejected
ENORA-AUC	$2.997 \times 10^{-7}$	Rejected
ENORA-RMSE	$4.762 \times 10^{-4}$	Rejected
NSGA-II-ACC	$4.88 \times 10^{-6}$	Rejected
NSGA-II-AUC	$2.339 \times 10^{-7}$	Rejected
NSGA-II-RMSE	$2.708 \times 10^{-6}$	Rejected
PART	0.3585	Not Rejected
JRip	$9.086 \times 10^{-3}$	Rejected
OneR	$1.007 \times 10^{-7}$	Rejected
ZeroR	0.0000	Rejected

	ENORA-ACC	ENORA-AUC	ENORA-RMSE	NSGA-II-ACC	NSGA-II-AUC	NSGA-II-RMSE	PART	JRip	OneR
ENORA-AUC	0.9998	-	-	-	-	-	-	-	-
ENORA-RMSE	0.0053	0.0004	-	-	-	-	-	-	-
NSGA-II-ACC	0.3871	0.0942	0.8872	-	-	-	-	-	-
NSGA-II-AUC	0.8872	0.4894	0.3871	0.9988	-	-	-	-	-
NSGA-II-RMSE	$4.1 \times 10^{-5}$	$1.3 \times 10^{-6}$	0.9860	0.2169	0.0244	-	-	-	-
PART	$4.7 \times 10^{-9}$	$5.6 \times 10^{-11}$	0.1973	0.0013	$3.3 \times 10^{-5}$	0.8689	-	-	-
JRip	0.2712	0.6997	$1.2 \times 10^{-8}$	$7.0 \times 10^{-5}$	0.0025	$6.3 \times 10^{-12}$	$6.9 \times 10^{-14}$	-	-
OneR	0.0062	0.0546	$1.5 \times 10^{-12}$	$5.5 \times 10^{-8}$	$5.5 \times 10^{-6}$	$8.3 \times 10^{-14}$	$8.3 \times 10^{-14}$	0.9584	-
ZeroR	$1.9 \times 10^{-5}$	0.0004	$7.3 \times 10^{-14}$	$8.6 \times 10^{-12}$	$2.3 \times 10^{-9}$	$8.5 \times 10^{-14}$	$< 2 \times 10^{-16}$	0.2377	0.9584

Algorithm	p-Value	Null Hypothesis
ENORA-ACC	0.6543	Not Rejected
ENORA-AUC	0.6842	Not Rejected
ENORA-RMSE	0.0135	Rejected
NSGA-II-ACC	0.979	Not Rejected
NSGA-II-AUC	0.382	Not Rejected
NSGA-II-RMSE	0.0486	Rejected
PART	0.5671	Not Rejected
JRip	0.075	Rejected
OneR	$4.672 \times 10^{-6}$	Rejected
ZeroR	$4.672 \times 10^{-6}$	Rejected

	ENORA-ACC	ENORA-AUC	ENORA-RMSE	NSGA-II-ACC	NSGA-II-AUC	NSGA-II-RMSE	PART	JRip	OneR
ENORA-AUC	0.8363	-	-	-	-	-	-	-	-
ENORA-RMSE	1.0000	0.9471	-	-	-	-	-	-	-
NSGA-II-ACC	0.1907	0.9902	0.3481	-	-	-	-	-	-
NSGA-II-AUC	0.0126	0.6294	0.0342	0.9958	-	-	-	-	-
NSGA-II-RMSE	0.0126	0.6294	0.0342	0.9958	1.0000	-	-	-	-
PART	0.8714	1.0000	0.9631	0.9841	0.5769	0.5769	-	-	-
JRip	$2.1 \times 10^{-6}$	0.0048	$1.0 \times 10^{-5}$	0.1341	0.6806	0.6806	0.0036	-	-
OneR	0.0001	0.0743	0.0006	0.6032	0.9875	0.9875	0.0601	0.9984	-
ZeroR	0.0001	0.0743	0.0006	0.6032	0.9875	0.9875	0.0601	0.9984	1.0000

Algorithm	p-Value	Null Hypothesis
ENORA-ACC	0.4318	Not Rejected
ENORA-AUC	0.7044	Not Rejected
ENORA-RMSE	0.0033	Rejected
NSGA-II-ACC	0.3082	Not Rejected
NSGA-II-AUC	0.0243	Rejected
NSGA-II-RMSE	0.7802	Not Rejected
PART	0.1641	Not Rejected
JRip	0.3581	Not Rejected
OneR	0.0000	Rejected
ZeroR	0.0000	Rejected

Algorithm	p-Value	Null Hypothesis
ENORA-ACC	$4.08 \times 10^{-5}$	Rejected
ENORA-AUC	0.0002	Rejected
ENORA-RMSE	0.0094	Rejected
NSGA-II-ACC	0.0192	Rejected
NSGA-II-AUC	0.0846	Rejected
NSGA-II-RMSE	0.0037	Rejected
PART	0.9721	Not Rejected
JRip	0.0068	Rejected
OneR	0.0000	Rejected
ZeroR	0.0000	Rejected

Symbol	Definition
Equation (1): Multi-objective constrained optimization
$x_{k}$	k-th decision variable
$x$	Set of decision variables
$f_{i} (x)$	i-th objective function
$g_{j} (x)$	j-th constraint
$l > 0$	Number of objectives
$m > 0$	Number of constraints
$w > 0$	Number of decision variables
$X$	Domain for each decision variable $x_{k}$
$X^{w}$	Domain for the set of decision variables
$F$	Set of all feasible solutions
$S$	Set of non-dominated solutions or Pareto optimal set
$D (x^{'}, x)$	Pareto domination function
Equation (2): Rule-based classification for categorical data
$D$	Dataset
$x_{i}$	i-th categorical input attribute in the dataset $D$
$x$	Categorical input attributes in the dataset $D$
y	Categorical output attribute in the dataset $D$
$\{1, \dots, v_{i}\}$	Domain of i-th categorical input attribute in the dataset $D$
$\{1, \dots, w\}$	Domain of categorical output attribute in the dataset $D$
$p \geq 0$	Number of categorical input attributes in the dataset $D$
$Γ$	Rule-based classifier
$R_{i}^{Γ}$	i-th rule of classifier $Γ$
$b_{i j}^{Γ}$	Category for j-th categorical input attribute and i-th rule of classifier $Γ$
$c_{i}^{Γ}$	Category for categorical output attribute and i-th rule of classifier $Γ$
$φ_{i}^{Γ} (x)$	Compatibility degree of the i-th rule of classifier $Γ$ for the example $x$
$μ_{i j}^{Γ} (x)$	Result of the i-th rule of classifier $Γ$ and j-th categorical input attribute $x_{j}$
$λ_{c}^{Γ} (x)$	Association degree of classifier $Γ$ for the example $x$ with the class c
$η_{i c}^{Γ} (x)$	Result of the i-th rule of classifier $Γ$ for the example $x$ with the class c
$f_{Γ} (x)$	Classification or output of the classifier $Γ$ for the example $x$
Equation (3): Multi-objective constrained optimization problem for rule-based classification
$F_{D} (Γ)$	Performance objective function of the classifier $Γ$ in the dataset $D$
$NR (Γ)$	Number of rules of the classifier $Γ$
$M_{max}$	Maximum number of rules allowed for classifiers
Equations (4)–(6): Optimization models
${ACC}_{D} (Γ)$	Accuracy: proportion of correctly classified instances with the classifier $Γ$ in the dataset $D$
K	Number of instances in the dataset $D$
$T_{D} (Γ, i)$	Result of the classification of the i-th instance in the dataset $D$ with the classifier $Γ$
${\hat{c}}_{i}^{Γ}$	Predicted value of the i-th instance in the dataset $D$ with the classifier $Γ$
$c_{D}^{i}$	Corresponding true value for the i-th instance in the dataset $D$
${AUC}_{D} (Γ)$	Area under the ROC curve obtained with the classifier $Γ$ in the dataset $D$
$S_{D} (Γ, t)$	Sensitivity: proportion of positive instances classified as positive with the classifier $Γ$ in the dataset $D$
$1 - E_{D} (Γ, t)$	Specificity: proportion of negative instances classified as negative with the classifier $Γ$ in the dataset $D$
t	Discrimination threshold
${RMSE}_{D} (Γ)$	Square root of the mean square error obtained with the classifier $Γ$ in the dataset $D$

Equations (7) and (8): Hypervolume metric
P	Population
$Q \subseteq P$	Set of non-dominated individuals of P
$v_{i}$	Volume of the search space dominated by the individual i
$HV (P)$	Hypervolume: volume of the search space dominated by population P
$H (P)$	Volume of the search space non-dominated by population P
$HVR (P)$	Hypervolume ratio: ratio of $H (P)$ over the volume of the entire search space
$VS$	Volume of the search space
$F_{D}^{lower}$	Minimum value for objective $F_{D}$
$F_{D}^{upper}$	Maximum value for objective $F_{D}$
${NR}^{lower}$	Minimum value for objective $NR$
${NR}^{upper}$	Maximum value for objective $NR$
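
The hypervolume symbols above can be illustrated for the two-objective case with a short sketch. It relies only on the verbal definitions in this table, since Equations (7) and (8) are not reproduced here, and it assumes that both objectives are minimized (a maximized metric such as accuracy would first be converted, for example to an error rate), that all points lie inside the box given by the lower and upper bounds, and that $H (P) = VS - HV (P)$. Function names and the sample front are hypothetical.

```python
def hypervolume_2d(front, ref):
    """Area dominated by `front` and bounded by `ref` (both objectives minimized)."""
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(front):              # ascending in the first objective
        if f1 < ref[0] and f2 < prev_f2:      # skip points that add no new area
            area += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return area


def hypervolume_ratio(front, lower, upper):
    """H(P) / VS, with H(P) = VS - HV(P) taken as the non-dominated volume."""
    vs = (upper[0] - lower[0]) * (upper[1] - lower[1])
    hv = hypervolume_2d(front, ref=upper)
    return (vs - hv) / vs


# Hypothetical front in a 4 x 4 search space.
front = [(1, 3), (2, 1)]
print(hypervolume_2d(front, ref=(4, 4)))
print(hypervolume_ratio(front, lower=(0, 0), upper=(4, 4)))
```

Under these assumptions a smaller ratio indicates a better front, because less of the search space is left non-dominated.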