Abstract
This paper presents an improved teaching-learning-based optimization (TLBO) algorithm, called RLTLBO, for solving optimization problems. First, a new learning mode that considers the effect of the teacher is presented. Second, the Q-Learning method from reinforcement learning (RL) is introduced to build a switching mechanism between the two learning modes in the learner phase. Finally, random opposition-based learning (ROBL) is adopted after both the teacher and learner phases to improve the local optima avoidance ability of RLTLBO. These strategies effectively enhance the convergence speed and accuracy of the proposed algorithm. RLTLBO is evaluated on 23 standard benchmark functions and eight CEC2017 test functions to verify its optimization performance. The results reveal that the proposed algorithm is effective and efficient in solving the benchmark test functions. Moreover, RLTLBO is also applied to eight industrial engineering design problems. Compared with the basic TLBO and seven state-of-the-art algorithms, the results illustrate that RLTLBO has superior performance and promising prospects for dealing with real-world optimization problems. The source code of RLTLBO is publicly available at https://github.com/WangShuang92/RLTLBO.
1. Introduction
In recent years, real-world optimization problems have become increasingly complex and diverse across a wide range of fields and disciplines. Traditional (mathematical) optimization methods, such as Newton's method and gradient descent, can no longer meet the needs of current optimization problems. Thus, nontraditional methods, especially metaheuristic algorithms, are becoming increasingly pervasive among researchers [1–3]. Metaheuristics are algorithms based on intuition or experience that can provide a feasible solution at an acceptable cost (in computing time and computational resources), although the deviation between the feasible solution and the optimal solution may not be predictable in advance. Metaheuristic optimization algorithms have the merits of being flexible, having few parameters, and avoiding local optima. Additionally, they can be rapidly deployed and thus have been utilized for solving various optimization problems over the past decades [4, 5]. Some of the most representative metaheuristic algorithms are the genetic algorithm (GA) [6], differential evolution (DE) [7], simulated annealing (SA) [8], the arithmetic optimization algorithm (AOA) [9], the heat transfer relation-based optimization algorithm (HTOA) [10], particle swarm optimization (PSO) [11], the salp swarm algorithm (SSA) [12], the grey wolf optimizer (GWO) [13], the whale optimization algorithm (WOA) [14], the aquila optimizer (AO) [15], and the remora optimization algorithm (ROA) [16].
Teaching-learning-based optimization (TLBO) is a meta-heuristic algorithm proposed by Rao et al. in 2011 [17]. The TLBO method is inspired by the teaching-learning process in a class and simulates the influence of a teacher on learners. Due to its advantages of rapid convergence, absence of algorithm-specific parameters, and easy implementation, TLBO has become a popular optimization algorithm and has been successfully applied to real-world problems in diverse fields. Aouf et al. [18] applied TLBO to optimize the parameters of an ANFIS structure to obtain the optimal trajectory and traveling time, addressing the navigation problem of a mobile robot in an unknown environment. Singh et al. [19] studied the application of TLBO to the optimal coordination of directional overcurrent relays (DOCRs) in a looped power system. Multiobjective TLBO was applied to the motif discovery problem (MDP) in bioinformatics by Gonzalez-Alvarez et al. [20] and obtained better solutions than other biology-based multiobjective evolutionary algorithms. All the above applications suggest that TLBO can be effectively applied to many optimization problems in various fields.
Improvements and hybridizations of TLBO, together with their applications, have also been studied by several researchers [21]. Kumar and Singh [22] developed a chaotic version of TLBO with different chaotic mechanisms. A local search method was also incorporated to guide the search direction between local and global search and to improve solution quality. Its application to clustering problems demonstrated the effectiveness of this algorithm. Taheri et al. [23] proposed a balanced TLBO with three modifications, called BTLBO. A weighted mean replaced the mean value in the teacher phase to maintain diversity. A tutoring phase was added as a powerful local search mechanism for exploiting regions around the best solution. A restarting phase was introduced to improve the exploration ability by replacing inactive learners with randomly initialized learners. Ma et al. [24] proposed a modified TLBO (MTLBO) by introducing a population group mechanism into the basic TLBO. All students were divided into two groups and updated by different updating strategies. The MTLBO was also applied to establish the NOx emission model of a circulating fluidized bed boiler. Xu et al. [25] introduced a dynamic-opposite learning (DOL) strategy into TLBO to overcome premature convergence. The asymmetric search space and the dynamically changing characteristics of DOL help DOLTLBO to holistically improve its exploitation and exploration capabilities. Dong et al. [26] presented a KTLBO algorithm for computationally expensive constrained optimization. A kriging-assisted two-phase optimization framework was used to alternately conduct global and local searches, thereby accelerating the search. KTLBO was also adopted to design the structure of a blended-wing-body underwater glider. Ren et al. [27] developed a multiobjective elitist feedback TLBO (MEFTO) for multiobjective optimization problems. The elitism strategy was used to store the best solutions obtained thus far.
The proposed feedback phase allowed students to choose whether to study directly with the teacher or to motivate themselves, providing a novel way for students to improve themselves. Zhang et al. [28] proposed a hybrid algorithm based on TLBO and a neural network algorithm (NNA) named TLNNA to solve engineering optimization problems. The experimental results suggested that TLNNA has improved global search ability and fast convergence speed. By considering the features of the WOA and TLBO, Lakshmi and Mohanaiah [29] proposed a hybrid WOA-TLBO algorithm. This was also applied to solve the facial emotion recognition (FER) functional problem, and the reported results showed its effectiveness and high accuracy.
The TLBO variants proposed previously have improved search ability and accelerated the convergence process, but they still struggle with premature convergence and insufficient learning processes. Thus, in this paper, we propose an improved TLBO algorithm to solve industrial engineering optimization problems. Given the characteristics of TLBO, reinforcement learning (RL) from machine learning is introduced into the learner phase, enabling the algorithm to choose a more suitable learning mode, which trains the search agents to perform more beneficial actions. In addition, a random opposition-based learning (ROBL) strategy is added after the whole learner phase to accelerate convergence and avoid local optima. The proposed improved TLBO with RL and ROBL strategies is called RLTLBO. The standard and CEC2017 benchmark functions and eight engineering design problems are used to test the exploration and exploitation capabilities of the proposed method. The RLTLBO algorithm is compared with several existing algorithms: the basic TLBO and the salp swarm algorithm (SSA), which are classical algorithms; the aquila optimizer (AO), Harris hawks optimization (HHO) [30], and the horse herd optimization algorithm (HOA) [31], which are recently proposed methods; and the memory-based grey wolf optimizer (mGWO) [32], the modified ant lion optimizer (MALO) [33], and the dynamic sine cosine algorithm (DSCA) [34], which are recent improved algorithms. The experimental results show that the proposed RLTLBO method is superior to these state-of-the-art algorithms in exploration and exploitation capabilities. Moreover, eight industrial engineering design problems are used to evaluate the effectiveness of the algorithm in solving real-world optimization problems.
The rest of this paper is organized as follows: Section 2 provides a brief overview of the basic TLBO, RL, and ROBL strategies. Section 3 describes the proposed RLTLBO algorithm in detail. Simulations, experiments and an analysis of the results are presented in Section 4. Section 5 describes industrial engineering design problems. Finally, Section 6 concludes the paper.
2. Related Work
2.1. Teaching-Learning-Based Optimization
The TLBO algorithm mimics the influence of a teacher on the output of learners, which can be reflected by learners' grades. As a highly learned person, the teacher gives their knowledge to the learners. The outcome of the learners is affected by the quality of the teacher. It is obvious that learners trained by a good teacher can achieve better results in terms of their grades. The optimization process of TLBO is divided into two phases: the teacher phase and the learner phase.
2.1.1. Teacher Phase
The teacher phase simulates the teaching process of a teacher. The best one in the class is selected as the teacher, and then the teacher tries their best to improve the overall level of the class. The teaching process can be formulated as follows:
Xnew = Xold + rand × (Xteacher − TF × Mean)  (1)
where Xnew and Xold represent the positions of an individual after and before learning, that is, the candidate solutions after and before updating. Xteacher is the position of the teacher, which is the best individual in the population. Mean indicates the average position of the search agents in the population. TF is a teaching factor that determines the change of the mean value, and rand is a random number between 0 and 1. The value of TF can be either 1 or 2; it is a heuristic step chosen randomly with equal probability as TF = round(1 + rand(0, 1)(2 − 1)).
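The teacher phase update of equation (1) can be sketched as follows; the population size, dimension, and sphere objective are illustrative assumptions:

```python
import numpy as np

def teacher_phase(population, fitness, rng):
    """One TLBO teacher phase step per equation (1):
    Xnew = Xold + rand * (Xteacher - TF * Mean)."""
    teacher = population[np.argmin(fitness)]   # best individual acts as the teacher
    mean = population.mean(axis=0)             # average position of the class
    tf = rng.integers(1, 3)                    # teaching factor: 1 or 2 with equal probability
    r = rng.random(population.shape)           # rand in [0, 1)
    return population + r * (teacher - tf * mean)

# Usage on a small random class of 5 learners in 3 dimensions:
rng = np.random.default_rng(0)
pop = rng.uniform(-10, 10, size=(5, 3))
fit = (pop ** 2).sum(axis=1)                   # sphere fitness as a stand-in objective
new_pop = teacher_phase(pop, fit, rng)
print(new_pop.shape)  # (5, 3)
```

In practice the new positions are evaluated and accepted only if they improve the fitness, as in the learner phase.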
2.1.2. Learner Phase
In addition to learning new knowledge from the teacher, learners can also increase their knowledge through interaction. In the mutual learning process, a learner randomly interacts with another learner and learns from the one with the better grade. The expression of the learner phase can be written as follows:
Xnew = Xold + rand × (Xr1 − Xr2), if f(Xr1) < f(Xr2)
Xnew = Xold + rand × (Xr2 − Xr1), otherwise  (2)
where Xr1 and Xr2 indicate the positions of two learners randomly selected from the population, and f (·) is the fitness value. The comparison between the two learners determines the learning direction: the individual with a poor grade learns from the individual with a better grade. The new individual is accepted if it improves after learning; otherwise, it is rejected.
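A minimal sketch of the learner phase of equation (2) with greedy acceptance, again using a sphere objective as an assumed stand-in:

```python
import numpy as np

def learner_phase(population, f, rng):
    """One TLBO learner phase step per equation (2): each learner moves
    toward the better of two randomly chosen peers and away from the worse,
    keeping the new position only if it improves the fitness."""
    n, dim = population.shape
    new_pop = population.copy()
    for i in range(n):
        r1, r2 = rng.choice(n, size=2, replace=False)  # two distinct random learners
        x1, x2 = population[r1], population[r2]
        if f(x1) < f(x2):                              # minimization: x1 has the better grade
            cand = population[i] + rng.random(dim) * (x1 - x2)
        else:
            cand = population[i] + rng.random(dim) * (x2 - x1)
        if f(cand) < f(population[i]):                 # greedy acceptance
            new_pop[i] = cand
    return new_pop

rng = np.random.default_rng(1)
sphere = lambda x: float((x ** 2).sum())
pop = rng.uniform(-10, 10, size=(6, 4))
out = learner_phase(pop, sphere, rng)
print(out.shape)  # (6, 4)
```

The greedy acceptance guarantees that no learner's fitness deteriorates in this step.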
The flow chart of the TLBO algorithm is shown in Figure 1.
Figure 1.

The flowchart of TLBO.
2.2. Reinforcement Learning (RL)
Machine learning algorithms are also widely used to solve various optimization problems [35]. Machine learning methods generally fall into four categories, as shown in Figure 2: supervised learning, unsupervised learning, semisupervised learning, and reinforcement learning (RL). In RL algorithms, the agent is trained to learn optimal actions in a complex environment. The agent is trained in different ways and uses its training experience in subsequent actions. RL methods generally consist of model-free and model-based approaches. The model-free approaches can be divided into two subgroups: value-based and policy-based methods. The value-based algorithms are convenient for coordinating with meta-heuristic algorithms because they are model-free and policy-free, providing higher flexibility [36]. In value-based RL approaches, the reinforcement agent learns from its actions and experience in the environment, such as through rewards and penalties. The agent measures the success of an action in completing the task goal through the reward or penalty and then makes a decision based on its achievement.
Figure 2.

Classification of the reinforcement learning algorithms.
The Q-Learning method is one of the representative value-based RL algorithms. In Q-Learning, the agent takes random actions and then obtains a reward or penalty. Experience is gradually constructed based on the agent's actions. Throughout the process of building experience, a table called the Q-Table is maintained [37]. The agent considers all possible actions and updates its state according to the Q-Table values, selecting the action that maximizes the reward for the current state. Therefore, the agent's action determines whether it explores or exploits the environment.
Compared with RL methods, meta-heuristic algorithms often require deep expert knowledge to establish the balance between different phases. RL methods can help discover optimal parameter designs and more balanced strategies, allowing the algorithm to switch between the exploration and exploitation phases. Metaheuristic methods usually operate with specific policies in certain situations, and thus their dynamism is lower than that of RL algorithms, especially value-based methods. The agent in value-based methods is online and performs beneficial actions through a reward-penalty mechanism without following any policy. Much research on combining meta-heuristics and RL has been presented in the literature [38–44].
2.3. Random Opposition-Based Learning (ROBL)
Random opposition-based learning (ROBL) is a variant of opposition-based learning (OBL) [45] proposed by Long et al. in 2019 [46]. OBL is a powerful optimization tool that simultaneously considers the fitness of an estimate and its corresponding opposite estimate to achieve a better candidate solution. In contrast to the basic OBL, ROBL utilizes a random term to improve the OBL strategy, which is defined as follows:
xrobl,j = lj + uj − rand × xj,  j = 1, 2, …, n  (3)
where xrobl,j and xj indicate the opposite and original solutions, and uj and lj are the upper and lower bounds of the problem in the jth dimension. The opposite solution is randomly selected in the opposite half of the search space. This solution is not only opposite but also random, with a wider range of distributions. An example of ROBL solutions is shown in Figure 3. The opposite solution with a random term described by equation (3) is more stochastic than that of the basic OBL and can effectively help the algorithm jump out of local optima.
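Equation (3) can be sketched in a few lines; the bounds and population size below are illustrative assumptions:

```python
import numpy as np

def robl(x, lower, upper, rng):
    """Random opposition-based learning per equation (3): the opposite point
    l_j + u_j - rand_j * x_j; with rand_j = 1 it reduces to the basic OBL point."""
    return lower + upper - rng.random(x.shape) * x

rng = np.random.default_rng(2)
x = rng.uniform(-10, 10, size=(4, 3))       # 4 candidate solutions in 3 dimensions
x_robl = robl(x, -10.0, 10.0, rng)
x_obl = -10.0 + 10.0 - x                    # basic OBL opposites, for comparison
print(x_robl.shape)  # (4, 3)
```

Because each dimension is scaled by an independent random number, the ROBL points scatter around the basic OBL points, which is the wider distribution shown in Figure 3.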
Figure 3.

Example of ROBL solutions. Three sets of solutions (original solution, corresponding opposite solution (xobl), and random opposite solution (xrobl)) are labeled in a two-dimensional search space. The random opposite solutions are not only in the symmetric positions but are also distributed over a wider range.
3. The Proposed RLTLBO Algorithm
3.1. New Learning Mode
The basic TLBO algorithm performs the learner phase after the teacher phase in each iteration. The search agent learns from other individuals in the learner phase. However, in the actual learning process, the way students learn from each other varies from person to person. Different students might choose different learning modes, such as formal communication, group discussion, or presentations. Moreover, students might adjust their learning mode according to their learning situation during the learning process. Therefore, in this paper, we introduce another learning mode to diversify the learning methods of the students, which can be described as follows:
| (4) |
where Xr3 is the position of a learner randomly selected from the population. t and T are the current and maximum number of iterations.
In this learning mode, the effect of the teacher is introduced. Mutual learning between students is not always beneficial, and the partial intervention of the teacher is sometimes more helpful for the students' improvement. Students not only learn from each other but also ask the teacher for help. At the beginning of the iterations, the weight of mutual learning among students is larger, and the algorithm pays more attention to random learning, which maintains population diversity and increases global searchability. In the later iteration stage, students consult the teacher more and approach the teacher, enhancing the algorithm's local searchability.
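The exact form of equation (4) is not reproduced here; purely to illustrate the iteration-dependent weighting this paragraph describes, the sketch below blends a random-peer term and a teacher term with an assumed linear schedule w = t/T (this is an illustration, not the paper's equation):

```python
import numpy as np

def mode2_step(x, x_r3, x_teacher, t, T, rng):
    # Illustration only: early iterations (small t) weight the random peer X_r3,
    # late iterations weight the teacher. The linear schedule w = t/T is assumed.
    w = t / T
    return x + (1 - w) * rng.random(x.shape) * (x_r3 - x) \
             + w * rng.random(x.shape) * (x_teacher - x)

rng = np.random.default_rng(3)
x, peer, teacher = rng.random((3, 5))          # one learner, a random peer, the teacher
early = mode2_step(x, peer, teacher, t=1, T=500, rng=rng)
late = mode2_step(x, peer, teacher, t=499, T=500, rng=rng)
print(early.shape, late.shape)
```

At t near 0 the step is dominated by the peer term (exploration); at t near T it is dominated by the teacher term (exploitation), matching the behavior described above.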
3.2. Learner Phase with RL Strategy
To enable students to adjust their learning mode more effectively, Q-Learning from RL is introduced to handle the switching between the two learning modes. The student uses the Q-Table values as a guide to decide between the different learning modes. The Q-Table is updated using a reward-penalty mechanism. The student selects the best state by calculating the benefit of each possible state and taking the learning mode with the highest Q-value for the next step. The student obtains a reward or a penalty according to its actions after each step. The general pattern of the RL agent and environment framework is shown in Figure 4.
Figure 4.

Reinforcement learning agent and environment framework. at represents the current action; st and st+1 indicate the current and the next state, and rt and rt+1 indicate the current and the next reward, respectively.
In the Q-Learning method, a reward table, which can be provided by the user, is used to reward or penalize the agent for its state-action pairs. The reward table in this work contains a positive (+1) or negative (−1) reward for each state and action pair. The Q-Table can be considered the agent's experience and is initialized to zero for all entries. The student then updates the Q-Table using the Bellman equation (5) and prepares the Q-Table for the next iteration [44].
Qt+1(st, at) = Qt(st, at) + λ[rt+1 + γ · maxa Qt(st+1, a) − Qt(st, at)]  (5)
where st and st+1 indicate the current and the next state, respectively, Qt and Qt+1 are the current Q-value and the pre-estimated Q-value for the next state st+1, and at represents the current action. λ and γ are the learning rate and discount factor, respectively, both numbers between 0 and 1. The learning rate determines how fast the algorithm learns and controls the convergence of the learning process. The discount factor defines how much the algorithm learns from its mistakes and controls the importance of future rewards. rt+1 indicates the immediate reward or penalty the agent receives for taking the current action.
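The Bellman update of equation (5) for a two-state, two-action Q-Table (one state per learning mode) can be sketched as follows; the λ and γ values here are illustrative assumptions:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, lam=0.1, gamma=0.9):
    """Bellman update of equation (5); lam (learning rate) and gamma
    (discount factor) values are illustrative assumptions."""
    Q[s, a] += lam * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q

Q = np.zeros((2, 2))                         # all entries start at zero
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)   # reward +1 for this state-action pair
print(Q[0, 1])  # 0.1
```

Starting from an all-zero table, a single +1 reward moves the corresponding entry by λ × 1 = 0.1, and future updates propagate rewards backward through the γ-discounted maximum term.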
In each iteration, the agent uses equation (5) to calculate and weigh each possible state and action for the next step before choosing the best action (learning mode 1 or learning mode 2), that is, the one with the highest likelihood of approaching the optimal solution. Examples of the reward table and Q-Table are displayed in Figure 5. This RL strategy helps establish a switching mechanism between the different learning modes in the learner phase and find the most suitable decision scheme. The four possible actions are listed below:
When the student is learning in learning mode 1, they decide to stay in learning mode 1
When the student is learning in learning mode 2, they decide to stay in learning mode 2
When the student is learning in learning mode 1, they decide to switch to learning mode 2
When the student is learning in learning mode 2, they decide to switch to learning mode 1
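With a 2 × 2 Q-Table (rows: current mode; columns: mode chosen for the next step), greedily picking the column with the largest Q-value realizes these four actions. A minimal sketch, with assumed Q-values:

```python
import numpy as np

def next_mode(Q, mode):
    """Pick the next learning mode greedily from the Q-Table row
    of the current mode; columns index the mode chosen next."""
    return int(np.argmax(Q[mode]))

Q = np.array([[0.2, 0.5],   # from mode 1: stay (0.2) vs. switch to mode 2 (0.5)
              [0.1, 0.4]])  # from mode 2: switch to mode 1 (0.1) vs. stay (0.4)
print(next_mode(Q, 0))  # 1 -> switch from mode 1 to mode 2
print(next_mode(Q, 1))  # 1 -> stay in mode 2
```

Each (row, column) pair of the table corresponds to one of the four actions listed above.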
Figure 5.

The reward table and Q-Table example of RLTLBO. (a) Reward Table sample (b) Q-Table sample.
The most important value of the RL strategy is that it helps the algorithm switch between the different learning modes as needed during the learner phase. As a result, the algorithm can find better solutions faster and more effectively in the search space, considerably increasing search efficiency. Therefore, the convergence speed of the algorithm is effectively improved.
3.3. The Detail of RLTLBO
In the improved TLBO algorithm, the teacher phase of the basic TLBO is carried out first. Then, the learner phase with the RL strategy is implemented to achieve an effective and efficient investigation of the search space. Finally, ROBL is added to enhance the ability to avoid local optima. The random opposite solution increases the probability of the algorithm finding a better solution. This variant of TLBO, which incorporates RL, is named RLTLBO. The pseudocode and the flowchart of the proposed RLTLBO algorithm are shown in Algorithm 1 and Figure 6, respectively.
Algorithm 1.

Pseudocode of RLTLBO.
Figure 6.

The flowchart of RLTLBO.
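For reference, a condensed, self-contained sketch of this loop (teacher phase, ROBL, Q-Table-driven learner phase). The reward values, greedy mode selection, and the weighting in mode 2 are simplifications and assumptions, not the paper's exact pseudocode:

```python
import numpy as np

def sphere(x):
    return float((x ** 2).sum())

def rltlbo_sketch(f, lb, ub, n=20, dim=5, T=100, seed=0):
    """Simplified RLTLBO-style loop under stated assumptions."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lb, ub, size=(n, dim))
    fit = np.array([f(x) for x in pop])
    Q = np.zeros((2, 2))
    mode = 0
    for t in range(T):
        best = pop[np.argmin(fit)].copy()
        best_fit = fit.min()
        mean = pop.mean(axis=0)
        # --- teacher phase, equation (1), with greedy acceptance ---
        tf = rng.integers(1, 3)
        cand = np.clip(pop + rng.random((n, dim)) * (best - tf * mean), lb, ub)
        cf = np.array([f(x) for x in cand])
        mask = cf < fit
        pop[mask], fit[mask] = cand[mask], cf[mask]
        # --- ROBL, equation (3), after the phase ---
        opp = np.clip(lb + ub - rng.random((n, dim)) * pop, lb, ub)
        of = np.array([f(x) for x in opp])
        mask = of < fit
        pop[mask], fit[mask] = opp[mask], of[mask]
        # --- learner phase: greedy mode choice from the Q-Table (assumption) ---
        action = int(np.argmax(Q[mode]))
        best = pop[np.argmin(fit)].copy()
        for i in range(n):
            r1, r2 = rng.choice(n, size=2, replace=False)
            if action == 0:   # mode 1: peer-to-peer learning, equation (2)
                d = pop[r1] - pop[r2] if fit[r1] < fit[r2] else pop[r2] - pop[r1]
                c = pop[i] + rng.random(dim) * d
            else:             # mode 2: iteration-weighted teacher term (assumed form)
                w = t / T
                c = pop[i] + (1 - w) * rng.random(dim) * (pop[r1] - pop[i]) \
                           + w * rng.random(dim) * (best - pop[i])
            c = np.clip(c, lb, ub)
            cf_i = f(c)
            if cf_i < fit[i]:
                pop[i], fit[i] = c, cf_i
        # reward +1 if the global best improved this iteration, else -1 (assumed)
        r = 1.0 if fit.min() < best_fit else -1.0
        Q[mode, action] += 0.1 * (r + 0.9 * Q[action].max() - Q[mode, action])
        mode = action
    return fit.min()

result = rltlbo_sketch(sphere, -100.0, 100.0)
print(result < 1.0)
```

Even this simplified version converges quickly on the sphere function, illustrating how the greedy acceptance in both phases combines with ROBL's opposite candidates.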
3.4. Computational Complexity Analysis
RLTLBO mainly consists of three components: initialization, fitness evaluation, and position updating. In the initialization phase, the computational complexity of generating positions is O(N). Then, the computational complexity of evaluating the fitness of the solutions is O(2 × N) during the iteration process. Finally, ROBL is utilized to keep the algorithm from falling into local optima. Thus, the computational complexity of position updating in RLTLBO is O(2 × N × D), where D is the dimension of the problem. Therefore, the total computational complexity of the proposed RLTLBO algorithm is O(3 × N + 2 × N × D).
4. Numerical Experiments and Results
In this section, two kinds of benchmark functions are used to evaluate the performance of the proposed RLTLBO algorithm. The standard benchmark functions are tested first to assess the algorithm on twenty-three simple numerical problems. Then, the CEC2017 benchmark functions are utilized to evaluate the algorithm on complex numerical problems. RLTLBO is compared with three types of existing algorithms: the classic methods TLBO and SSA; the recently proposed algorithms HOA [31], AO, and HHO [30]; and the improved algorithms mGWO [32], MALO [33], and DSCA [34]. For consistency across all tests, we set the population size to N = 30, the dimension to D = 30, and the maximum number of iterations to T = 500. All algorithms are run 30 times independently, and the average values and standard deviations are presented as the final experimental results. All experiments are implemented in MATLAB R2020b on a PC with an Intel (R) Core (TM) i5-9500 CPU @ 3.00 GHz and 16 GB of RAM running Windows 10.
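The reporting protocol above (30 independent runs, then mean and standard deviation of the final objective values) can be sketched as follows; `run_once` is a hypothetical stand-in for one full optimization run with a given random seed:

```python
import numpy as np

def report(run_once, n_runs=30):
    """Run a stochastic optimizer n_runs times with different seeds
    and report the mean and standard deviation of its results."""
    results = np.array([run_once(seed) for seed in range(n_runs)])
    return results.mean(), results.std()

# Hypothetical stand-in: one "run" just draws a seeded random value.
mean, std = report(lambda seed: np.random.default_rng(seed).random())
print(0.0 <= mean <= 1.0, std >= 0.0)  # True True
```

Seeding each run independently keeps the 30 trials reproducible while still sampling the algorithm's stochastic behavior.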
4.1. Standard Benchmark Function Experiments
Standard benchmark functions [47] can be divided into three types: unimodal, multimodal and fixed-dimension multimodal functions. Unimodal functions only have one global optimum and no local optima, which can be used to evaluate an algorithm's convergence rate and exploitation capability. Multimodal and fixed-dimension multimodal functions have a global optimum and multiple local optima. This characteristic makes these functions effective for testing the exploration and local optima avoidance abilities of an algorithm. The benchmark function details are listed in Tables 1–3.
Table 1.
Unimodal benchmark functions.
| Function | Dim | Range | f min |
|---|---|---|---|
| F 1(x)=∑i=1nxi2 | 30 | [−100, 100] | 0 |
| F 2(x)=∑i=1n|xi|+∏i=1n|xi| | 30 | [−10, 10] | 0 |
| F 3(x)=∑i=1n(∑j−1ixj)2 | 30 | [−100, 100] | 0 |
| F 4(x)=maxi{|xi|, 1 ≤ i ≤ n} | 30 | [−100, 100] | 0 |
| F 5(x)=∑i=1n−1[100(xi+1 − xi2)2+(xi − 1)2] | 30 | [−30, 30] | 0 |
| F 6(x)=∑i=1n(⌊xi+0.5⌋)2 | 30 | [−100, 100] | 0 |
| F 7(x)=∑i=1nixi4+random[0,1) | 30 | [−1.28, 1.28] | 0 |
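Two of the listed benchmarks, implemented directly from their formulas: the unimodal F1 (sphere) above and, from Table 2, the multimodal F9 (Rastrigin); both have a global minimum of 0 at the origin.

```python
import numpy as np

def f1(x):
    """F1: sphere function, sum of squares."""
    return float(np.sum(x ** 2))

def f9(x):
    """F9: Rastrigin function, sum of x_i^2 - 10*cos(2*pi*x_i) + 10."""
    return float(np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x) + 10))

x0 = np.zeros(30)            # 30-dimensional, matching the Dim column
print(f1(x0), f9(x0))  # 0.0 0.0
```

F1's single basin makes it a test of exploitation, while F9's cosine term creates a grid of local optima that tests exploration.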
Table 2.
Multimodal benchmark functions.
| Function | Dim | Range | f min |
|---|---|---|---|
| 30 | [−500, 500] | −418.9829 × Dim | |
| F 9(x)=∑i=1n[xi2 − 10 cos(2πxi)+10] | 30 | [−5.12, 5.12] | 0 |
| 30 | [−32, 32] | 0 | |
| 30 | [−600, 600] | 0 | |
| 30 | [−50, 50] | 0 | |
| 30 | [−50, 50] | 0 |
Table 3.
Fixed-dimension multimodal benchmark functions.
| Function | Dim | Range | f min |
|---|---|---|---|
| F 14(x)=(1/500+∑j=125 1/(j+∑i=12(xi − aij)6))−1 | 2 | [−65, 65] | 0.998 |
| F 15(x)=∑i=111[ai − x1(bi2+bix2)/(bi2+bix3+x4)]2 | 4 | [−5, 5] | 0.00030 |
| F 16(x)=4x12 − 2.1x14+1/3x16+x1x2 − 4x22+x24 | 2 | [−5, 5] | −1.0316 |
| F 17(x)=(x2 − 5.1/(4π2)x12+5/π x1 − 6)2+10(1 − 1/(8π))cos x1+10 | 2 | [−5, 5] | 0.398 |
| 2 | [−2, 2] | 3 | |
| F 19(x)=−∑i=14ciexp(−∑j=13aij(xj − pij)2) | 3 | [−1, 2] | −3.86 |
| F 20(x)=−∑i=14ciexp(−∑j=16aij(xj − pij)2) | 6 | [0, 1] | −3.32 |
| F 21(x)=−∑i=15[(X − ai)(X − ai)T+ci]−1 | 4 | [0, 10] | −10.1532 |
| F 22(x)=−∑i=17[(X − ai)(X − ai)T+ci]−1 | 4 | [0, 10] | −10.4028 |
| F 23(x)=−∑i=110[(X − ai)(X − ai)T+ci]−1 | 4 | [0, 10] | −10.5363 |
4.1.1. Qualitative Results
The results for the 23 standard benchmark functions are shown in Table 4, with the best results in bold. For the unimodal functions F1–F7, the RLTLBO algorithm achieves the best average values and standard deviations among all comparative algorithms on most functions, and only obtains worse results on F5 and F6. RLTLBO obtains the theoretical optimum on F1 and F3. It can be concluded from the comparison results that RLTLBO is strongly competitive on the unimodal functions, which indicates that its excellent exploitation capability comes from the RL mechanism.
Table 4.
Results of algorithms on 23 standard benchmark functions.
| Function | RLTLBO | TLBO | mGWO | MALO | DSCA | HOA | AO | HHO | SSA | |
|---|---|---|---|---|---|---|---|---|---|---|
| F1 | Mean | 0.00E + 00 | 3.90E − 79 | 4.26E − 19 | 1.37E − 03 | 2.55E − 288 | 3.13E − 136 | 2.34E − 104 | 8.97E − 98 | 1.30E − 07 |
| Std | 0.00E + 00 | 6.59E − 79 | 1.08E − 18 | 1.56E − 03 | 0.00E + 00 | 1.21E − 135 | 1.08E − 103 | 4.16E − 97 | 1.09E − 07 | |
| F2 | Mean | 1.29E − 223 | 4.17E − 40 | 3.37E − 12 | 6.86E + 01 | 5.92E − 171 | 4.44E − 68 | 2.82E − 53 | 1.34E − 48 | 1.79E + 00 |
| Std | 0.00E + 00 | 3.21E − 40 | 2.54E − 12 | 4.90E + 01 | 0.00E + 00 | 2.42E − 67 | 1.13E − 52 | 5.75E − 48 | 1.15E + 00 | |
| F3 | Mean | 0.00E + 00 | 2.50E − 17 | 6.41E − 01 | 4.81E + 03 | 1.43E − 241 | 2.23E + 02 | 2.22E − 101 | 7.16E − 79 | 1.61E + 03 |
| Std | 0.00E + 00 | 4.35E − 17 | 1.46E + 00 | 2.18E + 03 | 0.00E + 00 | 5.03E + 02 | 1.22E − 100 | 3.56E − 78 | 1.03E + 03 | |
| F4 | Mean | 3.07E − 221 | 1.72E − 32 | 2.42E − 03 | 1.64E + 01 | 1.97E − 134 | 5.04E − 65 | 3.20E − 53 | 2.51E − 48 | 1.11E + 01 |
| Std | 0.00E + 00 | 1.76E − 32 | 3.02E − 03 | 4.23E + 00 | 1.08E − 133 | 1.84E − 64 | 1.75E − 52 | 8.46E − 48 | 3.74E + 00 | |
| F5 | Mean | 2.65E + 01 | 2.42E + 01 | 2.64E + 01 | 9.86E − 01 | 2.85E + 01 | 2.89E + 01 | 6.82E − 03 | 1.22E − 02 | 2.55E + 02 |
| Std | 4.01E − 01 | 7.41E − 01 | 8.44E − 01 | 5.21E + 00 | 3.59E − 01 | 7.45E − 02 | 1.66E − 02 | 1.79E − 02 | 3.44E + 02 | |
| F6 | Mean | 9.03E − 02 | 2.57E − 06 | 4.54E − 01 | 5.00E − 04 | 6.01E + 00 | 6.46E + 00 | 4.43E − 05 | 9.58E − 05 | 1.28E − 07 |
| Std | 1.15E − 01 | 7.98E − 06 | 3.20E − 01 | 3.05E − 04 | 1.61E − 01 | 4.76E − 01 | 6.15E − 05 | 1.24E − 04 | 1.13E − 07 | |
| F7 | Mean | 3.57E − 05 | 1.12E − 03 | 4.61E − 03 | 1.05E − 04 | 2.54E − 04 | 5.88E − 02 | 9.62E − 05 | 1.68E − 04 | 1.81E − 01 |
| Std | 4.71E − 05 | 3.06E − 04 | 1.64E − 03 | 7.89E − 05 | 2.88E − 04 | 4.10E − 02 | 7.92E − 05 | 1.36E − 04 | 8.96E − 02 | |
| F8 | Mean | −7.36E + 03 | −7.85E + 03 | −6.58E + 03 | −1.22E + 04 | −3.96E + 03 | −4.30E + 03 | −8.92E + 03 | −1.25E + 04 | −7.56E + 03 |
| Std | 6.78E + 02 | 9.32E + 02 | 1.24E + 03 | 1.08E + 03 | 4.31E + 02 | 7.82E + 02 | 3.77E + 03 | 8.42E + 01 | 7.07E + 02 | |
| F9 | Mean | 0.00E + 00 | 1.41E + 01 | 1.70E + 01 | 8.44E + 01 | 0.00E + 00 | 5.06E + 01 | 0.00E + 00 | 0.00E + 00 | 5.19E + 01 |
| Std | 0.00E + 00 | 6.20E + 00 | 9.11E + 00 | 3.15E + 01 | 0.00E + 00 | 9.32E + 01 | 0.00E + 00 | 0.00E + 00 | 1.88E + 01 | |
| F10 | Mean | 8.88E − 16 | 7.05E − 15 | 1.14E + 00 | 4.77E + 00 | 8.88E − 16 | 6.10E − 15 | 8.88E − 16 | 8.88E − 16 | 2.62E + 00 |
| Std | 0.00E + 00 | 1.60E − 15 | 1.88E + 00 | 2.64E + 00 | 0.00E + 00 | 2.42E − 15 | 0.00E + 00 | 0.00E + 00 | 8.98E − 01 | |
| F11 | Mean | 0.00E + 00 | 3.29E − 04 | 4.86E − 03 | 6.05E − 02 | 0.00E + 00 | 1.18E − 01 | 0.00E + 00 | 0.00E + 00 | 2.24E − 02 |
| Std | 0.00E + 00 | 1.80E − 03 | 9.13E − 03 | 2.33E − 02 | 0.00E + 00 | 2.57E − 01 | 0.00E + 00 | 0.00E + 00 | 1.45E − 02 | |
| F12 | Mean | 8.32E − 04 | 5.38E − 07 | 3.51E − 02 | 1.60E − 05 | 8.37E − 01 | 1.23E + 00 | 3.04E − 06 | 1.02E − 05 | 7.22E + 00 |
| Std | 1.52E − 03 | 2.76E − 06 | 4.56E − 02 | 1.16E − 05 | 1.08E − 01 | 2.42E − 01 | 4.59E − 06 | 1.12E − 05 | 3.01E + 00 | |
| F13 | Mean | 2.00E + 00 | 7.41E − 02 | 3.83E − 01 | 1.70E − 03 | 2.76E + 00 | 3.08E + 00 | 4.57E − 05 | 8.69E − 05 | 2.19E + 01 |
| Std | 1.17E + 00 | 8.70E − 02 | 2.15E − 01 | 3.95E − 03 | 5.11E − 02 | 1.83E − 01 | 1.18E − 04 | 9.70E − 05 | 1.44E + 01 | |
| F14 | Mean | 1.06E + 00 | 9.98E − 01 | 9.98E − 01 | 1.46E + 00 | 1.35E + 00 | 2.78E + 00 | 4.06E + 00 | 1.36E + 00 | 1.16E + 00 |
| Std | 3.62E − 01 | 0.00E + 00 | 3.81E − 12 | 7.69E − 01 | 6.1E − 01 | 2.07E + 00 | 4.46E + 00 | 9.52E − 01 | 4.57E − 01 | |
| F15 | Mean | 3.55E − 04 | 3.82E − 04 | 3.04E − 03 | 1.40E − 03 | 8.91E − 04 | 6.77E − 03 | 5.00E − 04 | 4.01E − 04 | 3.55E − 03 |
| Std | 1.02E − 04 | 1.54E − 04 | 6.91E − 03 | 3.62E − 03 | 3.99E − 04 | 5.47E − 03 | 1.10E − 04 | 2.36E − 04 | 6.71E − 03 | |
| F16 | Mean | −1.03E + 00 | −1.03E + 00 | −1.03E + 00 | −1.03E + 00 | −1.03E + 00 | −9.99E − 01 | −1.03E + 00 | −1.03E + 00 | −1.03E + 00 |
| Std | 6.58E − 16 | 6.95E − 16 | 3.39E − 08 | 1.65E − 13 | 3.99E − 04 | 3.29E − 02 | 3.01E − 04 | 3.76E − 09 | 1.83E − 14 | |
| F17 | Mean | 3.98E − 01 | 3.98E − 01 | 3.98E − 01 | 3.98E − 01 | 4.09E − 01 | 3.99E − 01 | 3.98E − 01 | 3.98E − 01 | 3.98E − 01 |
| Std | 0.00E + 00 | 0.00E + 00 | 6.52E − 09 | 5.57E − 14 | 1.06E − 02 | 1.08E − 03 | 1.09E − 04 | 4.60E − 06 | 7.21E − 15 | |
| F18 | Mean | 3.00E + 00 | 3.00E + 00 | 3.00E + 00 | 3.00E + 00 | 3.00E + 00 | 4.94E + 00 | 3.03E + 00 | 3.00E + 00 | 3.00E + 00 |
| Std | 4.95E − 16 | 1.24E − 15 | 1.03E − 07 | 5.76E − 13 | 8.33E − 04 | 6.82E + 00 | 5.73E − 02 | 3.88E − 07 | 2.87E − 13 | |
| F19 | Mean | −3.86E + 00 | −3.86E + 00 | −3.86E + 00 | −3.86E + 00 | −3.82E + 00 | −3.86E + 00 | −3.85E + 00 | −3.86E + 00 | −3.86E + 00 |
| Std | 2.71E − 15 | 3.16E − 15 | 1.08E − 06 | 6.39E − 13 | 2.33E − 02 | 6.99E − 04 | 6.96E − 03 | 2.07E − 03 | 1.09E − 12 | |
| F20 | Mean | −3.31E + 00 | −3.30E + 00 | −3.23E + 00 | −3.23E + 00 | −2.80E + 00 | −3.25E + 00 | −3.16E + 00 | −3.08E + 00 | −3.23E + 00 |
| Std | 2.95E − 02 | 4.12E − 02 | 6.47E − 02 | 5.14E − 02 | 2.71E − 01 | 9.05E − 02 | 8.91E − 02 | 1.22E − 01 | 6.22E − 02 | |
| F21 | Mean | −1.02E + 01 | −1.02E + 01 | −9.98E + 00 | −7.62E + 00 | −3.27E + 00 | −9.43E + 00 | −1.01E + 01 | −5.18E + 00 | −8.07E + 00 |
| Std | 6.04E − 09 | 1.41E − 03 | 9.30E − 01 | 2.82E + 00 | 1.54E + 00 | 9.62E − 01 | 2.09E − 02 | 7.51E − 01 | 3.28E + 00 | |
| F22 | Mean | −1.04E + 01 | −1.01E + 01 | −1.04E + 01 | −7.06E + 00 | −3.87E + 00 | −9.36E + 00 | −1.04E + 01 | −5.08E + 00 | −9.32E + 00 |
| Std | 1.23E − 07 | 1.25E + 00 | 4.45E − 04 | 3.48E + 00 | 1.17E + 00 | 1.69E + 00 | 5.50E − 02 | 6.94E − 03 | 2.51E + 00 | |
| F23 | Mean | −1.05E + 01 | −1.01E + 01 | −1.05E + 01 | −7.31E + 00 | −4.19E + 00 | −9.63E + 00 | −1.05E + 01 | −5.24E + 00 | −7.89E + 00 |
| Std | 1.57E − 07 | 1.57E + 00 | 3.42E − 04 | 3.55E + 00 | 1.11E + 00 | 1.52E + 00 | 2.23E − 02 | 9.58E − 01 | 3.59E + 00 | |
For the multimodal and fixed-dimension multimodal functions F8–F23, it can be seen from Table 4 that RLTLBO achieves the smallest average values and standard deviations on 12 of all 16 test functions compared to other methods, which indicates a very high accuracy and stability. Several poor results appear on F8 and F12–F14, but they are not the worst results. The satisfying results on the multimodal and fixed-dimension multimodal functions prove that the exploration and local optima avoidance capabilities of the RLTLBO are excellent, which might be derived from the ROBL strategy.
Figure 7 provides the convergence curves of RLTLBO and the comparative algorithms on the 23 standard benchmark functions. The convergence rate reflected by these curves shows the improvement in exploration and exploitation more intuitively. For F1–F4, F7, F9–F11, and F15–F21, RLTLBO presents a faster convergence speed than the other meta-heuristic algorithms, and its convergence accuracy is also the best. RLTLBO ranks second in terms of convergence speed on F22 and F23. For benchmark functions F5, F6, F8, and F12–F14, RLTLBO does not perform very well, consistent with the results in Table 4.
Figure 7.

Convergence curves of 23 standard benchmark functions.
4.1.2. The Wilcoxon Test
Table 5 lists the Wilcoxon rank-sum test [48] results, which assess the statistical significance of the performance differences between RLTLBO and the comparative algorithms. A p-value below 0.05 indicates a statistically significant difference between the two compared methods. The overwhelming majority of p-values in Table 5 are below 0.05, indicating that the differences between RLTLBO and the other methods are statistically significant. Combined with the results in Table 4, it can be concluded that RLTLBO outperforms the comparative algorithms, with strong exploration and exploitation capabilities.
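As an illustration, the rank-sum comparison used here can be reproduced with a small self-contained implementation. The sketch below uses the normal approximation to the rank-sum statistic (the exact small-sample p-values reported in Table 5, such as 6.10E − 05, come from the exact distribution and will differ slightly from this approximation); the two run arrays are hypothetical data, not values from the paper.

```python
import math

def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns an approximate p-value for the null hypothesis that the two
    independent samples a and b come from the same distribution.
    """
    pooled = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1                         # extend the group of tied values
        mean_rank = (i + j) / 2.0 + 1.0    # average rank assigned to ties
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = mean_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])                    # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value

# Final fitness values from two hypothetical sets of independent runs.
runs_a = [1e-40, 3e-41, 5e-42, 2e-40, 8e-41, 1e-41, 4e-40, 6e-41, 2e-41, 9e-41]
runs_b = [1e-3, 4e-3, 2e-3, 7e-4, 5e-3, 3e-3, 9e-4, 2e-3, 6e-3, 1e-3]
p = rank_sum_p(runs_a, runs_b)  # every value of runs_a is smaller -> tiny p
```

Since every value in `runs_a` is below every value in `runs_b`, the resulting p-value falls far below the 0.05 threshold, and the two samples would be judged significantly different.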
Table 5.
p-Values from the Wilcoxon rank-sum test for the results in Table 4.
| Function | RLTLBO vs. | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | TLBO | mGWO | MALO | DSCA | HOA | AO | HHO | SSA |
| F1 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | NaN | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F2 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 04 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F3 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 1.56E − 02 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F4 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F5 | 6.10E − 05 | 3.30E − 01 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 8.54E − 04 |
| F6 | 6.10E − 05 | 1.22E − 04 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F7 | 6.10E − 05 | 6.10E − 05 | 4.89E − 01 | 6.10E − 04 | 6.10E − 05 | 4.89E − 01 | 7.30E − 02 | 6.10E − 05 |
| F8 | 1.03E − 02 | 6.37E − 02 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 1.21E − 01 | 6.10E − 05 | 5.61E − 01 |
| F9 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | NaN | 1.25E − 01 | NaN | NaN | 6.10E − 05 |
| F10 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | NaN | 6.10E − 05 | NaN | NaN | 6.10E − 05 |
| F11 | NaN | 1.95E − 03 | 6.10E − 05 | NaN | 3.12E − 02 | NaN | NaN | 6.10E − 05 |
| F12 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F13 | 3.05E − 04 | 6.10E − 04 | 6.10E − 05 | 3.89E − 01 | 2.01E − 03 | 6.10E − 05 | 6.10E − 05 | 3.05E − 04 |
| F14 | NaN | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F15 | 8.90E − 01 | 2.01E − 03 | 1.83E − 04 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 8.36E − 03 | 6.10E − 05 |
| F16 | NaN | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 1.22E − 04 | 6.10E − 05 |
| F17 | NaN | 6.10E − 05 | 2.44E − 04 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 9.76E − 04 |
| F18 | NaN | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F19 | NaN | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| F20 | 8.52E − 01 | 4.13E − 02 | 1.35E − 01 | 6.10E − 05 | 2.01E − 03 | 6.10E − 05 | 6.10E − 05 | 3.05E − 04 |
| F21 | 1.68E − 01 | 6.10E − 05 | 4.79E − 02 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 1.03E − 02 |
| F22 | 6.25E − 02 | 6.10E − 05 | 2.56E − 02 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 4.13E − 02 |
| F23 | 7.81E − 03 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 2.56E − 02 |
4.2. CEC2017 Benchmark Function Experiments
The standard benchmark function experiments demonstrate the superior performance of the proposed RLTLBO algorithm on relatively simple optimization problems. CEC2017 [49], one of the most challenging test suites, helps assess performance on complex optimization problems. Several hybrid and composition functions, which are precisely the types of landscape that the standard test set lacks, are selected to further test RLTLBO. The function details and the comparison results are presented in Tables 6 and 7. As before, each method runs 30 times with 30 search agents and 500 iterations. From Table 7, RLTLBO achieves both the best average and the best standard deviation on five of the eight functions; on the remaining three, its averages and standard deviations remain among the best. RLTLBO completely outperforms the TLBO, MALO, HOA, AO, HHO, and SSA methods. The statistical results are listed in Table 8: only seven p-values across all test functions exceed 0.05, indicating considerable differences between RLTLBO and the compared methods. These results suggest that RLTLBO achieves excellent performance on complex problems as well.
Table 6.
Descriptions of the benchmark functions from CEC2017.
| Function | Name | Dim | Range | f min |
|---|---|---|---|---|
| Hybrid functions (N is the number of basic functions) | ||||
| C13 | Hybrid function 3 (N = 3) | 10 | [−100, 100] | 1300 |
| C14 | Hybrid function 4 (N = 4) | 10 | [−100, 100] | 1400 |
| C15 | Hybrid function 5 (N = 4) | 10 | [−100, 100] | 1500 |
| C19 | Hybrid function 6 (N = 5) | 10 | [−100, 100] | 1900 |
| Composite functions (N is the number of basic functions) | ||||
| C22 | Composite function 2 (N = 3) | 10 | [−100, 100] | 2200 |
| C25 | Composite function 5 (N = 5) | 10 | [−100, 100] | 2500 |
| C28 | Composite function 8 (N = 6) | 10 | [−100, 100] | 2800 |
| C29 | Composite function 9 (N = 6) | 10 | [−100, 100] | 2900 |
Table 7.
Comparison results of algorithms on CEC2017.
| Function | | RLTLBO | TLBO | mGWO | MALO | DSCA | HOA | AO | HHO | SSA |
|---|---|---|---|---|---|---|---|---|---|---|
| C13 | Mean | 4.38E + 03 | 6.04E + 03 | 4.35E + 03 | 1.78E + 04 | 6.25E + 05 | 1.53E + 06 | 1.77E + 04 | 1.70E + 04 | 1.46E + 04 |
| | Std | 2.76E + 03 | 4.33E + 03 | 2.99E + 03 | 1.30E + 04 | 4.55E + 05 | 1.28E + 06 | 1.39E + 04 | 1.03E + 04 | 1.29E + 04 |
| C14 | Mean | 1.46E + 03 | 1.47E + 03 | 1.47E + 03 | 2.75E + 03 | 4.78E + 03 | 3.87E + 03 | 2.36E + 03 | 2.20E + 03 | 3.35E + 03 |
| | Std | 1.81E + 01 | 2.40E + 01 | 1.98E + 01 | 2.02E + 03 | 3.76E + 03 | 1.99E + 03 | 1.12E + 03 | 1.05E + 03 | 3.10E + 03 |
| C15 | Mean | 1.62E + 03 | 1.73E + 03 | 1.74E + 03 | 8.28E + 03 | 7.97E + 03 | 2.49E + 04 | 5.91E + 03 | 7.35E + 03 | 1.06E + 04 |
| | Std | 5.96E + 01 | 1.44E + 02 | 2.36E + 02 | 5.72E + 03 | 3.62E + 03 | 1.54E + 04 | 2.16E + 03 | 3.10E + 03 | 7.51E + 03 |
| C19 | Mean | 2.00E + 03 | 2.11E + 03 | 2.65E + 03 | 1.54E + 04 | 3.37E + 04 | 1.69E + 04 | 2.10E + 04 | 1.67E + 04 | 8.46E + 03 |
| | Std | 9.63E + 00 | 3.19E + 02 | 1.68E + 03 | 1.23E + 04 | 3.00E + 04 | 1.34E + 04 | 2.88E + 04 | 1.37E + 04 | 6.44E + 03 |
| C22 | Mean | 2.30E + 03 | 2.30E + 03 | 2.30E + 00 | 2.30E + 03 | 2.55E + 03 | 2.47E + 03 | 2.31E + 03 | 2.41E + 03 | 2.33E + 03 |
| | Std | 1.99E + 01 | 8.68E + 00 | 9.25E − 01 | 2.88E + 01 | 8.10E + 01 | 4.58E + 02 | 5.85E + 00 | 3.85E + 02 | 1.69E + 02 |
| C25 | Mean | 2.92E + 03 | 2.93E + 03 | 2.92E + 03 | 2.93E + 03 | 3.12E + 03 | 2.97E + 03 | 2.94E + 03 | 2.93E + 03 | 2.92E + 03 |
| | Std | 2.32E + 01 | 2.41E + 01 | 2.33E + 01 | 2.38E + 01 | 6.48E + 01 | 2.35E + 01 | 2.50E + 01 | 6.24E + 01 | 2.45E + 01 |
| C28 | Mean | 3.23E + 03 | 3.30E + 03 | 3.33E + 03 | 3.31E + 03 | 3.40E + 03 | 3.50E + 03 | 3.44E + 03 | 3.45E + 03 | 3.29E + 03 |
| | Std | 1.15E + 02 | 1.60E + 02 | 1.12E + 02 | 1.47E + 02 | 9.48E + 01 | 1.06E + 02 | 1.09E + 02 | 1.45E + 02 | 1.68E + 02 |
| C29 | Mean | 3.18E + 03 | 3.19E + 03 | 3.17E + 03 | 3.27E + 03 | 3.38E + 03 | 3.38E + 03 | 3.26E + 03 | 3.37E + 03 | 3.27E + 03 |
| | Std | 1.84E + 01 | 2.16E + 01 | 2.13E + 01 | 6.15E + 01 | 5.77E + 01 | 6.58E + 01 | 5.87E + 01 | 1.20E + 02 | 7.20E + 01 |
Table 8.
p-Values from the Wilcoxon rank-sum test for the results in Table 7.
| Function | RLTLBO vs. | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | TLBO | mGWO | MALO | DSCA | HOA | AO | HHO | SSA |
| C13 | 2.90E − 02 | 3.59E − 01 | 1.81E − 02 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 3.36E − 03 | 1.22E − 04 |
| C14 | 1.35E − 01 | 4.37E − 02 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 1.83E − 04 | 6.10E − 05 |
| C15 | 3.36E − 03 | 8.36E − 03 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| C19 | 1.24E − 02 | 3.30E − 02 | 8.54E − 04 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 |
| C22 | 4.27E − 03 | 4.13E − 02 | 4.04E − 02 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 8.47E − 02 |
| C25 | 5.61E − 01 | 8.47E − 01 | 8.47E − 01 | 6.10E − 05 | 2.01E − 03 | 1.21E − 02 | 1.69E − 02 | 3.62E − 01 |
| C28 | 2.48E − 02 | 1.51E − 02 | 4.79E − 02 | 1.81E − 02 | 1.53E − 03 | 6.10E − 04 | 5.37E − 03 | 4.21E − 02 |
| C29 | 4.54E − 03 | 6.10E − 04 | 6.10E − 05 | 6.10E − 05 | 6.10E − 05 | 8.54E − 04 | 6.10E − 05 | 1.51E − 02 |
5. Experiments on Industrial Engineering Design Problems
In this section, eight well-known constrained industrial engineering design problems, including the welded beam design problem, pressure vessel design problem, tension and compression spring design problem, speed reducer design problem, three-bar truss design problem, car crashworthiness design problem, tubular column design problem, and frequency-modulated sound wave design problem, are solved to further verify the performance of the proposed RLTLBO algorithm. The results of RLTLBO are compared to various optimization methods proposed in previous studies.
5.1. Welded Beam Design Problem
The purpose of this problem is to minimize the cost of the welded beam (Figure 8). Four variables need to be optimized: the thickness of the weld (h), the thickness of the bar (b), the length of the bar (l), and the height of the bar (t). The mathematical formulation is as follows:
Minimize

f(h, l, t, b) = 1.10471h²l + 0.04811tb(14.0 + l)

subject to

g1 = τ(x) − τmax ≤ 0, g2 = σ(x) − σmax ≤ 0, g3 = h − b ≤ 0, g4 = 0.10471h² + 0.04811tb(14.0 + l) − 5.0 ≤ 0, g5 = 0.125 − h ≤ 0, g6 = δ(x) − δmax ≤ 0, g7 = P − Pc(x) ≤ 0 | (6) |

Variable range

0.1 ≤ h, b ≤ 2, 0.1 ≤ l, t ≤ 10 | (7) |

where the shear stress in the weld τ(x), the bending stress in the beam σ(x), the end deflection of the beam δ(x), and the buckling load on the bar Pc(x) are computed from the design variables and the constants P = 6,000 lb, L = 14 in., τmax = 13,600 psi, σmax = 30,000 psi, and δmax = 0.25 in. | (8) |
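As a quick numerical check, the cost function can be evaluated directly. The sketch below (a verification aid using the standard welded-beam cost from the literature, not part of RLTLBO itself) reproduces the RLTLBO cost reported in Table 9 from its design variables:

```python
def welded_beam_cost(h, l, t, b):
    """Fabrication cost of the welded beam: weld material plus bar material."""
    return 1.10471 * h**2 * l + 0.04811 * t * b * (14.0 + l)

# Design variables found by RLTLBO (Table 9)
cost = welded_beam_cost(0.205730, 3.253000, 9.036600, 0.205730)
# agrees with the tabulated optimum cost 1.695200 to about 1e-4
```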
Figure 8.

Welded beam design problem.
The RLTLBO is compared to the SMA [50], WOA, MPA [51], MVO [52], GA, and HS [53] methods. The comparison results presented in Table 9 show the superiority of the RLTLBO algorithm, which attains a smaller cost than the other algorithms.
Table 9.
Comparison results for the welded beam design problem.
| Algorithm | Optimum variables | | | | Optimum cost |
|---|---|---|---|---|---|
| | h | l | t | b | |
| RLTLBO | 0.205730 | 3.253000 | 9.036600 | 0.205730 | 1.695200 |
| SMA [50] | 0.205400 | 3.258900 | 9.038400 | 0.205800 | 1.696040 |
| WOA [14] | 0.205396 | 3.484293 | 9.037426 | 0.206276 | 1.730499 |
| MPA [51] | 0.205728 | 3.470509 | 9.036624 | 0.205730 | 1.724853 |
| MVO [52] | 0.205463 | 3.473193 | 9.044502 | 0.205695 | 1.726450 |
| GA [6] | 0.248900 | 6.173000 | 8.178900 | 0.253300 | 2.430000 |
| HS [53] | 0.244200 | 6.223100 | 8.291500 | 0.240000 | 2.380700 |
5.2. Pressure Vessel Design Problem
The objective of this problem is to minimize the fabrication cost of a cylindrical pressure vessel that meets the pressure requirements. As shown in Figure 9, four structural parameters need to be optimized: the thickness of the shell (Ts), the thickness of the head (Th), the inner radius (R), and the length of the cylindrical section without the head (L). The formulation with its four optimization constraints can be described as follows:
Minimize

f(Ts, Th, R, L) = 0.6224TsRL + 1.7781ThR² + 3.1661Ts²L + 19.84Ts²R

subject to

g1 = −Ts + 0.0193R ≤ 0, g2 = −Th + 0.00954R ≤ 0, g3 = −πR²L − (4/3)πR³ + 1,296,000 ≤ 0, g4 = L − 240 ≤ 0 | (9) |

Variable range

0 ≤ Ts, Th ≤ 99, 10 ≤ R, L ≤ 200 | (10) |
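The standard pressure-vessel objective and constraints can be coded in a few lines. As a sanity check, the sketch below (an illustration, not part of RLTLBO) evaluates the WOA design reported in Table 10 and recovers its tabulated cost:

```python
import math

def pressure_vessel_cost(ts, th, r, l):
    """Total cost: material, forming, and welding of the cylindrical vessel."""
    return (0.6224 * ts * r * l + 1.7781 * th * r**2
            + 3.1661 * ts**2 * l + 19.84 * ts**2 * r)

def pressure_vessel_constraints(ts, th, r, l):
    """Constraint values g_i, all of which must satisfy g_i <= 0."""
    return [
        -ts + 0.0193 * r,                     # shell thickness vs. radius
        -th + 0.00954 * r,                    # head thickness vs. radius
        -math.pi * r**2 * l - 4.0 / 3.0 * math.pi * r**3 + 1296000.0,  # volume
        l - 240.0,                            # length limit
    ]

# WOA design from Table 10: the cost matches the tabulated 6059.741
cost = pressure_vessel_cost(0.8125, 0.4375, 42.098270, 176.638998)
```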
Figure 9.

Pressure vessel design problem.
From the results in Table 10, it is obvious that RLTLBO can obtain superior optimal values compared to AO, SMA, WOA, GWO, MVO, GA, and ES [54].
Table 10.
Comparison results for the pressure vessel design problem.
| Algorithm | Optimum variables | Optimum cost | |||
|---|---|---|---|---|---|
| Ts | Th | R | L | ||
| RLTLBO | 0.7698901 | 0.4201098 | 42.536830 | 171.348900 | 5926.77920 |
| AO [15] | 1.0540000 | 0.1828060 | 59.621900 | 38.8050000 | 5949.22580 |
| SMA [50] | 0.7931000 | 0.3932000 | 40.671100 | 196.217800 | 5994.18570 |
| WOA [14] | 0.8125000 | 0.4375000 | 42.098270 | 176.638998 | 6059.74100 |
| GWO [13] | 0.8125000 | 0.4345000 | 42.089200 | 176.758700 | 6051.56390 |
| MVO [52] | 0.8125000 | 0.4375000 | 42.090738 | 176.738690 | 6060.80660 |
| GA [6] | 0.8125000 | 0.4375000 | 42.097398 | 176.654050 | 6059.94634 |
| ES [54] | 0.8125000 | 0.4375000 | 42.098087 | 176.640518 | 6059.74560 |
5.3. Tension/Compression Spring Design Problem
This problem aims to minimize the weight of the tension/compression spring (Figure 10). Three variables need to be optimized: the wire diameter (d), the mean coil diameter (D), and the number of active coils (N). This problem can be described as follows:
Minimize

f(d, D, N) = (N + 2)Dd²

subject to

g1 = 1 − D³N/(71785d⁴) ≤ 0, g2 = (4D² − dD)/(12566(Dd³ − d⁴)) + 1/(5108d²) − 1 ≤ 0, g3 = 1 − 140.45d/(D²N) ≤ 0, g4 = (D + d)/1.5 − 1 ≤ 0 | (11) |

Variable range

0.05 ≤ d ≤ 2.00, 0.25 ≤ D ≤ 1.30, 2.00 ≤ N ≤ 15.0 | (12) |
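Again the tabulated optimum can be checked numerically. The sketch below (a verification aid using the standard spring formulation) evaluates the weight and the deflection constraint g1 at the RLTLBO design from Table 11; g1 is nearly active there, as expected at a constrained optimum:

```python
def spring_weight(d, D, N):
    """Weight of the tension/compression spring: (N + 2) * D * d^2."""
    return (N + 2.0) * D * d**2

def spring_g1(d, D, N):
    """Minimum-deflection constraint g1 <= 0 of the standard formulation."""
    return 1.0 - (D**3 * N) / (71785.0 * d**4)

# RLTLBO design from Table 11: weight matches the tabulated 0.010938
w = spring_weight(0.0551180, 0.505900, 5.1167000)
```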
Figure 10.

Tension/compression spring design problem.
The RLTLBO is compared to the AO, SSA, WOA, GWO, PSO, GA, and HS algorithms. The results listed in Table 11 show that RLTLBO obtains the best weight among all compared algorithms.
Table 11.
Comparison results for the tension/compression spring design problem.
| Algorithm | Optimum variables | | | Optimum weight |
|---|---|---|---|---|
| | d | D | N | |
| RLTLBO | 0.0551180 | 0.505900 | 5.1167000 | 0.01093800 |
| AO [15] | 0.0502439 | 0.352620 | 10.542500 | 0.01116500 |
| SSA [12] | 0.0512070 | 0.345215 | 12.004032 | 0.01267630 |
| WOA [14] | 0.0512070 | 0.345215 | 12.004032 | 0.01267630 |
| GWO [13] | 0.0516900 | 0.356737 | 11.288850 | 0.01266600 |
| PSO [11] | 0.0517280 | 0.357644 | 11.244543 | 0.01267470 |
| GA [6] | 0.0514800 | 0.351661 | 11.632201 | 0.01270478 |
| HS [53] | 0.0511540 | 0.349871 | 12.076432 | 0.01267060 |
5.4. Speed Reducer Design Problem
In this case, the purpose is to minimize the weight of the speed reducer (Figure 11). Seven variables are considered: the face width (x1), module of the teeth (x2), number of teeth in the pinion (x3, a discrete design variable), length of the first shaft between bearings (x4), length of the second shaft between bearings (x5), diameter of the first shaft (x6), and diameter of the second shaft (x7). The mathematical formulation is as follows:
Figure 11.

Speed reducer design problem.
Minimize

f(x) = 0.7854x1x2²(3.3333x3² + 14.9334x3 − 43.0934) − 1.508x1(x6² + x7²) + 7.4777(x6³ + x7³) + 0.7854(x4x6² + x5x7²) | (13) |

subject to

g1 = 27/(x1x2²x3) − 1 ≤ 0, g2 = 397.5/(x1x2²x3²) − 1 ≤ 0, g3 = 1.93x4³/(x2x3x6⁴) − 1 ≤ 0, g4 = 1.93x5³/(x2x3x7⁴) − 1 ≤ 0, g5 = √((745x4/(x2x3))² + 16.9 × 10⁶)/(110x6³) − 1 ≤ 0, g6 = √((745x5/(x2x3))² + 157.5 × 10⁶)/(85x7³) − 1 ≤ 0, g7 = x2x3/40 − 1 ≤ 0, g8 = 5x2/x1 − 1 ≤ 0, g9 = x1/(12x2) − 1 ≤ 0, g10 = (1.5x6 + 1.9)/x4 − 1 ≤ 0, g11 = (1.1x7 + 1.9)/x5 − 1 ≤ 0 | (14) |

Variable range

2.6 ≤ x1 ≤ 3.6, 0.7 ≤ x2 ≤ 0.8, 17 ≤ x3 ≤ 28, 7.3 ≤ x4 ≤ 8.3, 7.3 ≤ x5 ≤ 8.3, 2.9 ≤ x6 ≤ 3.9, 5.0 ≤ x7 ≤ 5.5 | (15) |
Compared to AO, PSO, AOA, GA, SCA [55], HS, and FA [56], RLTLBO achieves better results in the speed reducer problem, as shown in Table 12.
Table 12.
Comparison results for the speed reducer design problem.
| Algorithm | Optimum variables | | | | | | | Optimum weight |
|---|---|---|---|---|---|---|---|---|
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | |
| RLTLBO | 3.497600 | 0.7000 | 17.0000 | 7.30000 | 7.800000 | 3.350060 | 5.285530 | 2995.43740 |
| AO [15] | 3.502100 | 0.7000 | 17.0000 | 7.30990 | 7.747600 | 3.364100 | 5.299400 | 3007.73280 |
| PSO [11] | 3.500100 | 0.7000 | 17.0002 | 7.51770 | 7.783200 | 3.350800 | 5.286700 | 3145.92200 |
| AOA [9] | 3.503840 | 0.7000 | 17.0000 | 7.30000 | 7.729330 | 3.356490 | 5.286700 | 2997.91570 |
| GA [6] | 3.510253 | 0.7000 | 17.0000 | 8.35000 | 7.800000 | 3.362201 | 5.287723 | 3067.56100 |
| SCA [55] | 3.508755 | 0.7000 | 17.0000 | 7.30000 | 7.800000 | 3.461020 | 5.289213 | 3030.56300 |
| HS [53] | 3.520124 | 0.7000 | 17.0000 | 8.37000 | 7.800000 | 3.366970 | 5.288719 | 3029.00200 |
| FA [56] | 3.507495 | 0.7001 | 17.0000 | 7.719674 | 8.080854 | 3.351512 | 5.287051 | 3010.13749 |
5.5. Three-Bar Truss Design Problem
The three-bar truss design problem aims to minimize the weight of a truss with three bars by adjusting the cross-sectional areas of the bars (A1, A2, and A3) (Figure 12). Three main constraints need to be satisfied: deflection, stress, and buckling. The mathematical form of this problem is as follows:
Minimize

f(x1, x2) = (2√2x1 + x2) · l

subject to

g1 = ((√2x1 + x2)/(√2x1² + 2x1x2))P − σ ≤ 0, g2 = (x2/(√2x1² + 2x1x2))P − σ ≤ 0, g3 = (1/(x1 + √2x2))P − σ ≤ 0 | (16) |

Variable range: 0 ≤ x1, x2 ≤ 1, where l = 100 cm, P = 2 kN/cm², and σ = 2 kN/cm².
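The weight objective is simple enough to check by hand or in code. The sketch below (assuming the usual convention that the two symmetric outer bars share one area variable, x1 = A1 = A3 and x2 = A2) reproduces the SSA weight reported in Table 13:

```python
import math

def truss_weight(x1, x2, l=100.0):
    """Weight of the three-bar truss; x1 = A1 = A3, x2 = A2 (cm^2)."""
    return (2.0 * math.sqrt(2.0) * x1 + x2) * l

# SSA design from Table 13: weight matches the tabulated 263.89584
w = truss_weight(0.78866541, 0.408275784)
```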
Figure 12.

Three-bar truss design problem.
The result of RLTLBO is listed in Table 13 alongside those of AO, SSA, AOA, MVO, and GOA [57]. It can be observed that RLTLBO outperforms the other algorithms in the literature.
Table 13.
Comparison results for the three-bar truss design problem.
| Algorithm | Optimum variables | | Optimum weight |
|---|---|---|---|
| | x1 | x2 | |
| RLTLBO | 0.78842 | 0.40811 | 263.8523 |
| AO [15] | 0.7926 | 0.3966 | 263.8684 |
| SSA [12] | 0.78866541 | 0.408275784 | 263.89584 |
| AOA [9] | 0.79369 | 0.39426 | 263.9154 |
| MVO [52] | 0.78860276 | 0.40845307 | 263.8958499 |
| GOA [57] | 0.788897555578973 | 0.407619570115153 | 263.895881496069 |
5.6. Car Crashworthiness Design Problem
The car crashworthiness design problem aims to minimize the weight by optimizing eleven influence variables [58], including the thickness of B-Pillar inner (x1), B-pillar reinforcement (x2), floor side inner (x3), cross members (x4), door beam (x5), door beltline reinforcement (x6) and roof rail (x7), materials of B-Pillar inner (x8) and floor side inner (x9), barrier height (x10), and barrier hitting position (x11). This problem can be formulated as follows.
The weight objective (17), the safety constraints (18), and the variable ranges (19) follow the formulation given in [58].
RLTLBO, DE, GA, FA, CS [59], GOA, and EOBL-GOA [58] are applied to solve the car crashworthiness problem. As shown in Table 14, the proposed RLTLBO achieves the best result among the compared methods.
Table 14.
Comparison results for the car crashworthiness design problem.
| Algorithm | RLTLBO | DE [7] | GA [6] | FA [56] | CS [59] | GOA [57] | EOBL-GOA [58] |
|---|---|---|---|---|---|---|---|
| x1 | 0.50000 | 0.50000 | 0.50005 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |
| x2 | 1.11621 | 1.11670 | 1.28017 | 1.36000 | 1.11643 | 1.11670 | 1.11643 |
| x3 | 0.50000 | 0.50000 | 0.50001 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |
| x4 | 1.30215 | 1.30208 | 1.03302 | 1.20200 | 1.30208 | 1.30208 | 1.30208 |
| x5 | 0.50000 | 0.50000 | 0.50001 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |
| x6 | 1.50000 | 1.50000 | 0.50000 | 1.12000 | 1.50000 | 1.50000 | 1.50000 |
| x7 | 0.50000 | 0.50000 | 0.50000 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |
| x8 | 0.34500 | 0.34500 | 0.34994 | 0.34500 | 0.34500 | 0.34500 | 0.34500 |
| x9 | 0.332814 | 0.192000 | 0.192000 | 0.192000 | 0.192000 | 0.192000 | 0.192000 |
| x10 | −19.58840 | −19.54935 | 10.31190 | 8.87307 | −19.54935 | −19.54935 | −19.54935 |
| x11 | 0.019066 | −0.004310 | 0.001670 | −18.998080 | −0.004310 | −0.004310 | −0.004310 |
| Optimal weight | 22.84240 | 22.84298 | 22.85653 | 22.84298 | 22.84294 | 22.84474 | 22.84294 |
5.7. Tubular Column Design Problem
The main intention is to find the minimum cost of a uniform tubular column that can carry a compressive load P = 2,500 kgf. The column is made of a material with a yield stress (σy) of 500 kgf/cm², a modulus of elasticity (E) of 0.85 × 10⁶ kgf/cm², and a density (ρ) of 0.0025 kgf/cm³. The length (L) of the column is 250 cm. The cost of the column consists of material and construction costs. This problem is shown in Figure 13, and the optimization model is as follows.
Figure 13.

Tubular column design problem [59].
Minimize

f(d, t) = 9.82dt + 2d

subject to

g1 = P/(πdtσy) − 1 ≤ 0, g2 = 8PL²/(π³Edt(d² + t²)) − 1 ≤ 0, g3 = 2.0/d − 1 ≤ 0, g4 = d/14 − 1 ≤ 0, g5 = 0.2/t − 1 ≤ 0, g6 = t/0.8 − 1 ≤ 0 | (20) |

where d is the mean diameter and t the thickness of the tubular section.
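Taking the cost as 9.82dt + 2d, which reproduces the tabulated values, the optimum can be verified in a few lines. The sketch below (an illustration; only the stress and buckling constraints are coded, and both are nearly active at the optimum) evaluates the CS design from Table 15:

```python
import math

# Problem constants from the text
P, SIGMA_Y, E, L = 2500.0, 500.0, 0.85e6, 250.0

def column_cost(d, t):
    """Material plus construction cost of the uniform tubular column."""
    return 9.82 * d * t + 2.0 * d

def column_constraints(d, t):
    """Normalized stress (g1) and buckling (g2) constraints, g_i <= 0."""
    return [
        P / (math.pi * d * t * SIGMA_Y) - 1.0,
        8.0 * P * L**2 / (math.pi**3 * E * d * t * (d**2 + t**2)) - 1.0,
    ]

# CS design from Table 15: cost matches the tabulated 26.53217
cost = column_cost(5.45139, 0.29196)
```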
From the comparison results in Table 15, we can see that RLTLBO can obtain superior optimal cost compared to mGWO, DSCA, HOA, AO, HHO, and CS.
Table 15.
Comparison results for the tubular column design problem.
| Algorithm | Optimum variables | Optimum cost | |
|---|---|---|---|
| d | t | ||
| RLTLBO | 5.45120 | 0.29196 | 26.53130 |
| mGWO | 5.45080 | 0.29201 | 26.53270 |
| DSCA | 5.50250 | 0.29214 | 26.79030 |
| HOA | 5.26260 | 0.35487 | 28.86470 |
| AO | 5.46300 | 0.29656 | 26.83540 |
| HHO | 5.44380 | 0.29313 | 26.55820 |
| CS [59] | 5.45139 | 0.29196 | 26.53217 |
5.8. Frequency-Modulated Sound Waves Design Problem
This problem aims to optimize the six parameters of a frequency-modulated (FM) sound wave synthesizer [60]. The parameter vector X = {a1, ω1, a2, ω2, a3, ω3} defines a sound wave, where ai (i = 1, 2, 3) are the amplitudes and ωi (i = 1, 2, 3) are the angular frequencies. The minimum value of this problem is f(X) = 0. The objective function is the sum of squared errors between the target wave and the estimated wave. This problem is modeled as follows.
Minimize

f(X) = ∑t=0…100 (y(t) − y0(t))² | (21) |

where

y(t) = a1 sin(ω1tθ + a2 sin(ω2tθ + a3 sin(ω3tθ))), y0(t) = 1.0 sin(5.0tθ − 1.5 sin(4.8tθ + 2.0 sin(4.9tθ))), and θ = 2π/100, with each parameter searched in [−6.4, 6.35]. | (22) |
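The objective is deterministic, so it can be coded and spot-checked directly. The sketch below (an illustration using the standard FM parameter-estimation formulation) builds the estimated and target waves with the same generator; evaluating the objective at the target parameter vector itself gives exactly zero error:

```python
import math

THETA = 2.0 * math.pi / 100.0  # sampling step theta = 2*pi/100

def fm_wave(a1, w1, a2, w2, a3, w3, t):
    """Three-operator frequency-modulated sound wave y(t)."""
    return a1 * math.sin(w1 * t * THETA
                         + a2 * math.sin(w2 * t * THETA
                                         + a3 * math.sin(w3 * t * THETA)))

def fm_error(params):
    """Sum of squared errors between the estimated and the target wave."""
    a1, w1, a2, w2, a3, w3 = params
    err = 0.0
    for t in range(101):
        y = fm_wave(a1, w1, a2, w2, a3, w3, t)
        y0 = fm_wave(1.0, 5.0, -1.5, 4.8, 2.0, 4.9, t)  # target wave
        err += (y - y0) ** 2
    return err

# The target parameter vector reproduces the target wave exactly
zero = fm_error([1.0, 5.0, -1.5, 4.8, 2.0, 4.9])  # -> 0.0
```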
The RLTLBO is compared with the GWO, MFO [61], PSO, TSA [62], and FFA [63] algorithms, and the comparison results are listed in Table 16. The proposed method clearly finds a much better solution than the comparative algorithms.
Table 16.
Comparison results for the frequency-modulated sound waves design problem.
| Algorithm | Optimum variables | | | | | | Optimum cost |
|---|---|---|---|---|---|---|---|
| | a1 | ω1 | a2 | ω2 | a3 | ω3 | |
| RLTLBO | −0.97498 | −5.0327 | −1.5640 | −4.7840 | −2.0060 | 4.9055 | 0.21738 |
| GWO [13] | −0.66540 | −0.1684 | 1.5173 | −0.1287 | −4.1335 | −4.8997 | 8.47250 |
| MFO [61] | 0.61410 | 0.0432 | −4.3251 | 4.7923 | 0.8339 | 0.1278 | 11.89690 |
| PSO [11] | −0.58860 | 5.0145 | −3.2779 | −4.9324 | −0.8562 | −0.1476 | 13.18070 |
| TSA [62] | 0.34150 | 4.7881 | 1.4309 | 0.1158 | 0.0975 | 0.5480 | 25.10520 |
| FFA [63] | −0.56270 | 0.0525 | −3.4797 | 4.8930 | 1.1491 | −4.8345 | 17.42910 |
In general, the excellent performance in solving industrial engineering design problems suggests that RLTLBO can be widely used in real-world optimization problems.
6. Conclusion
This study presents an improved teaching-learning-based optimization algorithm (RLTLBO) that incorporates reinforcement learning (RL) and random opposition-based learning (ROBL) strategies. To remedy the insufficient learning process of the basic algorithm, a new learning mode considering the effect of the teacher is introduced in the learner phase. Switching between this new mode and the inherent learning mode is governed by the Q-learning mechanism of RL. This mechanism helps the individuals learn more thoroughly, accelerating the convergence of RLTLBO. To improve local optima avoidance, the ROBL strategy is appended after both the teacher and learner phases. The proposed RLTLBO algorithm is tested on 23 standard and eight CEC2017 benchmark functions to analyze its search performance. Experimental results show that it is competitive with other state-of-the-art meta-heuristic algorithms. To further verify the superiority of RLTLBO, eight industrial engineering design problems are solved, and the results are again highly competitive with those of the comparative algorithms.
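The Q-learning switch summarized above can be illustrated with a minimal sketch (the two states, the reward of 1 for an improving step, and the parameter values here are simplifying assumptions for illustration, not the released implementation):

```python
import random

def choose_mode(q_table, state, epsilon=0.1):
    """Pick one of the two learner-phase modes, epsilon-greedily."""
    if random.random() < epsilon or q_table[state][0] == q_table[state][1]:
        return random.randint(0, 1)          # explore (or break a tie)
    return 0 if q_table[state][0] > q_table[state][1] else 1

def update_q(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update of the chosen mode's value."""
    best_next = max(q_table[next_state])
    q_table[state][action] += alpha * (reward + gamma * best_next
                                       - q_table[state][action])

# Two states (e.g. "last step improved" / "did not"), two learning modes.
q = [[0.0, 0.0], [0.0, 0.0]]
# A mode-1 step that improved the learner earns a positive reward.
update_q(q, state=0, action=1, reward=1.0, next_state=0)
```

After a few rewarded steps, `choose_mode` increasingly favors whichever learning mode has paid off in that state, which is the switching behavior the algorithm relies on.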
The code for RLTLBO is provided at https://github.com/WangShuang92/RLTLBO and can be applied to more practical problems. However, the algorithm still suffers from premature convergence on several benchmark functions, which can be studied in the future. Moreover, RLTLBO can currently solve only single-objective problems; binary and multiobjective versions are worthwhile directions for future research. Further applications of the algorithm in different fields are also valuable, including text clustering, scheduling problems, appliance management, parameter estimation, feature selection, text classification, image segmentation, network applications, and sentiment analysis.
Acknowledgments
This research was funded by National fund cultivation project of Sanming University (PYS2107 and PYT2105), the Sanming University Introduces High-level Talents to Start Scientific Research Funding Support Project (21YG01S, 20YG01, and 20YG14), Fujian Natural Science Foundation Project (2021J011128), Bidding project for higher education research of Sanming University (SHE2101), the Guiding Science and Technology Projects in Sanming City (2020-S-39, 2020-G-61, and 2021-S-8), the Educational Research Projects of Young and Middle-aged Teachers in Fujian Province (JAT200638 and JAT200618), and the Scientific Research and Development Fund of Sanming University (B202029 and B202009), Open Research Fund of Key Laboratory of Agricultural Internet of Things in Fujian Province (ZD2101), Ministry of Education Cooperative Education Project (202002064014), School level education and teaching reform project of Sanming University (J2010306 and J2010305), Higher education research project of Sanming University (SHE2102 and SHE2013), and 2021 project of the 14th Five-year Plan of Education science in Fujian Province (FJJKBK21-138).
Contributor Information
Shuang Wang, Email: wang_shuang9279@163.com.
Heming Jia, Email: jiaheminglucky99@126.com.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
On behalf of all authors, the corresponding author states that there are no conflicts of interest.
References
- 1.Abualigah L., Diabat A. Advances in sine cosine algorithm: a comprehensive survey. Artificial Intelligence Review . 2021;54(4):2567–2608. doi: 10.1007/s10462-020-09909-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Abualigah L., Diabat A. A comprehensive survey of the grasshopper optimization algorithm: results, variants, and applications. Neural Computing & Applications . 2020;32(19) doi: 10.1007/s00521-020-04789-8.15533 [DOI] [Google Scholar]
- 3.Deng W., Zhang X., Zhou Y., et al. An enhanced fast non-dominated solution sorting genetic algorithm for multi-objective problems. Information Sciences . 2022;585:441–453. doi: 10.1016/j.ins.2021.11.052. [DOI] [Google Scholar]
- 4.Kumar Y., Singh P. K. Improved cat swarm optimization algorithm for solving global optimization problems and its application to clustering. Applied Intelligence . 2018;48(9):2681–2697. doi: 10.1007/s10489-017-1096-8. [DOI] [Google Scholar]
- 5.Wu E. Q., Zhou M., Hu D., et al. IEEE Transactions on Cybernetics . IEEE; 2020. Self-paced dynamic infinite mixture model for fatigue evaluation of pilots’ brains; pp. 1–6. [DOI] [PubMed] [Google Scholar]
- 6.Holland J. H. Genetic algorithms. Scientific American . 1992;267(1):66–72. doi: 10.1038/scientificamerican0792-66. [DOI] [Google Scholar]
- 7.Storn R., Price K. Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization . 1997;11(4):341–359. doi: 10.1023/a:1008202821328. [DOI] [Google Scholar]
- 8.Kirkpatrick S., Gelatt C. D., Vecchi M. P. Optimization by simulated annealing. Science . 1983;220(4598):671–680. doi: 10.1126/science.220.4598.671. [DOI] [PubMed] [Google Scholar]
- 9.Abualigah L., Diabat A., Mirjalili S., Abd Elaziz M., Gandomi A. H. The arithmetic optimization algorithm. Computer Methods in Applied Mechanics and Engineering . 2021;376 doi: 10.1016/j.cma.2020.113609.113609 [DOI] [Google Scholar]
- 10.Asef F., Majidnezhad V., Feizi-Derakhshi M.-R., Parsa S. Heat transfer relation-based optimization algorithm (HTOA) Soft Computing . 2021;25(13):8129–8158. doi: 10.1007/s00500-021-05734-0. [DOI] [Google Scholar]
- 11.Kennedy J., Eberhart R. Particle swarm optimization. Proceedings of the 1995 IEEE International Conference on Neural Networks, IEEE ICNN; December 1995; Perth, Australia. IEEE; pp. 1942–1948. [DOI] [Google Scholar]
- 12.Mirjalili S., Gandomi A. H., Mirjalili S. Z., Saremi S., Faris H., Mirjalili S. M. Salp swarm algorithm: a bio-inspired optimizer for engineering design problems. Advances in Engineering Software . 2017;114:163–191. doi: 10.1016/j.advengsoft.2017.07.002. [DOI] [Google Scholar]
- 13.Mirjalili S., Mirjalili S. M., Lewis A. Grey wolf optimizer. Advances in Engineering Software . 2014;69:46–61. doi: 10.1016/j.advengsoft.2013.12.007. [DOI] [Google Scholar]
- 14.Mirjalili S., Lewis A. The whale optimization algorithm. Advances in Engineering Software . 2016;95:51–67. doi: 10.1016/j.advengsoft.2016.01.008. [DOI] [Google Scholar]
- 15.Abualigah L., Yousri D., Elaziz M. A., Ewees A. A., Alqaness M. A. A., Gandomi A. H. Aquila optimizer: a novel meta-heuristic optimization algorithm. Computers & Industrial Engineering . 2021;157 doi: 10.1016/j.cie.2021.107250.107250 [DOI] [Google Scholar]
- 16.Jia H., Peng X., Lang C. Remora optimization algorithm. Expert Systems with Applications . 2021;185:p. 115665. doi: 10.1016/j.eswa.2021.115665. [DOI] [Google Scholar]
- 17.Rao R. V., Savsani V. J., Vakharia D. P. Teaching–learning-based optimization: a novel method for constrained mechanical design optimization problems. Computer-Aided Design . 2011;43:303–315. doi: 10.1016/j.cad.2010.12.015. [DOI] [Google Scholar]
- 18.Aouf A., Boussaid L., Sakly A. TLBO-based adaptive neurofuzzy controller for mobile robot navigation in a strange environment. Computational Intelligence and Neuroscience . 2018;2018:8. doi: 10.1155/2018/3145436.3145436 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Singh M., Panigrahi B. K., Abhyankar A. R. Optimal coordination of directional over-current relays using teaching learning-based optimization (TLBO) algorithm. International Journal of Electrical Power & Energy Systems . 2013;50:33–41. doi: 10.1016/j.ijepes.2013.02.011. [DOI] [Google Scholar]
- 20.Gonzalez-Alvarez D. L., Vega-Rodriguez M. A., Gomez-Pulido J. A., Sanchez-Perez J. M. Multiobjective teaching-learning-based optimization (MO-TLBO) for motif finding. Proceedings of the IEEE International Conference on Intelligence and Informatics CINTI’12; November 2012; Budapest Hungary. IEEE; pp. 1–6. [DOI] [Google Scholar]
- 21.Zou F., Chen D., Xu Q. A survey of teaching–learning-based optimization. Neurocomputing . 2018;335:366–383. doi: 10.1016/j.neucom.2018.06.076. [DOI] [Google Scholar]
- 22.Kumar Y., Singh P. K. A chaotic teaching learning based optimization algorithm for clustering problems. Applied Intelligence . 2019;49:1036–1062. doi: 10.1007/s10489-018-1301-4. [DOI] [Google Scholar]
- 23.Taheri A., Rahimizadeh K., Rao R. V. An efficient balanced teaching-learning-based optimization algorithm with individual restarting strategy for solving global optimization problems. Information Sciences . 2021;576:68–104. doi: 10.1016/j.ins.2021.06.064. [DOI] [Google Scholar]
- 24. Ma Y., Zhang X., Song J., Chen L. A modified teaching-learning-based optimization algorithm for solving optimization problem. Knowledge-Based Systems. 2020;212:106599. doi: 10.1016/j.knosys.2020.106599.
- 25. Xu Y., Yang Z., Li X., Kang H., Yang X. Dynamic opposite learning enhanced teaching-learning-based optimization. Knowledge-Based Systems. 2020;188:104966. doi: 10.1016/j.knosys.2019.104966.
- 26. Dong H., Wang P., Song B. Kriging-assisted teaching-learning-based optimization (KTLBO) to solve computationally expensive constrained problems. Information Sciences. 2020;556:404–435. doi: 10.1016/j.ins.2020.09.073.
- 27. Ren Z., Jiang R., Yang F., Qiu J. A multi-objective elitist feedback teaching-learning-based optimization algorithm and its application. Expert Systems with Applications. 2021;188:115972. doi: 10.1016/j.eswa.2021.115972.
- 28. Zhang Y., Jin Z., Chen Y. Hybrid teaching-learning-based optimization and neural network algorithm for engineering design optimization problems. Knowledge-Based Systems. 2020;187:104836. doi: 10.1016/j.knosys.2019.07.007.
- 29. Lakshmi A. V., Mohanaiah P. WOA-TLBO: whale optimization algorithm with teaching-learning-based optimization for global optimization and facial emotion recognition. Applied Soft Computing. 2021;110:107623. doi: 10.1016/j.asoc.2021.107623.
- 30. Heidari A. A., Mirjalili S., Faris H., Aljarah I., Mafarja M., Chen H. Harris hawks optimization: algorithm and applications. Future Generation Computer Systems. 2019;97:849–872. doi: 10.1016/j.future.2019.02.028.
- 31. Miarnaeimi F., Azizyan G., Rashki M. Horse herd optimization algorithm: a nature-inspired algorithm for high-dimensional optimization problems. Knowledge-Based Systems. 2020;213:106711. doi: 10.1016/j.knosys.2020.106711.
- 32. Gupta S., Deep K. A memory-based grey wolf optimizer for global optimization tasks. Applied Soft Computing. 2020;93:106367. doi: 10.1016/j.asoc.2020.106367.
- 33. Wang S., Sun K., Zhang W., Jia H. Multilevel thresholding using a modified ant lion optimizer with opposition-based learning for color image segmentation. Mathematical Biosciences and Engineering. 2021;18:3092–3143. doi: 10.3934/mbe.2021155.
- 34. Li Y., Zhao Y., Liu J. Dynamic sine cosine algorithm for large-scale global optimization problems. Expert Systems with Applications. 2021;177:114950. doi: 10.1016/j.eswa.2021.114950.
- 35. Talbi E. Machine learning into metaheuristics: a survey and taxonomy of data-driven metaheuristics. ACM Computing Surveys. 2020;54:1–32. doi: 10.1145/3459664.
- 36. Drugan M. Reinforcement learning versus evolutionary computation: a survey on hybrid algorithms. Swarm and Evolutionary Computation. 2019;44:228–246. doi: 10.1016/j.swevo.2018.03.011.
- 37. Lingam G., Rout R. R., Somayajulu D. V. L. N. Adaptive deep Q-learning model for detecting social bots and influential users in online social networks. Applied Intelligence. 2019;49:3947–3964. doi: 10.1007/s10489-019-01488-3.
- 38. Liu F., Zeng G. Study of genetic algorithm with reinforcement learning to solve the TSP. Expert Systems with Applications. 2009;36:6995–7001. doi: 10.1016/j.eswa.2008.08.026.
- 39. Samma H., Mohamad-Saleh J., Suandi S. A., Lahasan B. Q-learning-based simulated annealing algorithm for constrained engineering design problems. Neural Computing & Applications. 2020;32:5147–5161. doi: 10.1007/s00521-019-04008-z.
- 40. Chen Q., Huang M., Xu Q., Wang H., Wang J. Reinforcement learning-based genetic algorithm in optimizing multidimensional data discretization scheme. Mathematical Problems in Engineering. 2020;20:1–13. doi: 10.1155/2020/1698323.
- 41. Xu Y., Pi D. A reinforcement learning-based communication topology in particle swarm optimization. Neural Computing & Applications. 2020;32:1–26. doi: 10.1007/s00521-019-04527-9.
- 42. Emary E., Zawbaa H. M., Grosan C. Experienced gray wolf optimization through reinforcement learning and neural networks. IEEE Transactions on Neural Networks and Learning Systems. 2019;29:681–694. doi: 10.1109/TNNLS.2016.2634548.
- 43. Ghafoorian M., Taghizadeh N., Beigy H. Automatic abstraction in reinforcement learning using ant system algorithm. Proceedings of the AAAI Spring Symposium Series; March 2017; Stanford, CA, USA. pp. 9–14.
- 44. Seyyedabbasi A., Aliyev R., Kiani F., Gulle M. U., Shah M. A. Hybrid algorithms based on combining reinforcement learning and metaheuristic methods to solve global optimization problems. Knowledge-Based Systems. 2021;222:107044. doi: 10.1016/j.knosys.2021.107044.
- 45. Tizhoosh H. Opposition-based learning: a new scheme for machine intelligence. Proceedings of the International Conference on Computational Intelligence for Modelling; November 2005; Vienna, Austria. pp. 695–701.
- 46. Long W., Jiao J., Liang X., Cai S., Xu M. A random opposition-based learning grey wolf optimizer. IEEE Access. 2019;7:113810. doi: 10.1109/access.2019.2934994.
- 47. Yao X., Liu Y., Lin G. Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation. 1999;3:82–102. doi: 10.1109/4235.771163.
- 48. Derrac J., García S., Molina D., Herrera F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation. 2011;1:3–18. doi: 10.1016/j.swevo.2011.02.002.
- 49. Awad N. H., Ali M. Z., Suganthan P. N., Liang J. J., Qu B. Y. Problem definitions and evaluation criteria for the CEC2017 special session and competition on single objective real-parameter numerical optimization. IEEE Congress on Evolutionary Computation; 2016.
- 50. Li S. M., Chen H. L., Wang M. J., Heidari A. A., Mirjalili S. Slime mould algorithm: a new method for stochastic optimization. Future Generation Computer Systems. 2020;111:300–323. doi: 10.1016/j.future.2020.03.055.
- 51. Faramarzi A., Heidarinejad M., Mirjalili S., Gandomi A. H. Marine predators algorithm: a nature-inspired metaheuristic. Expert Systems with Applications. 2020;152:113377. doi: 10.1016/j.eswa.2020.113377.
- 52. Mirjalili S., Mirjalili S. M., Hatamlou A. Multi-verse optimizer: a nature-inspired algorithm for global optimization. Neural Computing & Applications. 2015;27:495–513. doi: 10.1007/s00521-015-1870-7.
- 53. Geem Z. W., Kim J. H., Loganathan G. V. A new heuristic optimization algorithm: harmony search. SIMULATION. 2001;76:60–68. doi: 10.1177/003754970107600201.
- 54. Rechenberg I. Evolutionsstrategien. Berlin, Germany: Springer Berlin Heidelberg; 1978. pp. 83–114.
- 55. Mirjalili S. SCA: a sine cosine algorithm for solving optimization problems. Knowledge-Based Systems. 2016;96:120–133. doi: 10.1016/j.knosys.2015.12.022.
- 56. Baykasoğlu A., Ozsoydan F. B. Adaptive firefly algorithm with chaos for mechanical design optimization problems. Applied Soft Computing. 2015;36:152–164. doi: 10.1016/j.asoc.2015.06.056.
- 57. Saremi S., Mirjalili S., Lewis A. Grasshopper optimisation algorithm: theory and application. Advances in Engineering Software. 2017;105:30–47. doi: 10.1016/j.advengsoft.2017.01.004.
- 58. Yildiz B. S., Pholdee N., Bureerat S., Yildiz A. R., Sait S. M. Enhanced grasshopper optimization algorithm using elite opposition-based learning for solving real-world engineering problems. Engineering with Computers. 2021. doi: 10.1007/s00366-021-01368-w.
- 59. Gandomi A. H., Yang X. S., Alavi A. H. Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems. Engineering with Computers. 2013;29:17–35. doi: 10.1007/s00366-011-0241-y.
- 60. Abdollahzadeh B., Gharehchopogh F. S., Mirjalili S. Artificial gorilla troops optimizer: a new nature-inspired metaheuristic algorithm for global optimization problems. International Journal of Intelligent Systems. 2021;36:1–72. doi: 10.1002/int.22535.
- 61. Mirjalili S. Moth-flame optimization algorithm: a novel nature-inspired heuristic paradigm. Knowledge-Based Systems. 2015;89:228–249. doi: 10.1016/j.knosys.2015.07.006.
- 62. Kaur S., Awasthi L. K., Sangal A. L., Dhiman G. Tunicate swarm algorithm: a new bio-inspired based metaheuristic paradigm for global optimization. Engineering Applications of Artificial Intelligence. 2020;90:1–29. doi: 10.1016/j.engappai.2020.103541.
- 63. Shayanfar H., Gharehchopogh F. S. Farmland fertility: a new metaheuristic algorithm for solving continuous optimization problems. Applied Soft Computing. 2018;71:728–746. doi: 10.1016/j.asoc.2018.07.033.
Data Availability Statement
The data used to support the findings of this study are available from the corresponding author upon request.
