Scientific Reports. 2025 Nov 6;15:38864. doi: 10.1038/s41598-025-22730-8

An improved differential evolution algorithm based on reinforcement learning and its application

Guangwei Yang 1,2, Peng Sun 1, Jieyong Zhang 1, Yongzhuang Zhang 1, Tianxin Li 1
PMCID: PMC12592545  PMID: 41198770

Abstract

As a typical swarm intelligence optimization method, the Differential Evolution (DE) algorithm exhibits excellent performance in solving high-dimensional complex problems; however, its parameter sensitivity and premature convergence still restrict its practical effectiveness. This paper therefore proposes an improved Differential Evolution algorithm based on reinforcement learning, named RLDE. First, it adopts the Halton sequence to initialize the population uniformly over the search space, which effectively improves the ergodicity of the initial solution set. Second, it establishes a dynamic parameter adjustment mechanism based on a policy gradient network, realizing online adaptive optimization of the scaling factor and crossover probability through a reinforcement learning framework. Furthermore, it classifies the population according to individual fitness values and applies differentiated mutation strategies. To verify the effectiveness of the proposed algorithm, 26 standard test functions are used for optimization testing, with comparisons against multiple heuristic optimization algorithms in 10, 30, and 50 dimensions. Experimental results demonstrate that the proposed algorithm significantly enhances global optimization performance. Finally, by modeling and solving the Unmanned Aerial Vehicle (UAV) task assignment problem, the practical engineering value of the algorithm in real-world scenarios is verified across multiple evaluation indicators.

Keywords: Differential evolution algorithm, Halton sequence, Reinforcement learning, Policy gradient network, Hierarchical sorting, Task assignment

Subject terms: Engineering, Mathematics and computing

Introduction

Currently, complex optimization problems in interdisciplinary fields [1,2] have become a hot research topic. As an intelligent solution paradigm, heuristic optimization algorithms provide innovative approaches for solving high-dimensional nonlinear optimization problems. Research on heuristic algorithms has achieved dual breakthroughs in theoretical advancement and engineering applications, emerging as a key direction of vigorous development in the field of computational intelligence. For instance, the Genetic Algorithm (GA) [3] simulates evolution through gene crossover and mutation; Particle Swarm Optimization (PSO) [4] is based on group collaborative search; and the Artificial Bee Colony (ABC) [5] achieves optimization through the division of labor in bee colonies. Additionally, numerous new meta-heuristic algorithms are emerging, such as Simulated Annealing (SA) [6], Biogeography-Based Optimization (BBO) [7-9], the Neural Dynamics Optimization Algorithm (CNDMS) [10], the Butterfly Optimization Algorithm (BOA) [11], the Chimp Optimization Algorithm (CHOA) [12], the Grey Wolf Optimizer (GWO) [13], the Sparrow Search Algorithm (SSA) [14], and Atomic Orbital Search (AOS) [15], among others.

Differential Evolution (DE) is a swarm intelligence optimization algorithm proposed by Storn and Price [16], primarily designed to solve complex optimization problems in continuous spaces. Its operating mechanism involves generating candidate solutions through differential mutation of population individuals, and gradually searching for the optimal solution by combining crossover operators and selection strategies. Due to its simple structure, strong robustness, and high convergence efficiency, DE has been widely applied in fields such as engineering optimization design, machine learning parameter optimization, and financial model construction. However, the DE algorithm has inherent mechanistic drawbacks: its parameter settings and evolutionary strategies rely on experience, lacking the capability of adaptive adjustment in dynamic environments. When dealing with complex scenarios such as strongly coupled nonlinearity and high-dimensional multimodality, it tends to suffer declining population diversity and limited global exploration capability, which manifests as search stagnation in the later stages of iteration, premature trapping in local optima, and difficulty in escaping from extreme value traps.

To address the bottlenecks of the algorithm, researchers have derived a series of new algorithm variants in recent years by introducing strategies such as new operation operators and dynamic parameter adjustment mechanisms. Meanwhile, technologies like reinforcement learning have also been applied to the optimization of evolutionary algorithms. Song et al. [17] designed an adaptive co-evolutionary differential evolution algorithm with a dynamic hybrid mechanism, QGDECC. In QGDECC, the quantum variable decomposition strategy utilizes qubit strings to adaptively decompose variables according to co-evolutionary performance, making full use of the searched evolutionary information. Deng et al. [18] proposed an improved adaptive DE algorithm, ACDE. ACDE designed a generalized learning strategy based on opposition, developed a parameter adaptive adjustment strategy, and introduced the ideas of cultural algorithms and different mutation strategies into the belief space to balance global exploration capability and local optimization capability. Zeng et al. [19] proposed an improved differential evolution algorithm, SLDE. This method constructs an external archive to store discarded trial vectors and periodically adds vectors from the external archive to the population during iteration, which periodically increases population diversity and enhances the algorithm’s ability to escape local optima. Chai et al. [20] proposed a multi-strategy fusion differential evolution algorithm, MSFDE, which integrates multi-population strategies, novel adaptive strategies, and interactive mutation strategies to balance exploitation and exploration capabilities. Zhou et al. [21] proposed an adaptive differential evolution algorithm with a dynamic opposition-based learning strategy, DOLADE. In DOLADE, the opposition-based learning method expands the current elite group and poorly performing group, improving the exploration ability of particles.
Yin et al. [22] developed a reinforcement learning-based parameter adaptive method (RLAM), which enhances the convergence of PSO by designing a network to control the coefficients of PSO. Chen et al. [23] proposed a self-learning genetic algorithm (SLGA), in which the genetic algorithm (GA) is adopted as the basic optimization method, and its key parameters are intelligently adjusted based on reinforcement learning (RL).

To overcome the premature convergence problem of the differential evolution algorithm and enhance its optimization performance, this paper proposes a reinforcement learning-based hierarchical differential evolution algorithm, RLDE. First, the Halton sequence method is used to generate the population, obtaining a uniform initial population solution. Then, reinforcement learning is introduced, and parameters are adaptively adjusted through a policy gradient network. On this basis, a hierarchical mutation mechanism is designed. After evolution, all solutions are sorted by fitness value, and different strategies are adopted for different groups to retain better solutions and improve poorer ones. By comparing with other intelligent optimization algorithms, the performance of the proposed improved differential evolution algorithm is verified on 26 benchmark functions. Meanwhile, the algorithm is applied to the UAV cluster task assignment problem. Experimental results show that the algorithm proposed in this paper outperforms the compared algorithms, with significant engineering value.

The remainder of this paper is organized as follows: Section “Fundamental theories” briefly introduces the basic DE algorithm and reinforcement learning; Section “An improved differential evolution algorithm based on reinforcement learning: RLDE” elaborates on the proposed algorithm; Section “Experimental design and result analysis” conducts experiments on standard test functions and analyzes the comparison between the improved algorithm and six other heuristic algorithms; Section “Application of RLDE in UAV task assignment” carries out UAV task assignment experiments and summarizes the analysis; Section “Conclusion” summarizes the research conclusions of this paper and puts forward future research directions.

Fundamental theories

The parameter sensitivity issue of the DE algorithm requires a dynamic adjustment mechanism. Reinforcement Learning (RL), which achieves strategy optimization through the interaction between an agent and the environment, can adapt to the evolutionary state of DE in real time. Meanwhile, the population evolution process of DE can serve as the interaction environment for RL; integrating the two can compensate for the inherent shortcomings of DE. Therefore, this section first introduces the fundamental theories of DE and RL.

Basic differential evolution algorithm

The basic idea of the Differential Evolution algorithm is to perform mutation, crossover, and selection on a randomly generated initial population, and realize the survival of the fittest through continuous iteration and update. The design flow chart is shown in Fig. 1.

Fig. 1. Flowchart of the Standard Differential Evolution.

Population initialization

The DE algorithm adopts real-number encoding and generates a randomly initialized population within the feasible solution space. It randomly initializes D-dimensional parameter vectors x with a population size of NP:

x_{i,j}(0) = x_j^{L} + \text{rand}(0,1)\cdot\left(x_j^{U} - x_j^{L}\right), \quad i = 1,\ldots,NP,\; j = 1,\ldots,D \qquad (1)

In Eq. (1), x_{i,j}(0) represents the initial value of the j-th dimension of the i-th individual, rand(0,1) is a random number within the interval [0, 1], and x_j^{U} and x_j^{L} denote the upper and lower bounds of the j-th dimension of the individual, respectively.

Mutation

Mutation is the core operation of the differential evolution algorithm, which can generate more excellent individuals in the population. After population initialization, a mutant individual v_i(t) is generated for each individual x_i(t) through the mutation operation. The traditional mutation strategy DE/rand/1 is expressed as shown in Eq. (2):

v_i(t) = x_{r_1}(t) + F\cdot\left(x_{r_2}(t) - x_{r_3}(t)\right) \qquad (2)

In Eq. (2), r1, r2, and r3 are three mutually distinct random indices from {1, …, NP} with r1 ≠ r2 ≠ r3 ≠ i, which requires that NP ≥ 4. The scaling factor F ranges over [0, 2]: too small an F may lead to trapping in local optima, while too large an F may make convergence difficult. In addition, during the development of the algorithm, other mutation strategies such as DE/rand/2, DE/best/1, DE/best/2, and DE/current-to-best/1 have also been proposed.

Crossover

To introduce new individuals into the population and increase population diversity, a crossover operation is required:

u_{i,j}(t) = \begin{cases} v_{i,j}(t), & \text{rand}(0,1) \le CR \ \text{or} \ j = k \\ x_{i,j}(t), & \text{otherwise} \end{cases} \qquad (3)

In Eq. (3), rand(0,1) is a random number, and CR is the crossover probability factor in the range [0, 1], which controls how many components are taken from the mutant vector. k is an integer uniformly distributed over the dimensions of the solution space, ensuring that at least one dimensional component is derived from the mutant vector.

Selection

After the mutation and crossover operations, one of the trial individual u_i(t) and the target vector x_i(t) is selected to become the individual for the next iteration. Both are evaluated with the fitness function: when the fitness of the trial individual is less than that of the target vector, the trial individual replaces the original one; otherwise, the original is retained for the next iteration. The selection is shown in Eq. (4):

x_i(t+1) = \begin{cases} u_i(t), & f(u_i(t)) < f(x_i(t)) \\ x_i(t), & \text{otherwise} \end{cases} \qquad (4)
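Eqs. (1)-(4) together define one generation of the standard algorithm. The loop below is a minimal sketch of DE/rand/1/bin; the function name and all defaults (NP, F, CR, number of generations) are illustrative, not the paper's settings.

```python
import numpy as np

def de_minimize(f, lo, hi, NP=20, F=0.5, CR=0.9, G=200, seed=0):
    """Minimal DE/rand/1/bin sketch following Eqs. (1)-(4)."""
    rng = np.random.default_rng(seed)
    n = len(lo)
    # Eq. (1): random initialization inside the box [lo, hi]
    X = lo + rng.random((NP, n)) * (hi - lo)
    fit = np.array([f(x) for x in X])
    for _ in range(G):
        for i in range(NP):
            # Eq. (2): DE/rand/1 with three distinct indices, all != i
            r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
            v = np.clip(X[r1] + F * (X[r2] - X[r3]), lo, hi)
            # Eq. (3): binomial crossover; index k always takes the mutant component
            k = rng.integers(n)
            mask = rng.random(n) <= CR
            mask[k] = True
            u = np.where(mask, v, X[i])
            # Eq. (4): greedy selection for minimization
            fu = f(u)
            if fu < fit[i]:
                X[i], fit[i] = u, fu
    b = int(np.argmin(fit))
    return X[b], fit[b]

best_x, best_f = de_minimize(lambda x: float(np.sum(x**2)),
                             np.full(5, -10.0), np.full(5, 10.0))
```

On a 5-dimensional sphere function this sketch converges close to the optimum within a few thousand evaluations, illustrating why the greedy selection of Eq. (4) makes DE monotonically non-worsening.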

Reinforcement learning

Reinforcement learning is a branch of machine learning that aims to design agents capable of learning optimal behavior strategies through interaction with the environment. Unlike supervised learning and unsupervised learning, reinforcement learning does not rely on labeled training data or explicit objective functions; instead, it learns through feedback from the environment. In the basic process of reinforcement learning, an agent observes the state of the environment, takes actions, and then receives rewards or penalties from the environment to evaluate the quality of the actions. The goal is to enable the agent to learn action sequences that maximize the cumulative reward through interaction with the environment, i.e., to learn the optimal strategy. A key aspect of reinforcement learning is maximizing long-term cumulative rewards rather than merely focusing on the immediate reward for a single action. The training flow chart of reinforcement learning is shown in Fig. 2:

Fig. 2. Flowchart of reinforcement learning.

Reinforcement learning algorithms can be categorized into policy-based and value-based approaches. Policy-based methods directly output the probability of the next action and select actions based on such probabilities, which are suitable for continuous actions. A common method is Policy gradients. Value-based methods output the value of actions and select the action with the highest value, which are applicable to discrete actions. Typical methods include Q-learning, Deep Q Network (DQN), and Sarsa. Actor-Critic combines the two approaches: the Actor network selects actions based on probabilities, while the Critic network evaluates the value of the actions, thereby accelerating the learning process. Common methods include A2C, A3C, and DDPG.

An improved differential evolution algorithm based on reinforcement learning: RLDE

Population initialization via Halton sequence generation

The standard differential evolution algorithm exhibits randomness during population initialization, which may result in the initial population being far from the optimal solution. This leads to a decrease in the convergence speed of the differential evolution algorithm and may even cause it to fall into local optima. To address this issue, the Halton sequence is introduced for population initialization. The Halton sequence is constructed using a deterministic method with prime numbers as bases, generating a population with regularity and ergodicity. This enables the initial population to traverse the entire solution space, ensuring a relatively uniform density distribution of the generated initial population within the solution space. Its common expression is shown in Eq. (5):

H(i) = \left(\phi_{b_1}(i), \phi_{b_2}(i), \ldots, \phi_{b_n}(i)\right) \qquad (5)

In Eq. (5), each dimension is a uniformly distributed radical-inverse sequence with its own base b_n, where the bases b_1, …, b_n are pairwise coprime (typically the first n primes).
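A Halton point set can be generated with the base-b radical inverse; the sketch below (function names are illustrative) builds the sequence of Eq. (5) and scales it into the search box.

```python
import numpy as np

def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    r, f = 0.0, 1.0 / base
    while i > 0:
        r += f * (i % base)
        i //= base
        f /= base
    return r

def halton(num_points, bases):
    """Halton points in [0, 1]^d, one coprime base per dimension (Eq. (5))."""
    return np.array([[radical_inverse(i, b) for b in bases]
                     for i in range(1, num_points + 1)])

def init_population(num_points, lo, hi, bases):
    """Scale Halton points into the search box [lo, hi]."""
    return lo + halton(num_points, bases) * (hi - lo)

pts = halton(100, [2, 3])
pop = init_population(30, np.array([-10.0, -10.0]), np.array([10.0, 10.0]), [2, 3])
```

Because the construction is deterministic, the same population is reproduced on every run, unlike pseudo-random initialization.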

To verify the initialization advantage of the Halton sequence, its performance was compared with that of Latin Hypercube Sampling (LHS) in 10-dimensional and 30-dimensional solution spaces. In the 10-dimensional scenario: the solution space coverage rate of the Halton sequence was 85.2%, while that of LHS was 78.6%; the average distance from the Halton sequence to the optimal solution was 15.79, compared with 18.05 for LHS. In the 30-dimensional scenario: the coverage rate of the Halton sequence was 79.3%, versus 71.2% for LHS; the average distance from the Halton sequence to the optimal solution was 29.01, in contrast to 31.52 for LHS. Figures 3 and 4 show the comparison diagrams of spatial distribution between the Halton sequence and LHS in 10-dimensional and 30-dimensional spaces, respectively. The results in the figures indicate that the Halton sequence exhibits better uniformity, providing a more favorable initial foundation for subsequent evolution.

Fig. 3. Comparison diagram of spatial distribution between the 10-dimensional Halton sequence and Latin Hypercube Sampling (LHS).

Fig. 4. Comparison diagram of spatial distribution between the 30-dimensional Halton sequence and Latin Hypercube Sampling (LHS).

Parameter control based on reinforcement learning

In the differential evolution algorithm, dynamic adjustment of the scaling factor F and crossover probability CR is crucial for alleviating parameter sensitivity and premature convergence. This aligns with the core idea of the Gradient Activation Function (GAF) [24], which mitigates the vanishing gradient problem in deep neural network optimization by dynamically adjusting gradients; both improve optimization efficiency and stability through adaptive regulation of key variables. This paper employs a policy iteration-based reinforcement learning algorithm, the Policy Gradient (PG) algorithm, to adaptively control F and CR, achieving dynamic regulation of the key parameters.

The PG algorithm constructs a policy neural network that takes the state observed by the agent as input, outputs an action probability distribution, and acts on the environment. After receiving the action, the environment updates its internal state and feeds back a reward value. PG then calculates the gradient and loss function based on the reward, updates the policy network parameters via gradient descent, and the environment feeds back a new state as input for the next training iteration. This cycle continues until the maximum number of training iterations is reached or the optimal policy is output. Based on this algorithm, this paper proposes a differential evolution parameter optimization method that realizes adaptive parameter adjustment by constructing a policy neural network that interacts and learns with the DE population. The framework is divided into three parts: dynamic interaction between the agent and the environment, the internal structure of the agent, and the parameter update and optimization process of the differential evolution algorithm, as shown in Fig. 5.

Fig. 5. Framework diagram of training differential evolution parameters with the Policy Gradient algorithm.

Dynamic interaction between agent and environment

The first part of the framework describes the interaction process between the agent and the environment. In this part, the policy neural network serves as the reinforcement learning agent, and the entire differential evolution process acts as the learning environment.

(1) State space design: The state vector needs to comprehensively reflect the evolutionary state of the population. This paper designs a 5-dimensional feature vector, which includes the current optimal fitness value, the average fitness of the population, the standard deviation of fitness, the proportion of iteration progress, and the success rate of parameter adjustment. The definitions of these features are as follows:

Current optimal fitness value (fbest): fbest reflects the quality of the optimal solution in the current population and directly determines the convergence direction of the algorithm. This indicator is the core target of algorithm optimization. If the optimal fitness remains unchanged for a long time, F needs to be increased to enhance exploration. Its calculation method is shown in Eq. (6):

f_{best} = \min_{1 \le i \le NP} f(x_i) \qquad (6)

Population average fitness (fmean): fmean measures the overall performance of the population. A large gap between fmean and fbest indicates that the population contains many inferior solutions. Combined with fbest, it can be judged whether the best solution has advanced while the population as a whole lags behind, in which case CR needs to be adjusted to balance exploration and exploitation. Its calculation method is shown in Eq. (7):

f_{mean} = \frac{1}{NP}\sum_{i=1}^{NP} f(x_i) \qquad (7)

Fitness standard deviation (fstd): fstd quantifies the degree of dispersion in population fitness, reflecting the diversity of the population. A higher fstd indicates good population diversity but slow convergence. A lower fstd implies that the population tends to converge, which may easily lead to falling into local optima. fstd can be directly linked to parameter adjustment strategies: if fstd is too low, it is necessary to increase F or CR to enhance diversity. Its calculation method is shown in Eq. (8):

f_{std} = \sqrt{\frac{1}{NP}\sum_{i=1}^{NP}\left(f(x_i) - f_{mean}\right)^2} \qquad (8)

Iteration progress ratio (tratio): tratio indicates whether the algorithm is in the early, middle, or late stage. When tratio < 0.7, the iteration is in the early stage, and emphasis should be placed on global exploration (increasing F). When tratio > 0.7, the iteration is in the late stage, and focus should be on local exploitation (decreasing F and increasing CR). Its calculation method is shown in Eq. (9):

t_{ratio} = \frac{t}{G_{max}} \qquad (9)

Parameter adjustment success rate (Successrate): Successrate represents the proportion of fitness improvements achieved through parameter adjustments in the latest 10 iterations, reflecting the effectiveness of the current parameter strategy. It is calculated as the proportion of rewards Rt > 0 in the latest 10 iterations.

To avoid the impact of dimensional differences on network training, each of the above features is normalized. The Min–Max method is adopted for normalization, scaling each feature to the range [0, 1]. The specific method is shown in Eq. (10):

x' = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (10)
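The five features and the Eq. (10) scaling can be computed directly. The helper names below are illustrative, and the small epsilon in the normalizer is an assumption to avoid division by zero.

```python
import numpy as np

def state_vector(fitness, t, t_max, recent_rewards):
    """5-D evolutionary state: f_best, f_mean, f_std, t_ratio, success rate."""
    f_best = float(np.min(fitness))                 # Eq. (6)
    f_mean = float(np.mean(fitness))                # Eq. (7)
    f_std = float(np.std(fitness))                  # Eq. (8), population std
    t_ratio = t / t_max                             # Eq. (9)
    # proportion of positive rewards over the most recent 10 iterations
    last = recent_rewards[-10:]
    success = sum(r > 0 for r in last) / max(len(last), 1)
    return np.array([f_best, f_mean, f_std, t_ratio, success])

def minmax(x, x_min, x_max, eps=1e-12):
    """Eq. (10): min-max scale a feature to [0, 1]; eps is an assumption."""
    return (x - x_min) / (x_max - x_min + eps)

s = state_vector(np.array([3.0, 1.0, 2.0]), t=50, t_max=100,
                 recent_rewards=[0.2, -0.5, 0.1])
```

In the full algorithm each raw feature would be scaled with `minmax` over its observed range before entering the network.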

(2) Action space design: This paper designs a five-dimensional action space, which is determined by the probability distribution output by the policy neural network, as shown in Table 1.

Table 1.

Action Selection Table.

Action | 0 | 1 | 2 | 3 | 4
P_i | P0 | P1 | P2 | P3 | P4

In Table 1, P0 + P1 + P2 + P3 + P4 = 1 and a = f(P_i), where f is an action selection function that chooses the action with the highest probability according to the distribution output by the policy neural network. For example, if P0 has the highest value, then a = 0 and the agent outputs a = 0 to the environment; if P3 has the highest value, then a = 3 and the agent outputs a = 3 to the environment.

(3) Reward function design: To better drive the results of the differential evolution algorithm toward the optimal value, the reward function is designed as shown in Eq. (11). For minimization problems, if the optimal fitness value of the current generation population is smaller than that of the previous generation population, a positive reward is given; otherwise, a negative reward (i.e., -0.5) is assigned.

R_t = \begin{cases} \dfrac{f_{old} - f_{new}}{\left|f_{old}\right| + \varepsilon}, & f_{new} < f_{old} \\ -0.5, & \text{otherwise} \end{cases} \qquad (11)

In Eq. (11), fold denotes the optimal fitness value of the previous generation population, and fnew represents the optimal fitness value of the current generation population. ε is an extremely small positive constant whose purpose is to prevent the denominator from being zero.
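A sketch of the Eq. (11) reward for minimization, assuming the positive branch is the relative improvement normalized by |f_old| + ε (the −0.5 penalty is stated in the text; the ε default here is an assumption):

```python
def reward(f_old, f_new, eps=1e-12):
    """Eq. (11) sketch: relative improvement as the positive reward,
    fixed -0.5 penalty when the best fitness does not improve.
    eps guards against a zero denominator; its value is an assumption."""
    if f_new < f_old:
        return (f_old - f_new) / (abs(f_old) + eps)
    return -0.5

r_good = reward(10.0, 5.0)   # improvement: positive reward
r_bad = reward(5.0, 10.0)    # regression: fixed penalty
```

Normalizing by |f_old| keeps the reward scale comparable across problems whose fitness magnitudes differ by orders of magnitude.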

Internal structure of the agent

The second part of this framework describes the internal structure of the agent, which consists of a policy neural network. The policy network adopts a 3-layer fully connected structure: (1) input layer: 5 neurons, corresponding to the 5-dimensional normalized state (fbest, fmean, fstd, tratio, Successrate); (2) hidden layer: 16 neurons, with the tanh activation function to avoid gradient vanishing; (3) output layer: 5 neurons, corresponding to the 5-dimensional action probabilities, with the softmax activation function. The training hyperparameters are set as follows: learning rate α, BatchSize = 32, discount factor γ, the Adam optimizer, and a maximum number of training iterations Trainmax = 100. The internal structure of the agent is shown in Fig. 6.

Fig. 6. Internal structure diagram of the agent.

The policy neural network designed in this paper takes a five-dimensional state observation as input, comprising the current optimal fitness value, the population average fitness, the standard deviation of fitness values, the iteration progress ratio, and the parameter adjustment success rate. The output is a five-dimensional action probability distribution. The parameters of the policy neural network are updated by minimizing the loss function L(θ), and the update method is shown in Eq. (12):

\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \qquad (12)

In Eq. (12), α denotes the learning rate, which adjusts the update range of the neural network; ∇_θ log π_θ(a_t | s_t) represents the gradient of the log-policy with respect to the parameters θ; π_θ(a_t | s_t) indicates the probability of taking action a_t under state s_t; γ denotes the discount factor; and G_t represents the cumulative discounted reward obtained by the agent after one training session. The designed policy neural network includes one hidden layer, with the tanh function serving as its activation function.
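The update in Eq. (12) can be sketched as a single REINFORCE step on the 5-16-5 network described above. This is a minimal numpy sketch: plain gradient ascent replaces the paper's Adam optimizer, the initial weights and sample state are illustrative, and the gradients are derived by hand for this tiny network.

```python
import numpy as np

rng = np.random.default_rng(0)
# 5-16-5 policy network: tanh hidden layer, softmax output
W1 = rng.normal(0, 0.1, (16, 5)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (5, 16)); b2 = np.zeros(5)

def forward(s):
    h = np.tanh(W1 @ s + b1)
    z = W2 @ h + b2
    p = np.exp(z - z.max())
    return h, p / p.sum()          # hidden activations, action probabilities

def reinforce_step(s, a, G_t, alpha=0.01):
    """One Eq. (12)-style update: ascend alpha * G_t * grad log pi(a|s)."""
    global W1, b1, W2, b2
    h, p = forward(s)
    dz = -p; dz[a] += 1.0          # d log pi(a|s) / d logits = onehot(a) - p
    dW2 = np.outer(dz, h); db2 = dz
    dpre = (W2.T @ dz) * (1.0 - h**2)   # backprop through tanh
    dW1 = np.outer(dpre, s); db1 = dpre
    W1 = W1 + alpha * G_t * dW1; b1 = b1 + alpha * G_t * db1
    W2 = W2 + alpha * G_t * dW2; b2 = b2 + alpha * G_t * db2

s = np.array([0.2, 0.5, 0.1, 0.5, 0.6])   # a normalized 5-D state
_, p_before = forward(s)
reinforce_step(s, a=2, G_t=1.0)
_, p_after = forward(s)
```

A positive return raises the probability of the chosen action, which is exactly the mechanism that steers F and CR adjustments toward choices that improved fitness.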

Update and optimization of DE parameters

The third part of the framework describes the parameter update and optimization process of the differential evolution algorithm. The algorithm leverages the designed policy neural network and action selection function to achieve adaptive adjustment of the differential evolution algorithm parameters. The specific update rules for the parameters are shown in Table 2.

Table 2.

Parameter update rules.

a | F | CR
0 | += 0.05 | unchanged
1 | -= 0.05 | unchanged
2 | unchanged | += 0.05
3 | unchanged | -= 0.05
4 | unchanged | unchanged

The effective interval of F is [0.1, 1.5], and that of CR is [0.1, 0.9]. If a parameter update would exceed its interval, the value is clipped to the nearest endpoint. For example: if the current F = 0.8 and the agent outputs action a = 0 (i.e., F += 0.05), the updated F is 0.85; if the current CR = 0.9 and the agent outputs action a = 3 (i.e., CR -= 0.05), the updated CR is 0.85; if the current F = 1.5 and the agent outputs a = 0, F remains clipped at 1.5. After obtaining the new parameters, the differential evolution algorithm performs mutation and crossover. Finally, the updated individual optimal solution and global optimal solution are used as the input for the agent’s next training iteration to start a new round of training.
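Table 2 together with the clipping rule can be expressed as a small helper (the function name and signature are illustrative):

```python
def apply_action(a, F, CR, step=0.05):
    """Table 2 update rules with clipping to F in [0.1, 1.5], CR in [0.1, 0.9]."""
    if a == 0:
        F += step
    elif a == 1:
        F -= step
    elif a == 2:
        CR += step
    elif a == 3:
        CR -= step
    # action 4: both parameters stay unchanged
    F = min(max(F, 0.1), 1.5)
    CR = min(max(CR, 0.1), 0.9)
    return F, CR

new_F, new_CR = apply_action(0, 0.8, 0.5)
```

This reproduces the worked examples above: action 0 on F = 0.8 yields 0.85, action 3 on CR = 0.9 yields 0.85, and action 0 on F = 1.5 leaves F clipped at 1.5.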

Regarding the parameter adjustment magnitude: in a preliminary study on the Elliptic function in a 10-dimensional space, three adjustment magnitudes (±0.02, ±0.05, ±0.1) were tested. The results show that ±0.05 best balances accuracy and stability. Therefore, ±0.05 is selected as the parameter adjustment magnitude in this paper.

Hierarchical mutation mechanism

Hierarchical mechanisms have been used by researchers in the improvement of optimization algorithms [25]. In the DE algorithm, the fitness of individuals in each generation sometimes varies significantly. To achieve in-depth exploitation of superior individuals while enhancing the mutation of inferior individuals, a hierarchical sorting mechanism is proposed. This mechanism divides individuals into two categories based on their fitness values and adopts a different mutation strategy for each to improve the algorithm’s performance. Before mutation, the algorithm first calculates the mean μ and standard deviation σ of the fitness of the current population. It excludes outlier individuals with fitness values greater than μ + 3σ, temporarily stores these outliers in a temporary set U, and uniformly assigns them to the two groups after stratification. Subsequently, the remaining individuals are sorted in descending order of fitness values and split evenly into an elite group S_e and an inferior group S_in, as shown in Eq. (13):

S_e = \{x_{\sigma(1)}, \ldots, x_{\sigma(m)}\} \cup \mathrm{rand}_{\lceil |U|/2 \rceil}(U), \quad S_{in} = \{x_{\sigma(m+1)}, \ldots, x_{\sigma(NP-|U|)}\} \cup \left(U \setminus S_e\right), \quad m = \left\lceil \frac{NP - |U|}{2} \right\rceil \qquad (13)

In Eq. (13), S_e denotes the elite group, S_in denotes the inferior group, U is the set of outlier individuals, and rand_k(U) represents randomly selecting k individuals from U. Based on this, a two-layer mutation mechanism is adopted: for the elite group, since DE/best/bin has the characteristic of accelerating convergence, this strategy is used to retain excellent genes and speed up convergence. The DE/best/bin strategy is shown in Eq. (14):

v_i(t) = x_{best}(t) + F\cdot\left(x_{r_1}(t) - x_{r_2}(t)\right) \qquad (14)

In Eq. (14), x_{best}(t) is the current global optimal individual, and the meanings of the other symbols are the same as those in the previous section.

For the inferior group, since DE/rand/bin has the characteristic of extensive search, this strategy is adopted to explore the solution space more broadly and enhance the algorithm’s optimization ability. The DE/rand/bin strategy is shown in Eq. (15):

v_i(t) = x_{r_1}(t) + F\cdot\left(x_{r_2}(t) - x_{r_3}(t)\right) \qquad (15)

In Eq. (15), the meanings of all symbols are the same as those in the previous section. Finally, the evolved individuals of the elite group and the inferior group are merged to generate the next-generation population.
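The split-and-mutate step above can be sketched as follows. The 3-sigma outlier threshold, the even split of the remaining individuals, and the alternating assignment of outliers are assumptions made for illustration.

```python
import numpy as np

def hierarchical_mutation(X, fitness, F, rng, k_sigma=3.0):
    """Sketch of the hierarchical mutation: elites use DE/best/1 (Eq. (14)),
    the inferior group uses DE/rand/1 (Eq. (15))."""
    NP, n = X.shape
    mu, sigma = fitness.mean(), fitness.std()
    outlier = fitness > mu + k_sigma * sigma        # set aside extreme individuals
    core = np.where(~outlier)[0]
    order = core[np.argsort(fitness[core])]         # best (lowest fitness) first
    half = len(order) // 2
    elite, inferior = list(order[:half]), list(order[half:])
    # distribute the set-aside outliers evenly across the two groups
    for j, idx in enumerate(np.where(outlier)[0]):
        (elite if j % 2 == 0 else inferior).append(idx)
    best = X[np.argmin(fitness)]
    V = np.empty_like(X)
    for i in elite:        # Eq. (14): exploit around the global best
        r1, r2 = rng.choice([j for j in range(NP) if j != i], 2, replace=False)
        V[i] = best + F * (X[r1] - X[r2])
    for i in inferior:     # Eq. (15): broad random exploration
        r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
        V[i] = X[r1] + F * (X[r2] - X[r3])
    return V

rng = np.random.default_rng(1)
X = rng.random((10, 3))
V = hierarchical_mutation(X, np.array([q @ q for q in X]), 0.5, rng)
```

Every individual receives exactly one mutant vector, so the subsequent crossover and selection steps proceed unchanged.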

Algorithm steps and process design

The complete flow chart of the algorithm is shown in Fig. 7. The algorithm is divided into two phases: first, the parameter training phase of the policy network, followed by the parameter adjustment phase of the differential evolution algorithm. The complete algorithm steps are presented in Algorithm 1:

Algorithm 1. Reinforcement Learning-based Differential Evolution (RLDE).

Fig. 7. Flowchart of the RLDE Algorithm.

Complexity analysis

The time complexity of the RLDE algorithm is analyzed in conjunction with its two-stage process. The core influencing variables include the number of policy network training epochs (Trainmax), the number of DE iterations (Gmax), the population size (NP), the problem dimension (n), the number of hidden layer nodes in the policy neural network (H), and the sampling batch size of the replay buffer (BatchSize). The policy network training stage dominates the time complexity. This stage loops for Trainmax epochs. In each epoch, the initialization of n-dimensional individuals is first completed through the Halton sequence (time cost O(NP·n)), and the initial fitness is calculated (O(NP·n)). Then Gmax iterations are performed. Each iteration includes state calculation and normalization (O(NP)), a policy network forward pass for action selection (O(H)), hierarchical mutation (sorting individuals by fitness, O(NP log NP), then splitting the population and mutating each group), and crossover and selection (O(NP·n)). After each training epoch, the network is also updated from BatchSize experience samples (O(BatchSize·H)). The DE parameter adjustment stage requires no network training and only performs Gmax DE iterations, with a time cost of O(Gmax·NP·n). Ignoring low-order terms, the total time complexity of RLDE is approximately O(Trainmax·Gmax·NP·n), dominated overall by the policy network training stage.

The space complexity of the RLDE algorithm is mainly determined by data storage requirements, with the core overhead coming from population and fitness data, policy network parameters, and the replay buffer. Population storage requires saving NP n-dimensional individuals, giving a space overhead of O(NP·n); fitness data stores the fitness values of the NP individuals, O(NP); the policy network parameters include input-hidden weights (5H), hidden-output weights (5H), and biases (H + 5), giving O(H); and the replay buffer stores Trainmax·Gmax experience samples. Since NP·n is much larger in magnitude than H and Trainmax·Gmax, the total space complexity of RLDE is approximately O(NP·n).
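The weight and bias counts quoted above can be checked by direct arithmetic for the 5-16-5 network (H = 16):

```python
# Parameter count of the 5-16-5 policy network:
# input-hidden weights 5*H, hidden-output weights H*5, biases H + 5.
H = 16
n_weights = 5 * H + H * 5
n_biases = H + 5
n_params = n_weights + n_biases
```

With H = 16 this gives 160 weights and 21 biases, 181 parameters in total, which is negligible next to the O(NP·n) population storage.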

Experimental design and result analysis

Experimental design

Benchmark test functions

The performance verification of optimization algorithms requires a combination of theoretical analysis and experimental testing. For example, the Gradient-based Differential Neural Solution (GDN) has its exponential convergence theoretically proven for time-varying nonlinear optimization problems, and its effectiveness is verified through robot motion planning experiments26. In this paper, the optimization performance test of the RLDE algorithm on 10/30/50-dimensional standard test functions likewise follows the principle of "guidance by theoretical logic + multi-dimensional experimental verification", aiming to comprehensively evaluate the algorithm's global optimization capability. Twenty-six test functions were selected for the experiment, as shown in Table 3. They cover unimodal, multimodal, non-separable, separable, shifted, rotated, scalable, and non-scalable types, and can therefore comprehensively test the global optimization performance of the algorithm.

Table 3.

Benchmark function table.

Serial number Function Mathematical expression Range Optimal solution
1 Bent Cigar $f(x)=x_1^2+10^6\sum_{i=2}^{n}x_i^2$ [− 10,10] 0
2 Sphere $f(x)=\sum_{i=1}^{n}x_i^2$ [− 10,10] 0
3 Elliptic $f(x)=\sum_{i=1}^{n}(10^6)^{\frac{i-1}{n-1}}x_i^2$ [− 10,10] 0
4 Schwefel 1.2 $f(x)=\sum_{i=1}^{n}\bigl(\sum_{j=1}^{i}x_j\bigr)^2$ [− 10,10] 0
5 Schwefel 2.21 $f(x)=\max_{1\le i\le n}|x_i|$ [− 10,10] 0
6 Schwefel 2.22 $f(x)=\sum_{i=1}^{n}|x_i|+\prod_{i=1}^{n}|x_i|$ [− 1,1] 0
7 Discus $f(x)=10^6x_1^2+\sum_{i=2}^{n}x_i^2$ [− 10,10] 0
8 Sum of Different Power $f(x)=\sum_{i=1}^{n}|x_i|^{i+1}$ [− 10,10] 0
9 Sum Squares $f(x)=\sum_{i=1}^{n}i\,x_i^2$ [− 1,1] 0
10 Different Powers $f(x)=\sqrt{\sum_{i=1}^{n}|x_i|^{2+\frac{4(i-1)}{n-1}}}$ [− 10,10] 0
11 Exponential $f(x)=-\exp\bigl(-0.5\sum_{i=1}^{n}x_i^2\bigr)$ [− 1,1] − 1
12 Zakharov $f(x)=\sum_{i=1}^{n}x_i^2+\bigl(\sum_{i=1}^{n}0.5\,i\,x_i\bigr)^2+\bigl(\sum_{i=1}^{n}0.5\,i\,x_i\bigr)^4$ [0.5,1] 0
13 Rosenbrock $f(x)=\sum_{i=1}^{n-1}\bigl[100(x_{i+1}-x_i^2)^2+(x_i-1)^2\bigr]$ [− 3,3] 0
14 Griewank $f(x)=\frac{1}{4000}\sum_{i=1}^{n}x_i^2-\prod_{i=1}^{n}\cos\frac{x_i}{\sqrt{i}}+1$ [− 60,60] 0
15 Rastrigin $f(x)=\sum_{i=1}^{n}\bigl[x_i^2-10\cos(2\pi x_i)+10\bigr]$ [− 0.512,0.512] 0
16 Alpine $f(x)=\sum_{i=1}^{n}|x_i\sin x_i+0.1x_i|$ [− 10,10] 0
17 Salomon $f(x)=1-\cos\bigl(2\pi\sqrt{\textstyle\sum_{i=1}^{n}x_i^2}\bigr)+0.1\sqrt{\textstyle\sum_{i=1}^{n}x_i^2}$ [− 10,10] 0
18 Scaffer2 $g(x,y)=0.5+\frac{\sin^2(x^2-y^2)-0.5}{[1+0.001(x^2+y^2)]^2}$, $f(x)=\sum_{i=1}^{n-1}g(x_i,x_{i+1})$ [− 10,10] 0
19 Ackley $f(x)=-20\exp\bigl(-0.2\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2}\bigr)-\exp\bigl(\frac{1}{n}\sum_{i=1}^{n}\cos(2\pi x_i)\bigr)+20+e$ [− 3.2,3.2] 0
20 Weierstrass $f(x)=\sum_{i=1}^{n}\sum_{k=0}^{k_{\max}}a^k\cos\bigl(2\pi b^k(x_i+0.5)\bigr)-n\sum_{k=0}^{k_{\max}}a^k\cos(\pi b^k)$, $a=0.5$, $b=3$, $k_{\max}=20$ [− 0.05,0.05] 0
21 HappyCat $f(x)=\bigl|\sum_{i=1}^{n}x_i^2-n\bigr|^{1/4}+\frac{0.5\sum_{i=1}^{n}x_i^2+\sum_{i=1}^{n}x_i}{n}+0.5$ [− 10,10] 0
22 HGBat $f(x)=\bigl|(\sum_{i=1}^{n}x_i^2)^2-(\sum_{i=1}^{n}x_i)^2\bigr|^{1/2}+\frac{0.5\sum_{i=1}^{n}x_i^2+\sum_{i=1}^{n}x_i}{n}+0.5$ [− 10,10] 0
23 Scaffer’s F6 $g(x,y)=0.5+\frac{\sin^2\sqrt{x^2+y^2}-0.5}{[1+0.001(x^2+y^2)]^2}$, $f(x)=\sum_{i=1}^{n-1}g(x_i,x_{i+1})$ [− 0.05,0.05] 0
24 NCRastrigin $f(x)=\sum_{i=1}^{n}\bigl[y_i^2-10\cos(2\pi y_i)+10\bigr]$, $y_i=x_i$ if $|x_i|<0.5$, otherwise $y_i=\operatorname{round}(2x_i)/2$ [− 1,1] 0
25 Step $f(x)=\sum_{i=1}^{n}\bigl(\lfloor x_i+0.5\rfloor\bigr)^2$ [− 10,10] 0
26 Noise quartic $f(x)=\sum_{i=1}^{n}i\,x_i^4+\mathrm{rand}[0,1)$ [− 0.128,0.128] 0
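A few of the Table 3 benchmarks can be sketched in code in their standard (unshifted, unrotated) forms; the paper may use shifted or rotated variants, so these definitions are assumptions for illustration only:

```python
import math

def sphere(x):
    """Unimodal: sum of squares, global minimum 0 at the origin."""
    return sum(v * v for v in x)

def rastrigin(x):
    """Multimodal: cosine ripples over a sphere bowl, minimum 0 at the origin."""
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)

def ackley(x):
    """Multimodal with a nearly flat outer region, minimum 0 at the origin."""
    n = len(x)
    s1 = sum(v * v for v in x) / n
    s2 = sum(math.cos(2.0 * math.pi * v) for v in x) / n
    return -20.0 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20.0 + math.e
```

Each function accepts a list of any length, which is what allows the same definitions to be evaluated at 10, 30, and 50 dimensions.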

Experimental methods and parameter settings

In the experiment, classic meta-heuristic algorithms were selected for comparison, including Differential Evolution (DE), Particle Swarm Optimization (PSO), and the Butterfly Optimization Algorithm (BOA), together with the high-performing DE variants JADE27, SHADE28, and LSHADE29. The parameters of each algorithm are shown in Table 4:

Table 4.

Parameter setting table of algorithms.

Algorithm Parameter setting
DE cr = 0.9 F = 0.8
PSO w = 0.5 c1 = 1.5 c2 = 1.5
BOA Sense_scale = 0.1 switch_prob = 0.8
JADE p = 0.1 c = 0.5
SHADE p = 0.1 c = 0.5 H = 100
LSHADE p = 0.1 c = 0.5 H = 100

The experimental computer was configured with an Intel(R) Pentium(R) CPU G2020 @ 2.90 GHz and 6 GB of memory, running Windows 10. The algorithms were implemented in Python 3.7. The basic parameters were as follows: population size NP = 50; each algorithm was run for 100 iterations; the training epochs of the proposed RLDE algorithm were set to 100. The experiments were conducted 30 times each in 10, 30, and 50 dimensions.
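Under the Table 4 settings, the DE baseline (DE/rand/1/bin with F = 0.8, CR = 0.9, NP = 50, and 100 iterations) can be sketched as follows; the boundary clamping and the fixed seed are implementation assumptions, not details given in the paper:

```python
import random

def de_rand_1_bin(fobj, bounds, np_size=50, f=0.8, cr=0.9, g_max=100, seed=0):
    """Classic DE/rand/1/bin baseline with the Table 4 settings."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(np_size)]
    fit = [fobj(ind) for ind in pop]
    for _ in range(g_max):
        for i in range(np_size):
            # three mutually distinct donors, all different from i
            a, b, c = rng.sample([j for j in range(np_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    trial.append(min(max(v, lo), hi))  # clamp to the range
                else:
                    trial.append(pop[i][j])
            tf = fobj(trial)
            if tf <= fit[i]:  # greedy one-to-one selection
                pop[i], fit[i] = trial, tf
    best = min(range(np_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

RLDE replaces the fixed F and CR here with values emitted online by the policy network, and replaces the single mutation rule with the tiered strategy described earlier.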

Experimental results and analysis

Accuracy analysis

To verify the superiority of the proposed improved Differential Evolution algorithm (RLDE), Tables 5, 6, and 7 present the mean and standard deviation of the best solutions obtained over 30 independent runs on the 26 test functions in each dimension. The Friedman test was then applied to the means of the 30 independent runs to derive the rankings. The ranking results in the tables show that the RLDE algorithm performs better on most test functions.
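The per-function rankings in Tables 5, 6, and 7 (including fractional ranks such as 2.5 when algorithms tie) correspond to average ranks over the algorithms' mean values, as used by the Friedman test. A minimal sketch, with illustrative names:

```python
def friedman_average_ranks(scores):
    """scores[f][a] is the mean result of algorithm a on function f
    (lower is better). Returns each algorithm's average rank across
    functions; tied values share the mean of the ranks they span."""
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=row.__getitem__)
        ranks = [0.0] * n_alg
        i = 0
        while i < n_alg:
            j = i
            while j + 1 < n_alg and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2.0 + 1.0  # ties share the average rank
            for k in range(i, j + 1):
                ranks[order[k]] = mean_rank
            i = j + 1
        for a in range(n_alg):
            totals[a] += ranks[a]
    return [t / len(scores) for t in totals]
```

For example, if four algorithms tie for second place on a function, each receives rank 2.5, which is how the fractional ranks in the tables arise.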

Table 5.

Function operation result values in 10 dimensions.

Function Index RLDE DE PSO BOA JADE SHADE LSHADE
bent_cigar mean 1.37E+03 1.98E+07 2.38E+08 3.90E+08 1.08E+07 1.18E+07 4.23E+06
std 6.88E+02 6.84E+06 7.22E+07 1.12E+08 8.11E+06 1.44E+07 3.95E+06
rank 1 5 6 7 3 4 2
sphere mean 9.37E−04 4.90E+00 3.32E+01 4.58E+01 1.56E+00 2.00E+00 3.26E−01
std 5.52E−04 2.16E+00 7.71E+00 1.35E+01 1.17E+00 2.26E+00 5.11E−01
rank 1 5 6 7 3 4 2
elliptic mean 1.90E+00 2.49E+03 1.05E+06 5.09E+06 5.63E+04 3.60E+04 5.26E+03
std 8.27E−01 1.11E+03 4.56E+05 5.17E+06 5.64E+04 6.82E+04 9.14E+03
rank 1 2 6 7 5 4 3
Schwefel 1.2 mean 1.73E+01 1.64E+01 3.88E+01 4.26E+01 1.88E+01 1.09E−01 1.84E+01
std 1.32E+01 4.33E+00 1.27E+01 1.52E+01 1.04E+01 6.38E−02 6.63E+00
rank 3 2 6 7 5 1 4
Schwefel 2.21 mean 6.13E−02 6.96E−02 3.17E+00 3.90E+00 3.57E+00 2.61E+00 3.15E+00
std 1.01E−01 3.34E−02 5.13E−01 4.79E−01 5.48E−01 1.33E+00 6.79E−01
rank 1 2 5 7 6 3 4
Schwefel 2.22 mean 6.87E−03 5.37E−01 1.38E+00 1.69E+00 2.50E−01 3.17E−01 8.56E−02
std 2.07E−03 9.91E−02 2.37E−01 2.70E−01 8.88E−02 1.69E−01 4.68E−02
rank 1 5 6 7 3 4 2
discus mean 3.88E−03 9.29E+00 2.38E+02 1.19E+02 2.50E+01 2.00E+01 5.44E+00
std 2.69E−03 2.81E+00 6.59E+01 4.64E+01 1.67E+01 1.31E+01 5.58E+00
rank 1 3 7 6 5 4 2
Sum of Different Power mean 3.00E−05 4.25E+00 4.90E+02 3.56E+03 9.51E−01 6.11E+00 9.84E−03
std 2.35E−05 2.27E+00 3.46E+02 5.15E+03 1.64E+00 1.54E+01 1.85E−02
rank 1 4 6 7 3 5 2
Sum Squares mean 8.19E−06 6.62E−02 1.08E+00 1.71E+00 8.93E−03 6.63E−02 4.89E−03
std 4.14E−06 2.50E−02 3.92E−01 4.81E−01 8.22E−03 6.11E−02 2.22E−03
rank 1 4 6 7 3 5 2
Different Powers mean 6.06E−03 1.85E+00 9.90E+00 1.59E+01 1.45E+00 1.62E+00 3.92E−01
std 4.20E−03 6.01E−01 2.76E+00 5.33E+00 5.62E−01 1.51E+00 4.09E−01
rank 1 5 6 7 3 4 2
Exponential mean −1.00E+00 −1.00E+00 −9.98E−01 −9.97E−01 −1.00E+00 −1.00E+00 −9.92E−01
std 2.62E−08 9.14E−05 4.26E−04 6.40E−04 6.36E−05 9.86E−05 2.99E−03
rank 2.5 2.5 5 6 2.5 2.5 7
zakharov mean 1.06E−01 4.77E+01 7.46E+01 7.11E+01 5.34E+01 3.93E+01 2.60E+01
std 5.14E−02 1.61E+01 1.75E+01 2.16E+01 3.11E+01 2.22E+01 1.40E+01
rank 1 4 7 6 5 3 2
rosenbrock mean 6.47E+00 3.18E+01 1.65E+02 4.86E+02 3.99E+01 3.06E+01 1.81E+01
std 9.48E−01 9.50E+00 6.30E+01 1.59E+02 2.14E+01 1.78E+01 2.26E+01
rank 1 4 6 7 5 3 2
Griewank mean 3.22E−01 8.71E−01 1.27E+00 1.44E+00 5.38E−01 3.29E−01 3.34E−01
std 9.37E−02 1.04E−01 9.94E−02 1.18E−01 1.71E−01 1.84E−01 1.60E−01
rank 1 5 6 7 4 2 3
Rastrigin mean 5.04E−04 2.32E+00 1.64E+01 2.31E+01 7.82E−01 8.23E−01 1.89E−01
std 4.58E−04 1.26E+00 4.18E+00 5.75E+00 4.11E−01 5.95E−01 2.76E−01
rank 1 5 6 7 3 4 2
Alpine mean 4.54E−01 6.45E+00 7.51E+00 8.38E+00 8.53E−01 1.11E+00 1.18E+00
std 4.87E−01 1.25E+00 8.53E−01 1.60E+00 4.22E−01 8.62E−01 5.11E−01
rank 1 5 6 7 2 3 4
Salomon mean 4.73E−01 7.80E−01 7.74E−01 8.88E−01 5.13E−01 3.09E−01 5.96E−01
std 1.59E−01 1.08E−01 9.33E−02 1.18E−01 1.78E−01 5.32E−02 1.66E−01
rank 2 6 5 7 3 1 4
Scaffer2 mean 1.26E+00 3.17E+00 3.56E+00 6.18E+00 2.43E+00 2.52E+00 1.69E+00
std 6.48E−01 1.07E+00 9.15E−01 1.58E+00 1.12E+00 1.05E+00 4.79E−01
rank 1 5 6 7 3 4 2
Ackley mean 1.32E−02 2.22E+00 3.70E+00 4.14E+00 1.03E+00 1.14E+00 3.77E−01
std 3.43E−03 4.01E−01 3.69E−01 4.05E−01 4.18E−01 5.40E−01 5.13E−01
rank 1 5 6 7 3 4 2
weierstrass mean 1.07E−01 1.54E+00 2.36E+00 2.82E+00 5.75E−01 7.90E−01 8.15E−01
std 2.59E−02 1.86E−01 2.00E−01 3.57E−01 1.35E−01 2.12E−01 1.66E−01
rank 1 5 6 7 2 3 4
HappyCat mean 8.63E−02 1.68E−01 1.77E−01 1.01E−01 1.06E−01 1.49E−01 1.46E−01
std 1.86E−02 3.05E−02 3.33E−02 2.17E−02 2.46E−02 2.74E−02 4.05E−02
rank 1 6 7 2 3 5 4
HGBat mean 9.36E−02 1.09E−01 1.03E−01 4.67E−02 9.03E−02 4.79E−02 5.60E−02
std 2.52E−02 2.93E−02 3.39E−02 1.66E−02 2.10E−02 2.47E−02 1.49E−02
rank 5 7 6 1 4 2 3
Scaffer’s F6 mean 3.61E−08 2.36E−04 1.71E−03 2.59E−03 1.03E−04 8.25E−05 7.71E−03
std 2.61E−08 9.38E−05 4.46E−04 6.82E−04 9.08E−05 6.45E−05 2.82E−03
rank 1 4 5 6 3 2 7
NCRastrigin mean 3.33E+00 6.58E+00 1.01E+01 5.26E+01 3.81E+00 7.32E+00 4.79E+00
std 1.12E+00 7.42E−01 1.91E+00 8.50E+00 2.00E+00 1.86E+00 1.54E+00
rank 1 4 6 7 2 5 3
step mean 2.94E−02 2.09E+00 5.71E+00 6.98E+00 1.24E+00 1.15E+00 4.88E−01
std 9.76E−03 4.38E−01 7.00E−01 6.18E−01 4.19E−01 3.77E−01 4.19E−01
rank 1 5 6 7 4 3 2
Noise quartic mean 4.65E−01 5.50E−01 2.79E−03 5.62E−01 8.66E−04 1.73E−03 3.69E−03
std 2.82E−01 2.52E−01 1.14E−03 2.34E−01 2.93E−04 1.08E−03 2.05E−03
rank 5 6 3 7 1 2 4
Table 6.

Function operation result values in 30 dimensions.

Function Index RLDE DE PSO BOA JADE SHADE LSHADE
bent_cigar mean 6.40E+07 3.43E+09 4.15E+09 3.83E+09 3.82E+08 4.43E+08 4.91E+08
std 2.38E+07 5.17E+08 3.44E+08 4.00E+08 1.23E+08 1.25E+08 1.61E+08
rank 1 5 7 6 2 3 4
sphere mean 8.01E+00 3.69E+02 4.39E+02 3.99E+02 4.05E+01 4.63E+01 2.71E+01
std 2.94E+00 5.31E+01 4.05E+01 4.49E+01 1.04E+01 1.55E+01 1.02E+01
rank 1 5 7 6 3 4 2
elliptic mean 4.96E+04 4.81E+06 3.53E+07 8.90E+07 3.00E+06 3.57E+06 9.33E+05
std 1.76E+04 9.07E+05 8.97E+06 3.10E+07 1.83E+06 2.19E+06 7.44E+05
rank 1 5 6 7 3 4 2
Schwefel 1.2 mean 1.56E+02 5.45E+02 6.15E+02 6.78E+02 2.08E+02 1.60E+02 3.12E+02
std 7.15E+01 8.08E+01 9.26E+01 1.93E+02 5.73E+01 3.53E+01 5.46E+01
rank 1 5 6 7 3 2 4
Schwefel 2.21 mean 3.47E+00 6.42E+00 9.46E+00 6.67E+00 5.46E+00 5.55E+00 5.95E+00
std 6.58E−01 9.10E−01 7.15E−01 2.58E−01 3.50E−01 3.05E−01 4.46E−01
rank 1 5 7 6 2 3 4
Schwefel 2.22 mean 1.13E+00 7.93E+00 8.88E+00 8.90E+00 2.37E+00 2.72E+00 1.83E+00
std 1.82E−01 5.68E−01 5.16E−01 5.22E−01 4.13E−01 3.95E−01 3.42E−01
rank 1 5 6 7 3 4 2
discus mean 1.35E+01 4.12E+02 9.85E+02 8.90E+03 1.24E+02 1.45E+02 1.09E+02
std 3.90E+00 6.59E+01 1.41E+02 2.43E+04 4.23E+01 5.50E+01 3.99E+01
rank 1 5 6 7 3 4 2

Sum of Different Power mean 1.12E+06 1.25E+15 2.32E+16 1.34E+18 2.18E+06 1.93E+08 5.89E+07
std 5.74E+06 2.37E+15 5.94E+16 1.70E+18 4.85E+06 9.67E+08 2.26E+08
rank 1 5 6 7 2 4 3
Sum Squares mean 7.67E−01 3.99E+01 5.15E+01 5.15E+01 2.18E+00 5.69E+00 8.24E−01
std 2.76E−01 7.18E+00 7.40E+00 7.29E+00 8.34E−01 2.06E+00 4.93E−01
rank 1 5 6.5 6.5 3 4 2

Different Powers mean 4.73E+00 8.19E+01 1.11E+02 1.32E+02 1.42E+01 1.43E+01 1.11E+01
std 9.08E−01 1.18E+01 1.81E+01 2.56E+01 4.41E+00 4.36E+00 3.47E+00
rank 1 5 6 7 3 4 2
Exponential mean −1.00E+00 −9.82E−01 −9.79E−01 −9.80E−01 −9.98E−01 −9.98E−01 −9.82E−01
std 1.01E−04 2.87E−03 2.97E−03 1.83E−03 5.30E−04 6.47E−04 3.19E−03
rank 1 4.5 7 6 2.5 2.5 4.5
zakharov mean 1.33E+02 9.70E+02 7.16E+02 4.99E+02 2.46E+02 2.27E+02 5.48E+02
std 2.58E+01 1.35E+02 7.64E+01 6.63E+01 7.71E+01 1.04E+02 1.32E+02
rank 1 7 6 4 3 2 5
rosenbrock mean 1.30E+02 1.63E+03 3.45E+03 7.57E+03 4.87E+02 5.13E+02 4.37E+02
std 4.53E+01 3.61E+02 9.31E+02 1.38E+03 1.91E+02 1.50E+02 1.47E+02
rank 1 5 6 7 3 4 2
Griewank mean 8.95E−01 4.30E+00 4.94E+00 4.55E+00 1.34E+00 1.39E+00 1.28E+00
std 9.61E−02 6.43E−01 4.40E−01 4.07E−01 1.06E−01 1.11E−01 1.04E−01
rank 1 5 7 6 3 4 2
Rastrigin mean 3.56E+00 1.30E+02 1.49E+02 1.61E+02 2.01E+01 2.36E+01 1.25E+01
std 1.00E+00 1.66E+01 1.26E+01 1.48E+01 5.01E+00 5.76E+00 3.99E+00
rank 1 5 6 7 3 4 2
Alpine mean 9.24E+00 5.08E+01 4.72E+01 5.09E+01 1.27E+01 1.24E+01 1.68E+01
std 2.77E+00 3.25E+00 3.73E+00 5.37E+00 2.39E+00 2.66E+00 3.33E+00
rank 1 6 5 7 3 2 4
Salomon mean 1.10E+00 2.45E+00 2.32E+00 2.17E+00 1.35E+00 1.34E+00 1.58E+00
std 1.57E−01 1.92E−01 1.48E−01 9.95E−02 1.94E−01 1.77E−01 1.85E−01
rank 1 7 6 5 3 2 4
Scaffer2 mean 1.77E+01 3.24E+01 3.15E+01 4.29E+01 3.15E+01 2.43E+01 2.15E+01
std 3.01E+00 2.99E+00 3.16E+00 4.82E+00 3.33E+00 4.40E+00 2.77E+00
rank 1 6 4.5 7 4.5 3 2
Ackley mean 1.86E+00 5.64E+00 6.01E+00 5.75E+00 2.71E+00 2.86E+00 2.58E+00
std 3.26E−01 4.32E−01 2.10E−01 3.09E−01 2.51E−01 2.97E−01 3.44E−01
rank 1 5 7 6 3 4 2
weierstrass mean 2.64E+00 1.09E+01 1.14E+01 1.15E+01 4.49E+00 5.25E+00 6.18E+00
std 2.79E−01 6.88E−01 5.18E−01 4.98E−01 5.87E−01 5.22E−01 5.41E−01
rank 1 5 6 7 2 3 4
HappyCat mean 1.08E−01 2.13E−01 2.07E−01 1.45E−01 1.48E−01 1.76E−01 1.55E−01
std 2.33E−02 3.80E−02 4.60E−02 3.51E−02 2.60E−02 3.93E−02 3.42E−02
rank 1 7 6 2 3 5 4
HGBat mean 1.68E−01 1.99E−01 1.73E−01 2.56E−01 1.60E−01 8.32E−02 9.33E−02
std 6.80E−02 6.72E−02 5.26E−02 3.83E−01 6.14E−02 4.46E−02 3.25E−02
rank 4 6 5 7 3 1 2
Scaffer’s F6 mean 3.72E−04 1.88E−02 2.22E−02 2.03E−02 1.99E−03 2.45E−03 1.78E−02
std 1.13E−04 2.79E−03 2.23E−03 1.97E−03 6.29E−04 6.44E−04 4.18E−03
rank 1 5 7 6 2 3 4
NCRastrigin mean 2.22E+01 2.78E+01 9.92E+01 2.21E+02 7.26E+01 2.85E+01 6.93E+01
std 4.54E+00 8.12E−01 1.19E+01 1.83E+01 1.69E+01 3.15E+00 1.42E+01
rank 1 2 6 7 5 3 4
step mean 2.71E+00 1.86E+01 2.08E+01 2.01E+01 6.14E+00 7.00E+00 4.80E+00
std 4.71E−01 1.73E+00 1.14E+00 1.03E+00 8.09E−01 1.11E+00 7.94E−01
rank 1 5 7 6 3 4 2
Noise quartic mean 5.80E−01 4.54E−01 2.10E−02 5.14E−01 3.22E−03 6.24E−03 1.64E−02
std 2.43E−01 3.06E−01 3.93E−03 2.95E−01 1.15E−03 2.13E−03 4.72E−03
rank 7 5 4 6 1 2 3
Table 7.

Function operation result values in 50 dimensions.

Function Index RLDE DE PSO BOA JADE SHADE LSHADE
bent_cigar mean 1.55E+08 7.67E+09 9.81E+09 8.32E+09 9.79E+08 9.94E+08 1.53E+09
std 3.84E+07 1.11E+09 8.18E+08 6.49E+08 2.43E+08 2.25E+08 3.30E+08
rank 1 5 7 6 2 3 4
sphere mean 5.69E+01 7.89E+02 1.01E+03 8.33E+02 1.08E+02 1.09E+02 1.09E+02
std 1.11E+01 1.33E+02 6.71E+01 5.20E+01 2.77E+01 2.67E+01 2.32E+01
rank 1 5 7 6 2 3.5 3.5
elliptic mean 8.70E+05 4.32E+07 1.35E+08 2.70E+08 1.10E+07 1.45E+07 6.43E+06
std 2.58E+05 9.68E+06 2.73E+07 7.02E+07 4.32E+06 8.44E+06 4.42E+06
rank 1 5 6 7 3 4 2
Schwefel 1.2 mean 5.08E+02 1.54E+03 1.64E+03 1.90E+03 5.90E+02 7.07E+02 8.70E+02
std 2.24E+02 1.77E+02 1.90E+02 5.17E+02 2.18E+02 1.15E+02 1.60E+02
rank 1 5 6 7 2 3 4
Schwefel 2.21 mean 5.97E+00 8.93E+00 1.00E+01 7.59E+00 6.09E+00 6.00E+00 6.81E+00
std 6.17E−01 3.88E−01 0.00E+00 1.61E−01 3.20E−01 2.33E−01 3.38E−01
rank 1 6 7 5 3 2 4
Schwefel 2.22 mean 3.87E+00 1.55E+01 1.76E+01 1.72E+01 5.35E+00 5.85E+00 4.85E+00
std 5.18E−01 1.22E+00 6.95E−01 8.33E−01 5.28E−01 6.86E−01 6.19E−01
rank 1 5 7 6 3 4 2
discus mean 8.47E+01 9.32E+02 1.70E+03 5.48E+03 2.50E+02 2.86E+02 2.61E+02
std 2.13E+01 8.33E+01 1.77E+02 1.56E+04 5.71E+01 7.94E+01 6.67E+01
rank 1 5 6 7 2 4 3

Sum of Different Power mean 4.47E+15 2.71E+32 1.50E+33 1.04E+35 5.01E+16 6.95E+16 3.40E+20
std 1.18E+16 9.22E+32 3.20E+33 2.27E+35 1.98E+17 3.35E+17 1.82E+21
rank 1 5 6 7 2 3 4
Sum Squares mean 1.03E+01 1.73E+02 2.10E+02 1.95E+02 1.22E+01 2.24E+01 1.08E+01
std 1.93E+00 2.63E+01 1.95E+01 1.98E+01 2.87E+00 5.76E+00 3.39E+00
rank 1 5 7 6 3 4 2

Different Powers mean 2.06E+01 2.14E+02 2.71E+02 2.59E+02 2.72E+01 2.74E+01 2.94E+01
std 4.43E+00 3.32E+01 3.31E+01 3.88E+01 6.47E+00 7.17E+00 6.66E+00
rank 1 5 7 6 2 3 4
Exponential mean −9.97E−01 −9.58E−01 −9.52E−01 −9.58E−01 −9.95E−01 −9.95E−01 −9.74E−01
std 6.08E−04 4.77E−03 3.99E−03 3.02E−03 1.18E−03 1.13E−03 3.83E−03
rank 1 5.5 7 5.5 2.5 2.5 4
zakharov mean 3.51E+02 1.81E+03 1.61E+03 1.04E+03 4.98E+02 4.46E+02 1.09E+03
std 4.21E+01 2.89E+02 1.56E+02 1.16E+02 1.17E+02 1.48E+02 2.12E+02
rank 1 7 6 4 3 2 5
rosenbrock mean 6.48E+02 7.37E+03 1.25E+04 1.76E+04 1.39E+03 1.60E+03 1.38E+03
std 1.32E+02 1.88E+03 2.36E+03 2.30E+03 3.64E+02 3.91E+02 4.28E+02
rank 1 5 6 7 3 4 2
Griewank mean 1.50E+00 8.28E+00 9.89E+00 8.51E+00 1.99E+00 1.93E+00 1.94E+00
std 1.31E−01 1.15E+00 6.07E−01 7.54E−01 2.78E−01 2.01E−01 2.61E−01
rank 1 5 7 6 4 2 3
Rastrigin mean 2.58E+01 2.79E+02 3.17E+02 3.32E+02 5.15E+01 5.62E+01 4.70E+01
std 5.47E+00 3.06E+01 1.16E+01 1.85E+01 1.14E+01 1.13E+01 1.51E+01
rank 1 5 6 7 3 4 2
Alpine mean 2.24E+01 9.84E+01 9.95E+01 1.00E+02 2.84E+01 2.86E+01 3.91E+01
std 2.92E+00 7.36E+00 4.01E+00 7.61E+00 3.92E+00 3.53E+00 4.22E+00
rank 1 5 6 7 2 3 4
Salomon mean 1.52E+00 3.46E+00 3.40E+00 3.01E+00 2.00E+00 1.94E+00 2.25E+00
std 1.50E−01 1.58E−01 1.51E−01 1.38E−01 1.69E−01 1.89E−01 1.84E−01
rank 1 7 6 5 3 2 4
Scaffer2 mean 4.34E+01 6.99E+01 6.61E+01 8.65E+01 6.59E+01 4.87E+01 4.81E+01
std 4.00E+00 4.07E+00 3.84E+00 4.90E+00 4.64E+00 4.72E+00 4.87E+00
rank 1 6 5 7 4 3 2
Ackley mean 2.95E+00 6.22E+00 6.69E+00 6.29E+00 3.18E+00 3.29E+00 3.17E+00
std 2.37E−01 3.07E−01 2.05E−01 2.01E−01 2.10E−01 2.43E−01 2.27E−01
rank 1 5 7 6 3 4 2
weierstrass mean 9.16E+00 2.04E+01 2.16E+01 2.09E+01 9.47E+00 1.01E+01 1.24E+01
std 5.59E−01 1.03E+00 6.63E−01 7.98E−01 7.45E−01 7.62E−01 6.75E−01
rank 1 5 7 6 2 3 4
HappyCat mean 1.95E−01 2.26E−01 2.14E−01 2.03E−01 9.70E−02 1.31E−01 1.67E−01
std 4.15E−02 4.57E−02 4.34E−02 1.16E−01 2.75E−02 2.86E−02 3.78E−02
rank 4 7 6 5 1 2 3
HGBat mean 1.87E−01 3.07E−01 3.45E−01 7.15E−01 2.39E−01 2.50E−01 1.04E−01
std 7.44E−02 1.23E−01 1.46E−01 1.30E+00 1.07E−01 6.07E−01 4.05E−02
rank 2 5 6 7 3 4 1
Scaffer’s F6 mean 2.91E−03 4.13E−02 4.95E−02 4.25E−02 5.48E−03 5.36E−03 2.45E−02
std 7.45E−04 5.53E−03 3.03E−03 3.94E−03 1.46E−03 1.07E−03 2.84E−03
rank 1 5 7 6 3 2 4
NCRastrigin mean 4.73E+01 4.89E+01 2.13E+02 4.05E+02 1.73E+02 4.85E+01 1.74E+02
std 8.23E+00 1.07E+00 1.90E+01 2.47E+01 2.41E+01 1.56E+00 2.50E+01
rank 1 3 6 7 4 2 5
step mean 7.36E+00 2.81E+01 3.13E+01 2.93E+01 1.04E+01 1.08E+01 9.79E+00
std 8.02E−01 2.58E+00 1.10E+00 1.46E+00 8.97E−01 9.90E−01 1.12E+00
rank 1 5 7 6 3 4 2
Noise quartic mean 6.01E−01 6.21E−01 5.94E−02 5.08E−01 5.90E−03 1.11E−02 4.16E−02
std 2.71E−01 2.96E−01 7.67E−03 3.29E−01 2.45E−03 3.82E−03 8.54E−03
rank 6 7 4 5 1 2 3

As shown in Tables 5, 6, and 7, RLDE maintains a stable lead on both unimodal and multimodal functions, ranking slightly lower on only a handful of functions. Regarding dimensional trends, as the dimension increases from 10D to 50D, the mean results of all algorithms generally rise; however, RLDE's means grow significantly more slowly than those of the other algorithms and always remain at a low level. The other algorithms show advantages only on specific functions: for example, SHADE ranks first on the 30D HGBat function, LSHADE on the 50D HGBat function, BOA on the 10D HGBat function, and JADE performs best on the 10D Noise quartic function. This may stem from a match between those algorithms' search strategies and the characteristics of the specific functions (such as the nonlinear landscape of HGBat and the noise sensitivity of Noise quartic); however, this match lacks generality: on other functions or at higher dimensions, the advantage disappears. In addition, a special case occurs on the 10D Exponential function, where RLDE, DE, JADE, and SHADE tie at rank 2.5 and no single algorithm dominates. A likely reason is that the value range of this function compresses the performance differences between algorithms, allowing most of them to approach the optimal solution and leaving little room to open a gap.

Table 8 presents the results of the Wilcoxon signed-rank test on the 26 functions across the 3 dimensions (10D, 30D, 50D), with the significance level set to α = 0.05. The symbols "+", "≈", and "−" indicate that RLDE is superior to, statistically equivalent to, and inferior to the competing algorithm, respectively, and the summary rows at the bottom of the table tally the outcomes per competitor in the form "win | tie | loss". The comparison shows that RLDE performs excellently in all 3 dimensions. In the 10D scenario, it accumulates 142 wins, 11 ties, and 3 losses; its win counts against PSO and BOA are the highest, there is only 1 loss each against DE, JADE, and SHADE, and ties are concentrated in a few matchups, so the overall winning trend is clear, with only a small number of non-winning cases. In the 30D scenario, RLDE becomes more stable, accumulating 147 wins, 9 ties, and 0 losses: compared with 10D, the losses disappear entirely, the wins increase by 5, and the ties decrease by 2; there are no losses in any matchup, win coverage is wider, and ties occur only sporadically. In the high-dimensional 50D scenario, RLDE maintains its strength with 147 wins, 9 ties, and 0 losses, exactly matching the 30D counts: even where the performance demands on the algorithm are higher, there are still no losses, and the numbers of wins and ties remain stable.

Table 8.

Wilcoxon signed-rank test results on the benchmark functions (10D/30D/50D).

function DE PSO BOA JADE SHADE LSHADE
Bent Cigar +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
sphere +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Elliptic +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Schwefel 1.2 −/+/+ +/+/+ +/+/+ +/+/+ −/+/+ +/+/+
Schwefel 2.21 +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Schwefel 2.22 +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Discus +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Sum of Different Power +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Sum Squares +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Different Powers +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Exponential ≈/+/+ +/+/+ +/+/+ ≈/+/+ ≈/+/+ +/+/+
Zakharov +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Rosenbrock +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Griewank +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Rastrigin +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Apline +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Salomon +/+/+ +/+/+ +/+/+ +/+/+ −/+/+ +/+/+
Scaffer2 +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Ackley +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Weierstrass +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
HappyCat +/+/+ +/+/+ +/+/+ +/+/− +/+/− +/+/−
HGBat +/+/+ +/+/+ −/+/+ −/−/+ −/−/+ −/−/−
Scaffer’s F6 +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
NCRastrigin +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Step +/+/+ +/+/+ +/+/+ +/+/+ +/+/+ +/+/+
Noise quartic +/−/+ −/−/− +/−/− −/−/− −/−/− −/−/−
Wilcoxon−10 24|1|1 25|1|0 25|1|0 23|2|1 21|4|1 24|2|0
Wilcoxon−30 25|1|0 25|1|0 25|1|0 24|2|0 24|2|0 24|2|0
Wilcoxon−50 26|0|0 25|1|0 25|1|0 24|2|0 24|2|0 23|3|0

In conclusion, the RLDE algorithm maintains a high win rate and low loss rate against all competing algorithms in the 10D, 30D, and 50D scenarios. Moreover, its performance becomes more stable as the dimension increases, with only occasional fluctuations in a very small number of functions. This fully demonstrates that RLDE has significant and stable performance advantages across different dimensions and test functions.
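The per-matchup "+/≈/−" marks above come from pairwise Wilcoxon signed-rank tests at α = 0.05 over the 30 paired runs. A minimal sketch using the normal approximation (adequate at n = 30; zero differences are dropped and tie handling is simplified, which are assumptions of this sketch):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples x, y.
    Returns (W+, p) where W+ is the sum of ranks of positive differences
    and p comes from the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # average ranks over ties in |difference|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return w_plus, p
```

A "+" is recorded when p < 0.05 and RLDE's results are the smaller ones; "≈" when p ≥ 0.05.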

Convergence analysis

To analyze the convergence speed and final convergence performance of each algorithm, iteration trend graphs were obtained for the bent_cigar, sphere, elliptic, Rastrigin, Ackley, and Weierstrass functions under 10-, 30-, and 50-dimensional conditions. The results are shown in Figs. 8, 9, and 10.

Fig. 8.

Fig. 8

Iteration Curves of Functions in 10 Dimensions.

Fig. 9.

Fig. 9

Iteration Curves of Functions in 30 Dimensions.

Fig. 10.

Fig. 10

Iteration Curves of Functions in 50 Dimensions.

As can be seen from Figs. 8, 9, and 10, under the 10-dimensional (10D), 30-dimensional (30D), and 50-dimensional (50D) conditions of the selected test functions, the RLDE algorithm exhibits convergence performance significantly superior to that of the competing algorithms DE, PSO, BOA, JADE, SHADE, and LSHADE, and its advantage becomes more prominent as the dimension increases.

The Ackley function has a flat central region, which places high demands on the algorithm's ability to exploit small gradients. From 10D to 50D, RLDE converges fastest and its best value is much lower than that of the other algorithms; in the 50D scenario, the convergence of PSO and BOA stagnates noticeably, while RLDE continues to approach the optimal solution. The Bent Cigar function tends to cause search stagnation. At 10D, the performances of SHADE and RLDE are close; at 30D and 50D, however, the best value of RLDE is significantly better than that of DE and PSO, avoiding the convergence bottlenecks faced by the other algorithms. The Elliptic function places high demands on optimization accuracy. RLDE converges first in all dimensions: at 10D it already outperforms DE and BOA, and at 30D and 50D the best values of PSO and BOA remain at a higher order of magnitude while RLDE keeps a low best value and a fast convergence speed. The Weierstrass function, with its oscillating landscape, places high demands on algorithm stability. At 10D, the best value of RLDE is already lower than that of the other algorithms; at 30D and 50D, the best values of PSO and BOA remain above 100, while RLDE keeps a low best value and continues to converge. The Rastrigin function has many local optima, which easily cause stagnation. At 10D, the best value of RLDE reaches the order of 10⁻⁴, far surpassing the other algorithms; at 30D and 50D, the convergence of DE and PSO stagnates, while RLDE still steadily approaches the optimal solution. The Sphere function is a unimodal function that tests convergence accuracy. At 10D, the best value of RLDE is lower than that of JADE and SHADE; at 30D and 50D, the best values of the other algorithms rise above 10¹, while RLDE maintains its advantage.

In summary, RLDE exhibits better convergence speed and solution stability under test functions with different characteristics and different dimensions—especially in high-dimensional scenarios, where its advantage is more prominent.

Stability analysis

To further visually present the distribution characteristics of each algorithm during the solution process and examine algorithm stability, boxplots of the mean values of all algorithms on the different functions were plotted; the results are shown in Figs. 11, 12, and 13.

Fig. 11.

Fig. 11

Boxplots of Different Algorithms Under 10-Dimensional Condition.

Fig. 12.

Fig. 12

Boxplots of Different Algorithms Under 30-Dimensional Condition.

Fig. 13.

Fig. 13

Boxplots of Different Algorithms Under 50-Dimensional Condition.

From the perspective of the median and box width of the boxplots, in the Ackley function, the median of RLDE’s box is much lower than that of other algorithms and the box is narrower, which indicates that its typical optimization results are better and the experimental repeatability is good; the Bent Cigar function tends to cause algorithm stagnation, yet RLDE’s median is always at a lower order of magnitude and its box is compact (with no obvious outliers), demonstrating its stable optimization capability against stagnation; the Elliptic function has strict requirements on accuracy, and RLDE’s median is significantly lower than that of competing algorithms while its box is narrow, verifying its advantage in precise optimization. For the Weierstrass function, which fluctuates frequently and has high requirements on algorithm stability, the median of RLDE’s box is much lower than that of algorithms such as DE and PSO and the box is narrower, showing that its typical optimization results are more reliable and the experimental repeatability is good; the Rastrigin function has multiple local optima that easily trap the algorithm in stagnation, but RLDE’s median is always at a lower order of magnitude and its box is compact (with no obvious outliers), reflecting its stable optimization capability against local optima; as for the Sphere function, a unimodal function with strict requirements on optimization accuracy, RLDE’s median is significantly lower than that of competing algorithms and its box is narrow, which confirms its advantage in precise optimization. In summary, the boxplots of RLDE consistently show the characteristics of a low median and a narrow box, with obvious advantages in all scenarios.
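The two boxplot features the comparison above rests on, the median and the box width (interquartile range), can be computed directly from one algorithm's 30 run results using the standard library; a minimal sketch:

```python
import statistics

def box_summary(run_results):
    """Median and interquartile range of repeated-run results: a low
    median indicates better typical solutions, and a small IQR
    (a narrow box) indicates more repeatable results."""
    q1, _, q3 = statistics.quantiles(run_results, n=4)
    return {"median": statistics.median(run_results), "iqr": q3 - q1}
```

Comparing these two numbers per algorithm and per function reproduces the "low median, narrow box" reading used throughout this subsection.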

Ablation experiment

To verify the effectiveness of the Halton initialization mechanism, the reinforcement learning-based parameter control mechanism, and the hierarchical mutation mechanism in the proposed algorithm, an ablation experiment was designed: the Halton initialization mechanism was removed (denoted RLDE-a), the reinforcement learning-based parameter control mechanism was removed (denoted RLDE-b), and the hierarchical mutation mechanism was removed (denoted RLDE-c). The means and standard deviations of each ablation variant and the original algorithm on 6 functions, including Ackley, under the 10-dimensional condition were compared; the results are shown in Table 9.

Table 9.

Ablation results on 6 functions under the 10-dimensional condition.

Function Index RLDE RLDE−a RLDE−b RLDE−c DE
Ackley mean 1.32E−02 1.50E−02 1.34E+00 9.95E−01 2.22E+00
std 3.43E−03 6.02E−03 1.27E−01 4.44E−01 4.01E−01
bent_cigar mean 1.37E+03 2.02E+03 1.99E+06 1.67E+04 1.98E+07
std 6.88E+02 1.38E+03 8.78E+06 6.40E+04 6.84E+06
elliptic mean 1.90E+00 2.63E+00 2.04E+03 1.35E+02 2.49E+03
std 8.27E−01 3.63E+00 7.68E+02 9.20E+02 1.11E+03
Rastrigin mean 5.04E−04 9.24E−04 4.30E−01 2.38E−01 2.32E+00
std 4.58E−04 3.31E−04 9.54E−01 8.33E−01 1.26E+00
sphere mean 9.37E−04 5.92E−04 4.73E−01 2.04E−02 4.90E+00
std 5.52E−04 3.57E−04 1.50E−01 1.34E−02 2.16E+00
weierstrass mean 1.07E−01 1.43E−01 5.30E−01 6.19E−01 1.54E+00
std 2.59E−02 3.04E−02 2.03E−01 1.47E−01 1.86E−01

As can be seen from Table 9, the original RLDE algorithm is significantly superior to the three ablation variants (RLDE-a, RLDE-b, RLDE-c) and the traditional DE algorithm in terms of mean value and stability, which verifies the effectiveness of each designed module in RLDE. The role of each module can be further analyzed based on the performance of the ablation variants: For RLDE-a, its mean value is the closest to that of the original RLDE. For example, in the Ackley function, the value of RLDE is 1.32E-02 and that of RLDE-a is 1.50E-02; in the bent_cigar function, the values are 1.37E+03 and 2.02E+03 respectively, with only a slight increase; there is also no obvious fluctuation in the standard deviation. This indicates that although the Halton initialization can optimize the initial population distribution and lay a foundation for subsequent searches, its impact on the final performance of the algorithm is relatively limited.

For RLDE-b, the performance degradation is the most severe: the mean value of all functions increases by orders of magnitude. For instance, in the bent_cigar function, the value surges from 1.37E+03 to 1.99E+06, and in the elliptic function, it rises from 1.90E+00 to 2.04E+03; moreover, the standard deviation is much higher than that of other variants. This proves that the reinforcement learning-based parameter control mechanism is the core module of RLDE. By dynamically adjusting the algorithm parameters, it can significantly improve the solution accuracy and stability. Without this mechanism, the algorithm’s performance is even close to that of the traditional DE. For RLDE-c, the performance degradation is significantly greater than that of RLDE-a. For example, in the Rastrigin function, the mean value increases from 5.04E-04 to 2.38E-01, and in the weierstrass function, it rises from 1.07E-01 to 6.19E-01; in addition, the standard deviation generally expands. This shows that the hierarchical mutation mechanism can effectively prevent the algorithm from falling into local optima by maintaining population diversity and balancing global exploration and local exploitation, making it an important module to ensure the search efficiency of RLDE.

In summary, the order of importance of the three designed modules in RLDE is: reinforcement learning-based parameter control > hierarchical mutation > Halton initialization. The synergistic effect of the three enables RLDE to exhibit optimal performance in 10-dimensional complex optimization problems.

Application of RLDE in UAV task assignment

The engineering practical value of an optimization algorithm needs to be verified through specific scenarios. For example, some studies have proposed a Momentum Recurrent Neural Network (MRNN) combined with a cooperative neural dynamic optimization framework, which successfully solves the sparse motion planning problem of redundant manipulators. The core lies in improving task execution accuracy and efficiency through optimization algorithms30. In this paper, the improved Differential Evolution algorithm (RLDE) is applied to the Unmanned Aerial Vehicle (UAV) task assignment problem, which also falls into the category of “using optimization algorithms to solve complex engineering task planning” and can further verify the engineering value of such algorithms in multi-agent collaborative tasks.

The UAV task assignment problem refers to how to efficiently assign tasks to UAVs in a multi-UAV system to ensure the efficient completion of tasks. This problem has wide applications in fields such as logistics and distribution31, search and rescue32, agricultural plant protection33, and environmental monitoring34. With the rapid development of UAV technology, the UAV task assignment problem has gradually become a research hotspot, especially in scenarios involving multi-UAV collaboration and complex military missions. In this section, the proposed Differential Evolution algorithm based on reinforcement learning (RLDE) is applied to the UAV task assignment problem.

Problem description

Assume there are n UAVs and m tasks, and the tasks need to be assigned to different UAVs to achieve a certain optimization goal, such as maximizing the total profit or minimizing the total cost. It is assumed that each UAV carries multiple weapons and can conduct continuous strikes on multiple ground targets; within the effective operating time and area of the UAVs, these n UAVs arrive at the attack area simultaneously at a certain moment, and each UAV can only perform one task.

To represent the allocation relationship between UAVs and tasks, a two-dimensional decision matrix X of size n × m is defined, where each element xij is a binary variable:

xij = 1, if task j is assigned to UAV i; xij = 0, otherwise.  (16)

The two-dimensional decision matrix is expressed as follows:

X = [xij]n×m =
⎡ x11  x12  ⋯  x1m ⎤
⎢ x21  x22  ⋯  x2m ⎥
⎢  ⋮    ⋮   ⋱   ⋮  ⎥
⎣ xn1  xn2  ⋯  xnm ⎦  (17)

In Eq. (17), i = 1, 2, …, n denotes the index of the UAVs, and j = 1, 2, …, m denotes the index of the tasks. Each row of the matrix indicates the allocation status of a UAV, while each column reflects which UAVs a task is assigned to.
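As a concrete illustration of this decision-matrix encoding (the function and variable names below are our own, not from the paper), a minimal Python sketch:

```python
import numpy as np

def assignment_to_matrix(assignment, n_uavs, n_tasks):
    """Build the binary decision matrix X from a task-index vector.

    `assignment[i] = j` means UAV i is assigned task j.
    Illustrative helper; not code from the paper.
    """
    X = np.zeros((n_uavs, n_tasks), dtype=int)
    for i, j in enumerate(assignment):
        X[i, j] = 1
    return X

# Example: 3 UAVs, 4 tasks; UAV 0 -> task 2, UAV 1 -> task 0, UAV 2 -> task 2
# (tasks may be shared by several UAVs, matching constraint b below)
X = assignment_to_matrix([2, 0, 2], 3, 4)
```

Storing the per-UAV task index and expanding it to X on demand keeps every candidate solution feasible with respect to the one-task-per-UAV rule by construction.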

Objective functions

In this paper, three different forms of objective functions are designed, namely maximizing total revenue, minimizing total cost, and maximizing comprehensive benefit.

a. Maximizing total revenue.

The total revenue is expressed as:

max R = ∑i=1..n ∑j=1..m rj · xij  (18)

In Eq. (18), rj represents the revenue from completing task j. This objective aims to maximize the total revenue by adjusting the value of xij.

b. Minimizing total cost.

The total cost is expressed as:

min C = ∑i=1..n ∑j=1..m cij · xij  (19)

In Eq. (19), cij denotes the cost required for the i-th UAV to complete the j-th task. This objective aims to minimize the total cost by adjusting the value of xij.

c. Maximizing the comprehensive benefit value.

This objective aims to obtain the maximum revenue while incurring the minimum loss, which is expressed as:

max E = ∑i=1..n ∑j=1..m (rj − cij) · xij  (20)
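The three objective values can be sketched for a given decision matrix X as follows (a minimal sketch; the comprehensive benefit is assumed to be revenue minus cost, and all names are illustrative, not from the paper):

```python
import numpy as np

def total_revenue(X, r):
    # Revenue in the spirit of Eq. (18): sum r_j over every assigned (i, j) pair
    return float((X * r[np.newaxis, :]).sum())

def total_cost(X, c):
    # Cost in the spirit of Eq. (19): sum c_ij over assigned pairs
    return float((X * c).sum())

def comprehensive_benefit(X, r, c):
    # Comprehensive benefit, assumed to be revenue minus cost
    return total_revenue(X, r) - total_cost(X, c)

# Toy data: 2 UAVs, 2 tasks (values invented for illustration)
r = np.array([5.0, 3.0])            # revenue r_j of each task
c = np.array([[1.0, 2.0],
              [0.5, 1.5]])          # cost c_ij of UAV i doing task j
X = np.array([[1, 0],
              [0, 1]])              # UAV 0 -> task 0, UAV 1 -> task 1
```

In a maximization-only optimizer, the cost objective can simply be negated so that all three objectives are handled uniformly.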

Constraints

a. Each UAV can only be assigned one task.

Each UAV can perform only one task at a time; therefore, for each UAV i, the following condition holds:

∑j=1..m xij = 1, for each i = 1, 2, …, n  (21)

b. Each target can be assigned to multiple UAVs or not assigned to any UAV, with no restrictions imposed.
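The row constraint can be verified, and feasible candidates generated by construction, as in this hedged sketch (helper names are ours; we read "one task per UAV" as exactly one):

```python
import numpy as np

def is_feasible(X):
    # Constraint a: each UAV (row) is assigned exactly one task.
    # Constraint b: columns are unconstrained (a task may get several UAVs
    # or none), so no column check is needed.
    return bool((X.sum(axis=1) == 1).all())

def random_feasible(n_uavs, n_tasks, rng):
    # Draw one task index per UAV -> a feasible decision matrix by construction
    assignment = rng.integers(0, n_tasks, size=n_uavs)
    X = np.zeros((n_uavs, n_tasks), dtype=int)
    X[np.arange(n_uavs), assignment] = 1
    return X
```

Generating candidates this way lets a population-based solver such as DE search the per-UAV index vector directly instead of repairing infeasible binary matrices.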

Experimental setup

In the experiment, it is assumed that our side has n UAVs and m tasks, where the n UAVs are denoted as U = (u1, u2, u3, …, un) and the m tasks are denoted as T = (t1, t2, t3, …, tm). The revenue and cost values for UAVs to complete different tasks were obtained from experts in the relevant field.

To test the performance of the RLDE algorithm proposed in this paper in UAV task assignment, the experiment selects other classic meta-heuristic algorithms for comparative experiments, including Differential Evolution (DE), Particle Swarm Optimization (PSO), Butterfly Optimization Algorithm (BOA), and three other Differential Evolution variants (JADE, SHADE, LSHADE). The parameters of each algorithm and the experimental hardware settings are the same as those in Chapter 4.

Experimental results and analysis

The experimental results are evaluated from two aspects: (1) For each of the three objective functions in the UAV task assignment problem, the mean and standard deviation over repeated runs reflect the performance of each optimization method. For the profit and comprehensive-benefit objectives, a larger mean indicates a better result; for the cost objective, a smaller mean is better. When means are equal, a smaller standard deviation indicates stronger algorithm stability and a better calculation result. (2) Iteration trend graphs, which display the optimization process of each algorithm for each objective function in the form of curves.

Tables 10, 11, and 12 present the mean and standard deviation of the optimal solutions obtained from 30 independent runs of different algorithms under various UAV/target quantity combinations, targeting the three objective functions of “maximizing profit”, “minimizing cost”, and “maximizing comprehensive benefit”. Figure 14 shows the task assignment matrices of the proposed algorithm (RLDE) corresponding to the optimization objectives of “maximizing profit”, “minimizing cost”, and “maximizing comprehensive benefit” in the scenario with 20 UAVs and 20 tasks. Figure 15 shows a histogram comparing the costs of different algorithms under different UAV and task quantities.

Table 10.

Result of maximum benefit value function.

UAV/Task combination   Algorithm   Mean value   Standard deviation
UAV = 5, Task = 10     RLDE        46.86        0.74
                       DE          47.21        0.39
                       PSO         46.03        1.11
                       BOA         45.75        1.1
                       JADE        47.18        0.12
                       SHADE       47.21        0.92
                       LSHADE      47.12        0.29
UAV = 10, Task = 20    RLDE        92.03        1.26
                       DE          87.46        1.36
                       PSO         88.36        2.25
                       BOA         82.21        2.58
                       JADE        88.67        2.04
                       SHADE       90.4         1.35
                       LSHADE      89.87        1.77
UAV = 20, Task = 20    RLDE        178.15       2.71
                       DE          162.69       2.76
                       PSO         172.19       4.52
                       BOA         153.58       4.09
                       JADE        171.59       3.84
                       SHADE       168.9        8.07
                       LSHADE      170.51       3.1

Table 11.

Result of minimum cost value function.

UAV/Task combination   Algorithm   Mean value   Standard deviation
UAV = 5, Task = 10     RLDE        4.42         0.33
                       DE          4.27         0.51
                       PSO         4.92         0.62
                       BOA         7.44         0.87
                       JADE        4.28         0.06
                       SHADE       4.28         0.08
                       LSHADE      4.36         0.07
UAV = 10, Task = 20    RLDE        8.28         0.38
                       DE          10.3         0.52
                       PSO         10.69        1.07
                       BOA         18.1         2.16
                       JADE        8.96         0.62
                       SHADE       9.29         1.38
                       LSHADE      8.35         0.6
UAV = 20, Task = 20    RLDE        18.31        1.11
                       DE          26.93        1.58
                       PSO         22.01        1.44
                       BOA         40.12        3.07
                       JADE        20.89        1.29
                       SHADE       23.97        4.12
                       LSHADE      21.26        1.53

Table 12.

Result of maximum comprehensive benefit value function.

UAV/Task combination   Algorithm   Mean value   Standard deviation
UAV = 5, Task = 10     RLDE        35.11        0.29
                       DE          35.04        0.02
                       PSO         33.63        1.57
                       BOA         35.61        0.08
                       JADE        35.69        0.13
                       SHADE       35.57        0.39
                       LSHADE      35.72        0.38
UAV = 10, Task = 20    RLDE        71.38        1.91
                       DE          65.86        2.25
                       PSO         65.12        4.29
                       BOA         68.08        3.67
                       JADE        69.56        5.64
                       SHADE       68.51        1.99
                       LSHADE      68.09        1.69
UAV = 20, Task = 20    RLDE        133.05       3.66
                       DE          116.52       10.01
                       PSO         124.37       6.45
                       BOA         124.47       6.58
                       JADE        128.51       3.63
                       SHADE       123.31       3.76
                       LSHADE      125.41       3.71

Fig. 14. Task assignment matrix of RLDE when UAV = 20 and Task = 20.

Fig. 15. Comparative histogram of minimum cost results among different algorithms.

Mean and Std represent the mean and standard deviation of each algorithm’s fitness, respectively. The experimental results show that the improved Differential Evolution algorithm (RLDE) proposed in this paper exhibits significant superiority:

Under the maximizing profit objective, the algorithm has a medium-level mean in small-scale scenarios (UAVs = 5/Tasks = 10) but maintains good stability; in medium-scale (UAVs = 10/Tasks = 20) and large-scale (UAVs = 20/Tasks = 20) scenarios, its mean values reach 92.03 and 178.15, respectively, both leading other algorithms.

Under the minimizing cost objective, the algorithm’s mean values in medium and large-scale scenarios are 8.28 and 18.31, respectively—the lowest among all algorithms, and far superior to other algorithms such as Differential Evolution (DE) and Butterfly Optimization Algorithm (BOA).

Under the maximizing comprehensive benefit objective, the algorithm also maintains a leading mean in medium and large-scale scenarios.

In addition, in all scenarios, the standard deviation of RLDE is lower than that of most algorithms, which ensures high performance while maintaining strong stability; this advantage becomes even more prominent as the scale of UAVs and tasks expands.

Meanwhile, for these three types of objective functions, the experiment also plots iteration trend graphs of RLDE, Differential Evolution (DE), Particle Swarm Optimization (PSO), Butterfly Optimization Algorithm (BOA), and other Differential Evolution variants (JADE, SHADE, LSHADE) under different UAV/task quantity combinations. These iteration trend graphs are shown in Figs. 16–18. It can be observed from the figures that the proposed algorithm has certain advantages in terms of convergence speed and accuracy, and its performance is superior to traditional heuristic algorithms, especially in high-dimensional problems.

Fig. 16. Iteration curve of maximum benefit.

Fig. 18. Iteration curve of maximum comprehensive benefit.

Based on the iteration curves in Fig. 16 (maximizing profit), Fig. 17 (minimizing cost), and Fig. 18 (maximizing comprehensive benefit), the superiority of the RLDE algorithm is reflected in different UAV/task scenarios, and its performance also exhibits scenario adaptability characteristics. The specific observations are as follows:

Fig. 17. Iteration curve of minimum cost.

In the profit maximization iteration process (Fig. 16): In the small-scale scenario (5/10, i.e., 5 UAVs/10 tasks), the converged value of RLDE is at a medium level, but the curve is stable without large fluctuations; in the medium-scale (10/20, i.e., 10 UAVs/20 tasks) and large-scale (20/20, i.e., 20 UAVs/20 tasks) scenarios, RLDE not only has significantly higher converged profit values (approximately 92 and 178, respectively) than competing algorithms such as Differential Evolution (DE) and Particle Swarm Optimization (PSO), but also achieves faster convergence speed—it enters the stable stage earlier and avoids profit fluctuations in the later iteration stage.

In the cost minimization iteration process (Fig. 17): RLDE shows absolute advantages in all scenarios—its converged cost values (approximately 8.3 in the medium-scale scenario and 18.3 in the large-scale scenario) are the lowest among all algorithms, far lower than that of the Butterfly Optimization Algorithm (BOA, approximately 40 in the large-scale scenario); in addition, the fluctuation range of RLDE’s iteration curve is small, and its stability is superior to that of Particle Swarm Optimization (PSO, with a cost fluctuation of more than 1 in the medium-scale scenario), demonstrating its efficiency and stability in cost control.

In the comprehensive benefit maximization iteration process (Fig. 18): In the small-scale scenario, RLDE’s converged value is slightly inferior to that of Differential Evolution (DE) and JADE, but its curve stability is better; in the medium and large-scale scenarios, its converged value is ahead of that of SHADE and LSHADE, and its convergence speed is faster, achieving the dual goals of “high comprehensive benefit” and “fast and stable convergence”. Especially when the scale of tasks and UAVs expands, its advantage becomes more prominent, enabling it to meet the needs of complex scenarios.

The above experiments show that compared with traditional Differential Evolution algorithms and other competing algorithms, the improved Differential Evolution algorithm (RLDE) proposed in this paper can find better optimal solutions when the number of tasks is large, and has excellent convergence performance and global optimization capability.

Although RLDE has achieved ideal results in experiments, in light of the No Free Lunch (NFL) theorem, it still has limitations in practical engineering applications. First, it has high parameter sensitivity—parameters such as the learning rate α and discount factor γ of the policy network, and the maximum number of policy training epochs (Trainmax), require a large number of pre-experiments for tuning. There is no universal criterion, leading to high engineering debugging costs; an improper number of training epochs also affects efficiency and performance. Second, its performance degrades in high-dimensional problems: the time complexity increases linearly with dimension n, the high-dimensional uniformity of the Halton sequence decreases, the overhead of hierarchical mutation surges, and the fixed adjustment steps of parameters F and CR are difficult to adapt to high-dimensional requirements. Third, its applicable scenarios are limited: the hierarchical mutation strategy is designed for continuous, weakly multimodal problems, and adapts poorly to discrete or strongly multimodal problems in engineering, which restricts its application scope.
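To make the Halton-related limitation concrete, the sequence can be sketched with a hand-rolled radical-inverse construction (the paper does not specify its implementation, so this is an assumption; library generators such as SciPy’s `scipy.stats.qmc.Halton` additionally offer scrambling, which mitigates the correlation between high-prime dimensions noted above):

```python
def radical_inverse(index, base):
    # Van der Corput radical inverse of `index` in the given base:
    # mirror the base-`base` digits of `index` across the radix point.
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def halton_population(pop_size, dim, lower, upper):
    # One Halton point per individual, using the first `dim` primes as
    # bases and scaling each coordinate into [lower, upper].
    # Illustrative sketch only; supports up to 12 dimensions here.
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37][:dim]
    pop = []
    for i in range(1, pop_size + 1):  # skip index 0 (the all-zero point)
        point = [lower + (upper - lower) * radical_inverse(i, b)
                 for b in primes]
        pop.append(point)
    return pop
```

Because consecutive dimensions use successive primes, points in high-prime dimensions become strongly correlated, which is one reason the unscrambled sequence loses uniformity as the dimension grows.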

Conclusion

This paper proposes an improved Differential Evolution algorithm (RLDE). First, the Halton sequence is used to initialize the population. To address the issue that the key parameters of the Differential Evolution algorithm need to be manually specified, reinforcement learning is introduced; the key parameters are adaptively adjusted through a policy gradient network. Additionally, a hierarchical mechanism is introduced based on differences in fitness values, and two different mutation strategies—DE/rand/1 and DE/rand/best—are applied to individuals at different levels.
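The two mutation strategies named above can be sketched as follows (the exact form of “DE/rand/best” is not given in this excerpt, so the best-guided variant below is an assumption in the spirit of DE/best/1; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def mutate_rand_1(pop, i, F):
    # DE/rand/1: v = x_r1 + F * (x_r2 - x_r3), with r1, r2, r3 distinct
    # indices different from i
    idx = [k for k in range(len(pop)) if k != i]
    r1, r2, r3 = rng.choice(idx, size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

def mutate_best_guided(pop, i, best, F):
    # Best-guided mutation assumed for the paper's "DE/rand/best":
    # v = x_best + F * (x_r1 - x_r2)
    idx = [k for k in range(len(pop)) if k != i]
    r1, r2 = rng.choice(idx, size=2, replace=False)
    return pop[best] + F * (pop[r1] - pop[r2])
```

Routing low-fitness individuals through the exploratory DE/rand/1 rule and high-fitness individuals through the best-guided rule is the usual way such a hierarchy balances global exploration against local exploitation.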

The algorithm was tested through optimization calculations using 26 standard test functions, and comparisons were conducted with other algorithms under 10, 30, and 50 dimensions respectively. The results show that compared with traditional heuristic algorithms, the proposed algorithm exhibits faster convergence speed, higher convergence accuracy, and better global convergence capability, and performs well in the tested functions.

On this basis, with the UAV task assignment problem as the focus, a problem model was constructed based on three types of objectives: maximizing profit, minimizing cost, and maximizing comprehensive benefit. Experiments were then conducted using the proposed algorithm and other methods. The experimental results indicate that when the number of UAVs and tasks increases, the proposed algorithm achieves better performance compared with other algorithms. This method avoids the shortcomings of traditional Differential Evolution and has certain engineering value. In the future, the application of this algorithm in various optimization fields can be explored.

Author contributions

Guangwei Yang conducted the algorithm design and manuscript writing; Peng Sun and Jieyong Zhang performed logical organization and revision of the content; Yongzhuang Zhang and Tianxin Li completed the drawing of some figures and tables in the paper.

Funding

This work was not funded.

Data availability

All data generated or analysed during this study are included in this published article and its supplementary information files.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Zhang, Z. et al. Tackling dual-resource flexible job shop scheduling problem in the production line reconfiguration scenario: An efficient meta-heuristic with critical path-based neighborhood search. Adv. Eng. Inform.65, 103282 (2025). [Google Scholar]
  • 2.Liu, M. et al. A distributed competitive and collaborative coordination for multirobot systems. IEEE Trans. Mob. Comput.23(12), 11436–11448 (2024). [Google Scholar]
  • 3.Dchahar, V., Katoch, S. & Chauhan, S. S. A review on genetic algorithm: Past, present, and future. Multimedia Tools and Applications80(5), 8091–8126 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Song, B. Y., Wang, Z. D., & Zou, L. An improved PSO algorithm for smooth path planning of mobile robots using continuous high-degree Bezier curve. Applied Soft Computing Journal (2021).
  • 5.Li, X., Zhang, S. & Shao, P. Discrete artificial bee colony algorithm with fixed neighborhood search for traveling salesman problem. Eng. Appl. Artif. Intell.131, 107816 (2024). [Google Scholar]
  • 6.Shi, K. et al. Dynamic path planning of mobile robot based on improved simulated annealing algorithm. J. Franklin Inst.360(6), 4378–4398 (2023). [Google Scholar]
  • 7.Zhang, Z. et al. A hybrid biogeography-based optimization algorithm to solve high-dimensional optimization problems and real-world engineering problems. Appl. Soft Comput.144, 110514 (2023). [Google Scholar]
  • 8.Zhang, Z. & Gao, Y. Solving large-scale global optimization problems and engineering design problems using a novel biogeography-based optimization with Lévy and Brownian movements. Int. J. Mach. Learn. Cybern.14(1), 313–346 (2023). [Google Scholar]
  • 9.Zhang, Z., Gao, Y., & Guo, E. A supercomputing method for large-scale optimization: A feedback biogeography-based optimization with steepest descent method. Journal of Supercomputing, 79(2) (2023).
  • 10.Fan, J., Jin, L., Li, P., et al. Coevolutionary neural dynamics considering multiple strategies for nonconvex optimization. Tsinghua Science and Technology. 10.26599/TST.2025.9010120 (2025).
  • 11.Xue, J. K. & Shen, B. A novel swarm intelligence optimization approach: Sparrow search algorithm. Systems Science & Control Engineering8(1), 22–34 (2020). [Google Scholar]
  • 12.Jiang, R. Y., Yang, M., Wang, S. Y., & Chao, T. An improved whale optimization algorithm with armed force program and strategic adjustment. Applied Mathematical Modelling (2020).
  • 13.Siva Shankar, G., & Manikandan, K. Diagnosis of diabetes diseases using optimized fuzzy rule set by grey wolf optimization. Pattern Recognition Letters (2019).
  • 14.Arora, S. & Singh, S. Butterfly optimization algorithm: A novel approach for global optimization. Soft. Comput.25, 715–734 (2019). [Google Scholar]
  • 15.Azizi, M. Atomic orbital search: A novel metaheuristic algorithm. Applied Mathematical Modelling, 93(1) (2021).
  • 16.Storn, R. & Price, K. Differential evolution: A simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim.11(4), 341–359 (1997). [Google Scholar]
  • 17.Song, Y. et al. Dynamic hybrid mechanism-based differential evolution algorithm and its application. Expert Syst. Appl.213, 118834 (2023). [Google Scholar]
  • 18.Deng, W. et al. An adaptive differential evolution algorithm based on belief space and generalized opposition-based learning for resource allocation. Appl. Soft Comput.127, 109419 (2022). [Google Scholar]
  • 19.Zeng, Z. et al. Improved differential evolution algorithm based on the sawtooth-linear population size adaptive method. Inf. Sci.608, 1045–1071 (2022). [Google Scholar]
  • 20.Chai, X. et al. Multi-strategy fusion differential evolution algorithm for UAV path planning in complex environment. Aerosp. Sci. Technol.121, 107287 (2022). [Google Scholar]
  • 21.Zhou, J. et al. Parameters identification of photovoltaic models using a differential evolution algorithm based on elite and obsolete dynamic learning. Appl. Energy314, 118877 (2022). [Google Scholar]
  • 22.Yin, S. et al. Reinforcement-learning-based parameter adaptation method for particle swarm optimization. Complex & Intelligent Systems9(5), 5585–5609 (2023). [Google Scholar]
  • 23.Chen, R. et al. A self-learning genetic algorithm based on reinforcement learning for flexible job shop scheduling problem. Comput. Ind. Eng.149, 106778 (2020). [Google Scholar]
  • 24.Liu, M. et al. Activated gradients for deep neural networks. IEEE Transactions on Neural Networks and Learning Systems34(4), 2156–2168 (2021). [DOI] [PubMed] [Google Scholar]
  • 25.Yue, C., Shen, Y., Liang, J., et al. Hierarchical genetic algorithm for the multi-solution traveling salesman problem. IEEE Transactions on Evolutionary Computation (2025).
  • 26.Jin, L., Wei, L. & Li, S. Gradient-based differential neural-solution to time-dependent nonlinear optimization. IEEE Trans. Autom. Control68(1), 620–627 (2022). [Google Scholar]
  • 27.Zhang, J. & Sanderson, A. C. JADE: Adaptive differential evolution with optional external archive. IEEE Trans. Evol. Comput.13(5), 945–958 (2009). [Google Scholar]
  • 28.Tanabe, R., & Fukunaga, A. Success-history based parameter adaptation for differential evolution. In 2013 IEEE Congress on Evolutionary Computation (pp. 71–78). IEEE (2013).
  • 29.Tanabe, R., & Fukunaga, A. S. Improving the search performance of SHADE using linear population size reduction. In 2014 IEEE Congress on Evolutionary Computation (CEC) (pp. 1658–1665). IEEE (2014).
  • 30.Huang, H., Jin, L., & Zeng, Z. A momentum recurrent neural network for sparse motion planning of redundant manipulators with majorization-minimization. IEEE Transactions on Industrial Electronics (2025).
  • 31.Primatesta, S. Comprehensive task optimization architecture for urban UAV-based intelligent transportation system. Drones, 8. 10.3390/drones8090473 (2024).
  • 32.Zhu, J. et al. Unmanned aerial vehicle computation task scheduling based on parking resources in post-disaster rescue. Appl. Sci.10.3390/app13010289 (2022). [Google Scholar]
  • 33.Ompusunggu, V. M. M. O., Hardhienata, M. K. D., & Priandana, K. Application of ant colony optimization for the selection of multi-UAV coalition in agriculture. In 2020 International Conference on Smart City and Intelligent Systems (ICOSICA) (pp. 1–5). IEEE. 10.1109/ICOSICA49951.2020.9243226 (2020).
  • 34.Yan, H., Zhao, W., Chen, C., et al. MCTA: Multi-UAV collaborative target allocation to monitor targets with dynamic importance. In 2020 International Conference on Big Data and Artificial Intelligence (BigDIA) (pp. 1–6). IEEE. 10.1109/BigDIA51454.2020.00017 (2020).
