Scientific Reports. 2026 Feb 5;16:7403. doi: 10.1038/s41598-026-38269-1

NeuroAction: a neuroevolutionary approach to reinforcement learning for autonomous vehicles

Esther Aboyeji 1, Oladayo S Ajani 1, Ivan Fenyom 1, Rammohan Mallipeddi 1
PMCID: PMC12929635  PMID: 41644792

Abstract

End-to-end or Deep Reinforcement Learning-based control of autonomous vehicles generally leverages a sequence of decoupled perception-action protocols. One main limitation of such frameworks is the required backpropagation algorithm to optimize the underlying mapping function or policy network. This is because although the learning goal usually involves several objectives, they must be aggregated to realize a single objective loss utilized by the backpropagation algorithm. This also limits the preference-based driving behavior from a user perspective. To overcome these challenges, we present NeuroAction—a multi-objective neuroevolutionary method designed for reinforcement learning-based autonomous driving where several goals or objectives can be optimized simultaneously. Specifically, we propose a formulation of reinforcement learning-based control of autonomous vehicles as a multiobjective optimization problem. Consequently, any multiobjective evolutionary algorithm can be used to solve the resulting problem with the aim of generating a Pareto-front of optimal policy networks. In other words, the resulting framework is capable of generating policies that are suitable for providing users with different trade-offs based on their desired driving preferences. We investigated the proposed framework on a benchmark DRL-based autonomous driving task and presented performance evolution based on three different EMO algorithms.

Keywords: Evolutionary multi-objective optimization, Multi-objective reinforcement learning, Neuroevolution

Subject terms: Engineering, Mathematics and computing

Introduction

End-to-end or Deep Reinforcement Learning (DRL)-based methods for autonomous vehicles have gained much traction in the last few years1,2. The goal of the underlying agent or control system is to observe its environment by leveraging different sensory inputs or perception systems and consequently make decisions or perform actions to satisfy predefined driving goals. In other words, the agent maps sensory input (observations) to control output (actions) to satisfy or optimize driving goals formulated as rewards. DRL is a fusion of Deep Neural Network (DNN) and Reinforcement Learning (RL) where the underlying mapping function that maps sensory inputs to driving action in the context of autonomous driving is a DNN generally known as the policy network. To train this network, gradient-based methods are usually employed through the back-propagation algorithm. By leveraging the back-propagation algorithm, all the driving goals of the agent must be formulated into a single loss function (reward function) to be optimized. This makes the formulation of the reward function complex, requiring aggregation schemes that might be biased toward certain driving goals if the weights of the aggregation schemes are not properly chosen3,4.

From a computational point of view, gradient-based DRL methods are susceptible to exploding gradients5, require large amounts of memory because episodes must be stored for updates, are inefficient for distributed or parallel computing, and are very sensitive to hyperparameters, which can lead to failure if not properly set6,7. To address these challenges, researchers are leveraging black-box optimization of DRL policies through deep neuroevolution facilitated by Evolutionary Algorithms (EAs)6,8–10. Specifically, EAs are preferred over gradient-based methods in this context because they are faster and more memory-efficient (requiring only a forward pass), highly parallelizable, and less sensitive to hyperparameters11.

From an application point of view, the reliance of classical DRL methods on the back-propagation algorithm, and consequently on a rather complex reward formulation that aggregates all the driving objectives, eliminates specific user-driven adaptations or goals which are very important in certain driving scenarios. Although there are generic driving goals such as collision avoidance, speed, comfort (stability), etc., there are scenario-based goals that comprise different trade-offs of the generic goals. For example, when driving with a baby on board, human drivers will generally prioritize safety over speed. However, such preference-based driving is difficult to realize through pre-defined aggregated methods. Recently, it was demonstrated experimentally12 that vehicles with Advanced Driving Systems reported more accidents compared to Human-Driven Vehicles under dawn/dusk or turning conditions. This is generally because human drivers can adapt their driving objectives based on the scenarios within or outside the vehicles. To enable such capabilities in DRL-based autonomous driving, researchers are reformulating the underlying DRL tasks as Multi-Objective Markov Decision Processes (MOMDP) with multiple reward functions that reflect each of the driving objectives13. The goal here is to generate a set of solutions or policies that constitute the so-called Pareto-Front rather than a single optimal policy. Consequently, driving policies can be selected based on specific in-car and out-of-car scenarios. However, most of the algorithms proposed within this scope are still gradient-based and are very complex to implement.

From the above, it is clear that although traditional multi-objective DRL algorithms can consider different driving objectives, their reliance on the back-propagation algorithm makes them computationally inefficient. Interestingly, although EAs have been applied mainly to deep neuroevolution of DRL with a single aggregated objective function, EAs are also well-suited for Multi-objective optimization14. Recent studies have highlighted the promise of combining DRL and EAs to improve exploration, stability, and overall learning performance in complex control tasks15. Motivated by these findings, employing Evolutionary Multi-objective Optimization (EMO) algorithms to optimize DRL policies through multi-objective neuroevolution will not only allow for multiple driving objectives to be considered simultaneously with a set of optimal solutions generated, but also address the aforementioned computational issues in the context of Pareto-front approximation for multi-objective reinforcement learning. Therefore, this work proposes NeuroAction, a neuroevolutionary multi-objective reinforcement learning framework that transforms DRL-based autonomous driving tasks into multi-objective optimization problems. In NeuroAction, the policy networks of the driving agent are treated as black-boxes, and their weights are optimized through neuroevolution using multiple objectives and evolutionary multi-objective (EMO) algorithms. The suitability of the proposed framework is demonstrated using a well-known DRL-based autonomous driving task. To consider multiple driving objectives, we perform multi-objectivization16 of the aggregated reward and optimize the resulting multi-objective problem accordingly. In terms of optimization, we employed three baseline EMO algorithms from the three classes of EMO algorithms. 
The results of the experiments conducted show that the proposed framework is capable of generating a diverse set of policies that are trade-offs of the different driving goals, and these policies can be selected to facilitate user-based or scenario-based preferences. The contribution of this paper can be summarized as follows:

  1. We propose a framework that leverages the standard reward decomposition of traditional DRL tasks into multi-objective black-box optimization problems.

  2. The proposed framework, termed NeuroAction, is applied to optimize a well-studied deep reinforcement learning-based driving control environment.

  3. The resulting multi-objective black-box optimization problem is optimized based on three baseline EMO algorithms.

  4. The results show that the proposed framework is capable of generating optimal policies that can be leveraged to facilitate user-preference-based driving behaviors.

This paper is structured as follows: Section 2 presents background information on Multi-Objective Reinforcement Learning (MORL) and multi-objective neuroevolution. Section 3 describes the autonomous driving task and its multi-objectivization. In Section 4, the proposed problem definition is detailed. Section 5 presents and discusses baseline empirical evaluations conducted with three different EMO algorithms. Finally, the last section presents the conclusions and possible future research directions.

Preliminaries

This section introduces the background on MORL and multi-objective neuroevolution.

Multi-objective reinforcement learning (MORL)

MORL is a multi-objective sequential decision problem that is formally defined as a MOMDP. Generally, a MOMDP is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mu, \gamma, \mathbf{R} \rangle$, where:

  • $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, respectively.

  • $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the probabilistic transition function.

  • $\mu: \mathcal{S} \to [0, 1]$ defines the probability distribution over initial states.

  • $\gamma \in [0, 1)$ is the discount factor.

  • $\mathbf{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^{M}$ is a function that provides real-valued rewards for each of the $M$ objectives.

The key distinction between a MOMDP and a standard single-objective MDP (SOMDP) lies in the reward function $\mathbf{R}$. Contrary to SOMDP, where the reward is a scalar value, the reward in MOMDP is a vector where each element corresponds to the reward for each objective. Consequently, the reward vector’s length is the same as the number of objectives M. In other words, the interactions between the agent and the environment are the same in both single and multi-objective RL, except that in Single-Objective RL (SORL), the agent’s actions are evaluated in terms of a scalar reward value, while a vector of rewards is used to evaluate the agent in MORL. Comprehensive details on how this difference affects the policies and value functions in both SORL and MORL are presented in17.

Multi-objective neuroevolution

The goal of any MORL algorithm is to find a set of policies (Pareto-front) that maximizes the cumulative or average vector of rewards in a MORL task. Neuroevolution searches for these policies by formulating the search process as a black-box optimization and consequently uses gradient-free algorithms, especially Evolutionary algorithms18 to find these parametrized policies19. In the context of multi-objective DRL, where the policy and other networks that constitute the agent are formalized as neural networks, the aim of neuroevolution is to find a set of parameterized policies or agents that optimize all the objectives simultaneously. In other words, the policy search process is posed as an optimization problem where the weights of the learning networks that constitute the agents are the decision variables and the M objectives to be optimized are the M vector-valued rewards.

Given a reinforcement learning agent with a learning network $\pi_{\theta}$ parameterized by $\theta$, the goal of neuroevolution is to find the parameters $\theta^{*}$ that optimize the reward vector $\mathbf{R}$ using traditional gradient-free population-based algorithms. To evaluate the policy parameters, a full MDP cycle or episode is rolled out and the cumulative reward vector is taken as the objective value. Algorithm 1 presents a general structure for neuroevolution of Multi-Objective Deep Reinforcement Learning (MODRL), showing the basic evolution process based on Multi-Objective Evolutionary Algorithms.

Algorithm 1. Multi-objective deep neuroevolution for reinforcement learning.
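The generic evolution loop that Algorithm 1 refers to can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the variation operators (uniform crossover, Gaussian mutation), the front-peeling selection, and the two-objective `toy_evaluate` function (a stand-in for episode rollouts) are all hypothetical choices made to keep the example self-contained.

```python
import numpy as np

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return bool(np.all(a <= b) and np.any(a < b))

def select(F, n):
    """Keep n indices by peeling non-dominated fronts one after another."""
    remaining, chosen = list(range(len(F))), []
    while len(chosen) < n:
        front = [i for i in remaining
                 if not any(dominates(F[j], F[i]) for j in remaining if j != i)]
        chosen.extend(front[: n - len(chosen)])
        remaining = [i for i in remaining if i not in front]
    return chosen

def toy_evaluate(x):
    """Hypothetical stand-in for a rollout: two conflicting objectives."""
    return np.array([np.mean(x ** 2), np.mean((x - 1.0) ** 2)])

def neuroevolve(evaluate, dim, pop_size=20, generations=30, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.normal(0.0, 1.0, size=(pop_size, dim))   # initial weight vectors
    F = np.array([evaluate(x) for x in pop])            # evaluate each candidate
    for _ in range(generations):
        parents = pop[rng.integers(pop_size, size=(pop_size, 2))]
        mask = rng.random((pop_size, dim)) < 0.5         # uniform crossover
        children = np.where(mask, parents[:, 0], parents[:, 1])
        children = children + rng.normal(0.0, sigma, children.shape)  # mutation
        F_children = np.array([evaluate(x) for x in children])
        merged = np.vstack([pop, children])
        F_merged = np.vstack([F, F_children])
        keep = select(F_merged, pop_size)                # environmental selection
        pop, F = merged[keep], F_merged[keep]
    return pop, F
```

In an actual MODRL setting, `evaluate` would decode the weight vector into the agent's networks and return the negated cumulative reward vector of a rollout; established MOEAs additionally use diversity-preserving mechanisms that this sketch omits.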

MODRL task generation and multiobjectivization

The learning process in MODRL typically consists of an agent interacting with an environment to optimize a set of rewards earned through its actions. To achieve a full MODRL training cycle, various tools are necessary to define both the environment and the agent (policy network and learning algorithm). In this work, Highway-envs20: a collection of environments for decision-making in Autonomous Driving formalized according to the standard Gym21 structure, is employed. For the agent, we leverage Stable Baselines3 (SB3), which is a widely recognized library of reinforcement learning algorithms and models that integrates smoothly with the standard Gym framework22. Although Highway-envs and SB3 are originally designed for single-objective tasks (with scalarized reward formulation) and DRL algorithms, they are only used in this work to generate the task and the policy network of the agents and initialize the policy network parameters. Consequently, the generated task is transformed into a multiobjective task through the multi-objectivization process explained in Section 3.2. By using Highway-envs and SB3, the approach proposed in this work can be extended to other tasks and agents proposed in both frameworks, respectively.

Autonomous driving task

This section describes the DRL task environment depicted in Fig. 1 in terms of the system of equations or dynamics that maps inputs (actions) to outputs (states), a definition of states or observation available to the agent, a definition of possible actions by the agent, and the reward that reflects the performance of the agents in terms of the action taken and the predefined goal of the environment. In the context of Autonomous Driving, the environment models the driving conditions (scenarios) and modifies those scenarios relative to the agent’s actions.

Fig. 1. Overview of the DRL task20.

Task environment

The task environment describes the dynamic interactions of all components in the underlying autonomous driving task in terms of the vehicles, the road network, and the control systems. Here, all the vehicles are modeled in terms of the Kinematic Bicycle model23, which can be expressed as in (1),

$$
\begin{aligned}
\dot{x} &= v \cos(\psi + \beta), \\
\dot{y} &= v \sin(\psi + \beta), \\
\dot{v} &= a, \\
\dot{\psi} &= \frac{v}{l} \sin\beta, \\
\beta &= \tan^{-1}\!\left(\tfrac{1}{2}\tan\delta\right),
\end{aligned} \tag{1}
$$

where $x$ and $y$ represent the coordinates of the vehicle’s central position along the $x$- and $y$-axes, respectively, $v$ denotes the vehicle’s speed, while $\dot{\psi}$ corresponds to its angular velocity (with $\psi$ the heading and $l$ the vehicle length). The parameters $a$ and $\beta$ denote the acceleration and the slip angle at the center of gravity, respectively, and $\delta$ is the front wheel angle, which serves as the steering command. Accordingly, the dynamics of the vehicle are affected mainly by the road structure and the behavioral models of the vehicles. Here, the road layout consists of a multilane highway, where all vehicles, except the Ego vehicle, employ a classical yet realistic model that regulates their acceleration and steering (i.e., behavior). Specifically, the acceleration of the vehicles is governed by the Intelligent Driver Model (IDM)24, while lane-changing behavior is determined by the Minimizing Overall Braking Induced by Lane Change (MOBIL) model25. Further details on the system’s implementation can be found in20. Meanwhile, the Ego vehicle’s acceleration and steering are controlled based on the RL agent’s chosen actions.
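To make the vehicle dynamics concrete, the sketch below advances a Kinematic Bicycle model by one explicit Euler step. It assumes the commonly documented form of the model (position, speed, heading, and a slip angle derived from the front wheel angle); the vehicle length `length`, the step size `dt`, and the function name are illustrative assumptions and are not taken from the paper.

```python
import math

def bicycle_step(x, y, v, psi, a, delta, length=5.0, dt=0.1):
    """One Euler step of a kinematic bicycle model.

    x, y   : position of the vehicle center
    v      : forward speed
    psi    : heading angle
    a      : acceleration command
    delta  : front wheel (steering) angle
    length : vehicle length (assumed value)
    dt     : integration step (assumed value)
    """
    beta = math.atan(0.5 * math.tan(delta))      # slip angle at center of gravity
    x_next = x + v * math.cos(psi + beta) * dt
    y_next = y + v * math.sin(psi + beta) * dt
    psi_next = psi + (v / length) * math.sin(beta) * dt
    v_next = v + a * dt
    return x_next, y_next, v_next, psi_next
```

With zero steering and zero acceleration the vehicle simply moves straight ahead at constant speed, which is a quick sanity check on the sign conventions.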

State definition

In most DRL-based autonomous driving tasks, the agent observes the position and velocity of all vehicles within the Ego vehicle’s vicinity as well as those of the Ego vehicle26–28. Therefore, the observation comprises a matrix of size $N \times F$, where N is the number of nearby vehicles and F is the number of features. In this case, the features considered are the positions and velocities in the x- and y-axes, respectively.

Action definition

In a driving scenario, a vehicle has multiple possible actions. It can keep its current speed, speed up, slow down, or switch lanes by moving left or right while avoiding collisions. Here, although the agent is responsible for controlling the speed and steering of the Ego vehicle, the action space is discrete and bounded between 0 and 4 as expressed in (2) with each corresponding to a specific maneuver. To ensure stability, the DRL-based controller operates on top of a low-level vehicle kinematics controller. This means that an additional layer of speed and steering controllers is implemented, and the RL agent’s actions serve as reference points for these low-level controllers.

$$
\mathcal{A} = \{\, 0: \text{lane left},\ 1: \text{idle},\ 2: \text{lane right},\ 3: \text{faster},\ 4: \text{slower} \,\} \tag{2}
$$

Reward definition

In this task, the agent’s goal is to drive the Ego vehicle to maintain a high average speed, avoid collisions, and stay in the rightmost lane whenever possible. Consequently, to achieve these goals, the reward is defined as follows: a penalty of $-1$ is assigned in case of a collision, a reward of 0.1 is given for driving in the rightmost lane at each time step, and a reward mapped linearly to the range [0, 0.4] over the Ego vehicle’s speed range $[v_{\min}, v_{\max}]$ m/s is given to encourage driving at high speed. Notably, lane-changing actions do not receive any direct rewards in this setup. In other words, the total reward at each time step t is computed as:

$$
r_t = r_t^{\text{collision}} + r_t^{\text{lane}} + r_t^{\text{speed}} \tag{3}
$$

where $r_t^{\text{collision}}$ is either 0 or -1 (collision penalty), $r_t^{\text{lane}}$ is either 0 or 0.1 (right-most lane reward), and $r_t^{\text{speed}}$ falls within the range [0, 0.4] depending on the speed of the Ego vehicle.
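The per-step reward described above can be sketched as a small function that returns the aggregated scalar together with its three components. The function name and the speed bounds `v_min` and `v_max` are placeholders chosen for illustration; the values actually used in the experiments should be taken from the environment configuration.

```python
def step_reward(speed, crashed, on_rightmost_lane, v_min=20.0, v_max=30.0):
    """Return (total, (r_speed, r_lane, r_collision)) for one time step.

    v_min and v_max are assumed placeholder bounds of the rewarded speed range.
    """
    r_collision = -1.0 if crashed else 0.0
    r_lane = 0.1 if on_rightmost_lane else 0.0
    # Linear mapping of speed onto [0, 0.4], clipped outside the range.
    fraction = min(max((speed - v_min) / (v_max - v_min), 0.0), 1.0)
    r_speed = 0.4 * fraction
    total = r_collision + r_lane + r_speed
    return total, (r_speed, r_lane, r_collision)
```

Keeping the components separate makes it straightforward to switch between the aggregated scalar used by standard DRL and a per-objective view of the same signal.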

To implement the environment based on the task description and interactions, we leveraged HighwayEnv20, a popular open-source 2D autonomous driving simulation environment. The modeling details, including vehicle speed and vehicle density, follow the same configurations established in prior studies29,30 that also used this simulation framework.

Multiobjectivization

As expressed in Section 3.1.4, the reward function is an aggregated function that reflects multiple driving objectives. Although this is the standard formulation employed generally in DRL literature for this task, solving the DRL task with this reward formulation would only result in a single driving policy and does not provide any flexibility or trade-offs in cases of user preference or scenario-based driving goals. Consequently, we perform multiobjectivization by splitting up the reward function to realize three different driving objectives. Specifically, the resulting reward function is a 3-dimensional vector such that:

$$
\mathbf{r}_t = \left[\, r_t^{\text{speed}},\ r_t^{\text{lane}},\ r_t^{\text{collision}} \,\right] \tag{4}
$$

Specifically, both $r_t^{\text{speed}}$ and $r_t^{\text{lane}}$, related to the high-speed and rightmost-lane objectives, are to be maximized, while collisions, penalized through $r_t^{\text{collision}}$, are to be minimized. In other words, Eq. (4) can be rewritten from an optimization point of view, which is generally posed as a minimization problem, as:

$$
\min_{x}\ \mathbf{F}(x) = \left[\, f_1(x),\ f_2(x),\ f_3(x) \,\right] \tag{5}
$$

where $x$ is a solution candidate which, in this context, is made up of the weights of the policy network, and $f_1$, $f_2$, and $f_3$ are the minimization counterparts of the speed, lane, and collision objectives, respectively. It is important to note that since collision does not conflict with driving at high speed or driving in the rightmost lane, the collision objective is going to be featured in all the trade-off policies generated by the EMO algorithm. However, driving at high speed and driving in the rightmost lane conflict with each other. From an application point of view, a user who is most concerned about speed can select policies that give more priority to speed without any consideration for driving in the rightmost lane, which is arguably the slowest and safest lane on a highway. By contrast, solutions or policies that prioritize driving in the right lane can be selected when safety is prioritized.
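As an illustration of how a user might pick one policy from a set of trade-off policies, the sketch below normalizes each objective and returns the index of the policy with the smallest preference-weighted score. This a posteriori weighting scheme is an assumption made for illustration only; the paper does not prescribe a specific selection rule, and the objective values in the usage note are hypothetical.

```python
import numpy as np

def pick_policy(F, weights):
    """Return the index of the preferred policy.

    F       : array of shape (n_policies, n_objectives), all to be minimized
    weights : non-negative user preferences, one per objective
    """
    F = np.asarray(F, dtype=float)
    span = F.max(axis=0) - F.min(axis=0)
    span[span == 0] = 1.0                      # avoid division by zero
    scaled = (F - F.min(axis=0)) / span        # each objective mapped to [0, 1]
    return int(np.argmin(scaled @ np.asarray(weights, dtype=float)))
```

For a hypothetical front `[[-16, -1, 0], [-12, -3, 0], [-8, -4, 0]]` (negated speed, negated lane, and collision objectives), weights `[1, 0, 0]` select the fastest policy (index 0), whereas `[0, 1, 0]` select the policy that stays most in the rightmost lane (index 2).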

We emphasize that the reward decomposition adopted in this section follows a standard and widely used multi-objective formulation and does not introduce a new MOMDP definition. The novelty of this work does not lie in the multi-objectivization itself, but in treating deep DRL policy optimization as a black-box multi-objective optimization problem, which enables the application of population-based neuroevolutionary algorithms to directly search for diverse Pareto-optimal policies under this formulation.

Multi-objective optimization problem definition

Transforming a reinforcement learning (RL) task into an optimization problem suitable for evolutionary algorithms is non-trivial. Most DRL agents consist of multiple networks that jointly determine the agent’s actions. Consequently, the weights of these networks must be encoded for evolutionary optimization and decoded during the evaluation of objectives within the Markov Decision Process (MDP) cycle. In this section, we provide a detailed formulation of the multi-objective optimization problem, emphasizing solution representation, function evaluation, and the key distinctions between NeuroAction and traditional gradient-based DRL methods.

Solution representation

As previously stated in Section 3, the agent in this work is built upon the SB3 framework. In the SB3 paradigm, agents generally consist of two sub-networks: a feature extraction network and a policy network. The feature extraction network extracts state features or information from the agent’s observation of the task environment. This extracted information is then passed to the policy network, which is responsible for mapping these states to corresponding actions. In other words, both networks affect the agent’s actions and need to be evolved to optimize the reward. Although SB3 provides functions to access the parameters of these networks, they are structured as a set of dictionary elements, where each element corresponds to weight matrices stored within a network state dictionary. Consequently, they cannot be directly optimized using Multi-Objective Evolutionary Algorithms (MOEAs) in their current form or representation. To enable seamless integration with existing MOEAs and frameworks, the parameters of these learning networks must first be converted into a vectorized format. This transformation presents a challenge, as an ineffective vectorization process could result in the loss of essential problem characteristics, leading to random optimization outcomes. If the relative positions of parameters are not preserved across generations, the dedicated reproduction operators (mutation and crossover) in MOEAs would no longer function as intended. To address this issue, we introduce an encoding scheme in which each parameter of the network dictionary is vectorized. Simultaneously, the corresponding key, along with the start and end of the element within the vector, is stored, and this newly constructed vector serves as the decision variable for optimization. Consequently, the dimension of the resulting optimization problem is the total number of trainable parameters (weights) of the two sub-networks. 
The DRL agent and its architecture can be chosen by the user, thereby influencing the problem’s dimensionality and landscape. To maintain consistency, the Advantage Actor-Critic (A2C) algorithm was used for the experiment in this study. The architecture of the feature extraction network is set as proposed in SB3, and a DNN with 2 fully connected layers with 32 units each is used as the policy network.
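The encoding scheme described above (flatten every entry of the network state dictionary, and remember each key with its start and end position) can be sketched as follows. The sketch uses NumPy arrays as a stand-in for network tensors so that it stays self-contained; the parameter names in the usage note are hypothetical, and storing the original shape alongside the key and indices is an added convenience assumed here.

```python
import numpy as np

def encode(state_dict):
    """Flatten a dict of arrays into one decision vector plus its layout."""
    layout, chunks, start = [], [], 0
    for key, value in state_dict.items():
        arr = np.asarray(value, dtype=float)
        # Record key, start index, end index, and shape for later decoding.
        layout.append((key, start, start + arr.size, arr.shape))
        chunks.append(arr.ravel())
        start += arr.size
    return np.concatenate(chunks), layout

def decode(vector, layout):
    """Rebuild the dict of arrays from a decision vector and its layout."""
    vector = np.asarray(vector, dtype=float)
    return {key: vector[s:e].reshape(shape) for key, s, e, shape in layout}
```

For example, a dictionary with a hypothetical `"policy.0.weight"` of shape (32, 4) and `"policy.0.bias"` of shape (32,) encodes to a vector of length 160, and `decode(encode(d)[0], layout)` restores the original arrays. Because the layout is fixed once, every position in the vector refers to the same network weight across generations, which is what crossover and mutation rely on.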

Function evaluation

From an optimization point of view, the goal of our optimization process is to initialize a population of policy networks and evolve their weights to obtain a set of policies that optimize the underlying reward vector. To quantify the associated objectives, we define the fitness functions as the average rewards accumulated over episodes of the MODRL task. In DRL, an episode is a sequence of interactions between the agent and the environment, starting from an initial state and ending when a terminal state is reached. Essentially, it represents the total number of interactions the agent can engage in before reaching the terminal state31. To ensure stability and robustness of the objective evaluation, we compute each objective as the negative average cumulative reward over multiple episodes. Specifically, for the three driving objectives (speed, lane positioning, and collision avoidance), the fitness functions are defined as:

$$
f_1(x) = -\frac{1}{K}\sum_{j=1}^{K} R_j^{\text{speed}}, \qquad
f_2(x) = -\frac{1}{K}\sum_{j=1}^{K} R_j^{\text{lane}}, \qquad
f_3(x) = -\frac{1}{K}\sum_{j=1}^{K} R_j^{\text{collision}} \tag{6}
$$

where x is a d-dimensional decision vector representing the weights of the policy networks, $K$ is the number of episodes used to compute the average, and $R_j^{\text{speed}}$, $R_j^{\text{lane}}$, and $R_j^{\text{collision}}$ are the cumulative rewards returned by the DRL agent for episode j corresponding to speed, lane position, and collision avoidance, respectively. The negative sign converts the maximization of rewards into a minimization problem compatible with multi-objective evolutionary algorithms. The overall objective vector is then defined as $\mathbf{F}(x) = [f_1(x), f_2(x), f_3(x)]$, capturing the conflicting driving goals to be optimized simultaneously, and $K$ is set to 2 for the analysis conducted in this work to balance evaluation reliability and computational efficiency. In traditional DRL algorithms, the return is typically computed using a discount factor $\gamma$, which emphasizes immediate rewards over those occurring further in the future. While SB3 internally applies this discounting mechanism (with its configured value of $\gamma$) during standard gradient-based training, we clarify that it does not influence the evolutionary optimization within our proposed framework. During MOEA-based policy optimization, the environment outputs undiscounted episodic cumulative rewards, and these values are used directly in the objective functions without applying any discounting.

In NeuroAction, the function evaluation begins by transforming (decoding) the candidate solutions produced by the MOEAs into the trainable parameters of the agent’s neural networks. This transformation uses the stored keys and indices from the initial encoding step to reconstruct a new network state dictionary parameterized by the solutions from the MOEAs. Once reconstructed, the state dictionary is loaded into the agent’s networks, which are then assessed across two episodes using a standard MODRL protocol. During these episodes, the agent accumulates rewards based on its actions, and the average cumulative reward across both runs is used as the objective vector for the given solution. This method mirrors classical deep neuroevolution, where full networks are evaluated on the entire training data, and the performance metric (typically measured by average accuracy) is taken as the fitness score. In this optimization scenario, the fitness guiding the MOEA is defined as the negative average reward from the two episodes. This negative sign is applied because traditional MOEA tools are built for minimization tasks, whereas DRL frameworks focus on reward maximization. Adopting this convention ensures both compatibility with existing MOEA tools and fair benchmarking against other MODRL methods.
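The evaluation procedure described above (roll out the decoded policy for a fixed number of episodes, accumulate the undiscounted reward vector, average, and negate) can be sketched as below. The environment `ToyVectorEnv` and the single-layer `linear_policy` are hypothetical stand-ins introduced only to make the example runnable; they do not reproduce the driving task or the agent architecture used in the paper.

```python
import numpy as np

class ToyVectorEnv:
    """Hypothetical stand-in environment emitting a 3-component reward."""
    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.normal(size=4)

    def step(self, action):
        self.t += 1
        # Components: speed, lane, collision (collision never occurs here).
        reward = np.array([0.4 * (action == 3), 0.1 * (action == 2), 0.0])
        return self.rng.normal(size=4), reward, self.t >= self.horizon

def linear_policy(weights, obs):
    """Hypothetical policy: one linear layer from 4 features to 5 actions."""
    return int(np.argmax(weights.reshape(5, 4) @ obs))

def evaluate(weights, env, episodes=2):
    """Negative average cumulative reward vector over the given episodes."""
    totals = np.zeros(3)
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done = env.step(linear_policy(weights, obs))
            totals += reward            # undiscounted accumulation
    return -totals / episodes           # negate so that the MOEA minimizes
```

Averaging over more episodes reduces the noise in the fitness estimate at the price of proportionally more simulation time per candidate.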

This approach differs from traditional DRL, where network weights are updated incrementally using gradient-based backpropagation. In NeuroAction, the entire network is treated as a candidate solution, allowing direct population-based optimization of multiple objectives simultaneously. This offers several advantages: it naturally balances conflicting objectives, mitigates the risk of converging to local optima, and generates a diverse set of policies that can be selected based on user preferences or scenario requirements. The framework integrates seamlessly with SB3, as weights are extracted, optimized via evolutionary operations, and reloaded into the agent, preserving the network architecture and maintaining compatibility with existing DRL implementations. Overall, this population-based approach enables more robust exploration and improved Pareto-front approximation in the context of multi-objective reinforcement learning with respect to traditional preference-based MORL methods.

Experiments

This section details the experiments carried out in this study, which are based on the reformulated autonomous driving task and three selected baseline EMO algorithms. We first describe the experimental setup from an optimization perspective, including the problem formulation and the EMO methods utilized. Subsequently, we present and analyze the results obtained from these experiments.

Experimental settings

From an optimization perspective, the DRL driving task considered in this work is a MOP with 3 objectives (M) and 2053 decision variables or problem dimension (D). Consequently, our choice of EMO algorithms is first motivated by the number of objectives. In EAs, there are algorithms designed specifically for MOPs with 2 or 3 objectives32 and those designed for problems with 3 or more objectives33. Consequently, we select three baseline algorithms within those designed for 2 or 3 objectives. Secondly, EMO algorithms are generally classified as either Pareto-based, decomposition-based, or indicator-based algorithms based on their feature mechanisms18. Within each of these respective classes, NSGA-II34, MOEA/D35, and HypE36 are the standard and most widely used algorithms. Hence, we employ NSGA-II, MOEA/D, and HypE for the analysis conducted in this work. Furthermore, we included NSGA-III and MOEA/DD, which are variants of NSGA-II and MOEA/D, respectively, designed for many-objective problems. The hyperparameters for each of the algorithms are set as proposed in their original papers. Other problem-specific parameters are set such that the population size N is 50 and the maximum number of function evaluations is 1000.

The optimization results are presented in terms of the Hypervolume (HV) metric. The HV metric is widely used in EMO problems where the true Pareto front is unknown. It measures the volume covered by the Pareto-front obtained by an algorithm relative to a predefined reference point37. Accordingly, the higher the Hypervolume, the better the algorithm. The reference point is set as Inline graphic, and the hypervolume is calculated accordingly.
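For intuition about the metric, the sketch below computes the exact hypervolume of a small set of points for a minimization problem by splitting the objective space into grid cells and summing the volume of every cell dominated by at least one point. It is a simple illustration whose cost grows quickly with the number of points, not the routine used in the experiments, and the reference point in the usage note is hypothetical.

```python
from itertools import product

import numpy as np

def hypervolume(points, ref):
    """Exact hypervolume (minimization) of points with respect to ref."""
    pts = np.asarray(points, dtype=float)
    ref = np.asarray(ref, dtype=float)
    pts = pts[np.all(pts < ref, axis=1)]       # ignore points not better than ref
    if len(pts) == 0:
        return 0.0
    n_obj = pts.shape[1]
    # Sorted cell boundaries along each objective, closed by the reference point.
    edges = [np.unique(np.append(pts[:, k], ref[k])) for k in range(n_obj)]
    volume = 0.0
    for idx in product(*[range(len(e) - 1) for e in edges]):
        lower = np.array([edges[k][i] for k, i in enumerate(idx)])
        upper = np.array([edges[k][i + 1] for k, i in enumerate(idx)])
        # A cell counts if some point is at least as good as its lower corner.
        if np.any(np.all(pts <= lower, axis=1)):
            volume += float(np.prod(upper - lower))
    return volume
```

For instance, the two points `(1, 2, 0)` and `(2, 1, 0)` with reference point `(3, 3, 1)` cover a volume of 3: each point alone covers 2, and their overlap of 1 is counted once.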

Results and discussions

The results of this work are presented in terms of median Hypervolume (HV) computed over 6 independent runs of each EMO algorithm and the dispersion across runs measured in terms of the interquartile range. While the median HV values reflect an algorithm’s performance in optimizing the reward or objective function, the dispersion values measure the reliability of the algorithms from a DRL perspective38. Additionally, statistical tests based on Wilcoxon’s rank sum test are conducted at a significance level of 0.05 to show the significance of the results. Accordingly, “+” denotes the best algorithm, “=” denotes that the algorithm is statistically equal to the best algorithm, and “−” denotes that the associated algorithm is statistically inferior to the best-performing algorithm.
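The two summary statistics reported per algorithm (median HV and interquartile range across runs) can be computed as in the short sketch below. The function name is illustrative, and the input is assumed to be the list of final HV values, one per independent run.

```python
import numpy as np

def summarize_runs(hv_values):
    """Return (median, interquartile range) of per-run hypervolume values."""
    hv = np.asarray(hv_values, dtype=float)
    q1, q3 = np.percentile(hv, [25, 75])
    return float(np.median(hv)), float(q3 - q1)
```

A pairwise significance check between two algorithms' run-level HV values could then be carried out with a rank-sum test, for example `scipy.stats.ranksums`, assuming SciPy is available.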

From the results presented in Table 1, NSGA-II demonstrated the best performance in terms of the median HV values compared to the other algorithms. The statistical analysis based on the rank-sum test also shows that the performance difference of NSGA-II is significant relative to MOEA/D and MOEA/DD. In contrast, the difference is statistically insignificant compared with NSGA-III and HypE. In other words, NSGA-II is statistically superior to both MOEA/D and MOEA/DD, and statistically equal to NSGA-III and HypE.

Table 1. Median Hypervolume and dispersion across runs measured over 6 independent runs of each algorithm.

| Algorithm | Median HV | Significance | Dispersion across runs |
|-----------|-----------|--------------|------------------------|
| NSGA-II   | 1.1201e+5 | +            | 1.65E+04               |
| NSGA-III  | 1.0253e+5 | =            | 9.59E+03               |
| MOEA/DD   | 7.6506e+4 | -            | 1.53E+04               |
| HypE      | 9.8542e+4 | =            | 1.19E+04               |
| MOEA/D    | 8.8373e+4 | -            | 1.69E+04               |

A class-wise comparison also shows that Pareto-based algorithms (NSGA-II and NSGA-III) outperform algorithms from other classes with the decomposition-based algorithms (MOEA/D and MOEA/DD) performing worst. It can be concluded that the performance edge of dominance or Pareto-based algorithms is due to their ability to optimize all the objectives concurrently without breaking them into subproblems. On the other hand, the worst performance of decomposition-based algorithms can be attributed to the fact that although multiple objectives are considered, both objectives one (related to speed) and two (related to the right lane) cannot be optimized without objective three (related to collision) being optimized. In general, the performance of these different algorithms shows that the choice of algorithm is very important and should be dependent on the nature or characteristics of the DRL task from an optimization point of view.

From the dispersion across runs (DARs) presented in Table 1 and the normalized DARs presented in Fig. 2, NSGA-III can be considered more reliable than all the other algorithms. Interestingly, although both MOEA/D and MOEA/DD performed poorly based on the median HV values, they are ranked ahead of NSGA-II and HypE in terms of DARs. However, because a poorly performing algorithm can still exhibit a low DARs value, reliability is generally used as a secondary criterion and is best compared among the top-performing algorithms. Consequently, from a reliability point of view, NSGA-III can be chosen ahead of NSGA-II.

Fig. 2.

Plots of normalized dispersion across runs.

Pareto front-based analysis

To gain further insights into the performance of each algorithm, we present the Pareto front (Pareto-optimal policies) obtained by each algorithm. The analysis of the obtained Pareto Fronts (PFs) in Fig. 3 reveals a critical relationship between an algorithm’s selection pressure and its final HV score, particularly concerning the achievement of extreme-end solutions. While the decomposition-based algorithms (MOEA/D and MOEA/DD) and the indicator-based HypE demonstrate relatively superior diversity and uniformity across the objective space, their lower HV scores are directly attributed to their inability to secure the optimal extreme solution corresponding to objective one (speed). Specifically, the speed objective for MOEA/D ranges between 10 and 30, MOEA/DD ranges between 220 and 40, and HypE ranges between 10 and 50. Consequently, their resulting trade-off policies in objectives two and three are ultimately dominated by the objective-one extreme solution found by NSGA-II. Since the Hypervolume metric is highly sensitive to non-dominated points closest to the axes (the extreme trade-offs), this small but critical convergence deficit at the objective-one boundary significantly penalizes the overall HV score of these algorithms. Conversely, the performance edge of NSGA-II, resulting in the best median HV, stems from its ability to generate a solution set that, though often visually sparse or less uniform than its decomposition counterparts, contains the single most dominant solution at the minimum extreme of objective one. This superior extreme-point convergence captures a disproportionately large area of the objective space near the ideal point, which maximizes the HV contribution and overrides any perceived deficiency in solution density. NSGA-III behaves similarly to NSGA-II, thereby maintaining a competitive HV ranking.
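The sensitivity of HV to extreme points can be illustrated with a small two-objective example. The sketch below is a simplification of the three-objective setting in this study: it assumes both objectives are minimized, uses a hypothetical reference point, and compares two made-up fronts (a dense, uniform one and a sparse one that contains an extreme point). None of the numbers come from the reported experiments.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Keep only points that improve on the reference point, sorted by f1.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:  # non-dominated so far, so it adds a new rectangle
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


ref = (10, 10)  # hypothetical reference point
dense_front = [(4, 6), (5, 5), (6, 4), (7, 3), (8, 2)]  # uniform, no extreme point
sparse_front = [(1, 7), (6, 4), (8, 2)]  # sparse, but reaches the f1 extreme

print(hypervolume_2d(dense_front, ref))   # 38.0
print(hypervolume_2d(sparse_front, ref))  # 43.0
```

Although the sparse front has fewer and less evenly spaced points, the single extreme point at low f1 contributes a large rectangle, so its hypervolume is higher.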

Fig. 3.

Pareto-optimal policies (non-dominated solutions) obtained by NeuroAction under five multi-objective evolutionary algorithms (NSGA-II, NSGA-III, MOEA/D, MOEA/DD, and HypE). The axes correspond to the objectives: objective one (speed), objective two (preference for driving on the rightmost lane), and objective three (collision avoidance).

Using NSGA-II as a case study, we analyze the driving behavior of selected solutions on its Pareto Front. Specifically, we present the behaviors of two policies on the Pareto front of NSGA-II, depicted in red in Fig. 3. The first policy, shown in Fig. 4, corresponds to a reward vector of [7.60, 487, 0]. Based on this reward vector, it is expected that the agent prioritizes maintaining the right lane over achieving high speed. Indeed, this is consistent with the observed behavior: at the beginning of the driving episode, the agent immediately moved to the right-most lane and greedily maintained it throughout the episode. Ultimately, the agent collided with another vehicle ahead in the same lane. This behavior illustrates a conservative driving strategy that favors positional safety at the cost of speed. The second policy, with an objective vector of [52.80, 463.50, 0] as shown in Fig. 5, demonstrates a different strategy. Initially, the agent also moved to the right-most lane, but it did so after observing that other vehicles were switching lanes, leaving the right-most lane free for high-speed driving. As a result, the agent was able to drive the Ego vehicle at a relatively high speed while avoiding collisions, thereby attaining a high reward value on objective one (speed). This behavior exemplifies a more aggressive strategy that balances speed and safety, exploiting the dynamics of other agents in the environment.

Fig. 4.

Performance of a Pareto optimal policy depicting 8 consecutive frames from start to collision. The policy obtained a reward vector of [7.60, 487, 0] as depicted in red in Fig. 3.

Fig. 5.

Performance of a Pareto optimal policy depicting 9 frames (the first 6 consecutive frames followed by frames 50, 100, and 150, respectively). The policy obtained a reward vector of [52.80, 463.50, 0] as depicted in red in Fig. 3.

In general, it should be noted that lane changes are not explicitly rewarded in this framework. Consequently, agents tend to minimize lane changes, since such maneuvers do not directly improve multiple driving objectives. As a result, some agents avoid collisions by simply reducing their speed, though this strategy can still lead to accidents if the vehicle ahead moves unusually slowly, as observed in Fig. 4. These two examples highlight the behavioral diversity of solutions along the Pareto Front. Each solution represents a distinct trade-off among the objectives: some policies prioritize positional safety, others speed, and yet others balance both. This diversity suggests that a preference-based selection strategy can be applied to choose policies according to specific driving goals. Ultimately, the Pareto Front provides a spectrum of policies, each with unique behavioral characteristics, enabling flexible, goal-directed decision-making in multi-objective driving tasks.
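That the two policies shown in Figs. 4 and 5 represent genuine trade-offs can be checked with a Pareto-dominance test on their reward vectors. The sketch below assumes that every component of the reward vector is to be maximized, which is an assumption about the sign convention rather than something stated in this section. The third vector is hypothetical and is included only to show a dominated case.

```python
def dominates(a, b):
    """True if reward vector `a` Pareto-dominates `b` (all components maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(vectors):
    """Return the vectors that are not dominated by any other vector."""
    return [v for i, v in enumerate(vectors)
            if not any(dominates(u, v) for j, u in enumerate(vectors) if j != i)]


policy_fig4 = [7.60, 487.0, 0.0]     # reward vector from the Fig. 4 caption
policy_fig5 = [52.80, 463.50, 0.0]   # reward vector from the Fig. 5 caption
hypothetical = [5.0, 450.0, 0.0]     # made-up vector, worse on speed and lane

print(dominates(policy_fig4, policy_fig5))  # False
print(dominates(policy_fig5, policy_fig4))  # False
print(non_dominated([policy_fig4, policy_fig5, hypothetical]))
```

Neither published vector dominates the other: one is better on the right-lane objective and the other on speed, which is why both can sit on the same front.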

Comparison of NSGA-II and gradient-based multi-objective reinforcement learning

To provide a principled comparison with gradient-based deep reinforcement learning in the multi-objective setting, we evaluate our proposed neuroevolutionary framework against Multi-Objective Reinforcement Learning using Policy Gradients (MORL-PG). MORL-PG represents the closest gradient-based analogue to our approach, as it explicitly addresses vector-valued rewards through preference-based scalarization and approximates the Pareto front by training multiple policies under different weight vectors. Following established formulations in multi-objective reinforcement learning, we adopt a preference-based policy-gradient baseline that learns multiple policies under distinct scalarization weights to recover diverse trade-off solutions39,40. In our implementation, MORL-PG employs a policy network consisting of a deep neural network with two fully connected hidden layers of 32 units each and ReLU activations, followed by a discrete action output layer. Training is performed using a REINFORCE-style policy gradient update with linear reward scalarization, matching the environment, episode horizon, and evaluation protocol used in our evolutionary experiments.
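The core of such a baseline, linear scalarization followed by a REINFORCE-style update, can be sketched as follows. This is not the authors' implementation: the weight vector, discount factor, and per-step rewards are hypothetical, and the gradient is shown only at the logits of a softmax policy (back-propagation through the two 32-unit hidden layers is omitted for brevity).

```python
import math


def scalarize(reward_vec, weights):
    """Linear scalarization: weighted sum of a vector-valued reward."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def returns_to_go(rewards, gamma=0.99):
    """Discounted return G_t for every step of one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_grad(logits, action, g):
    """Gradient of G_t * log pi(action) with respect to the policy logits."""
    probs = softmax(logits)
    return [g * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]


# Hypothetical preference weights and per-step reward vectors (speed, lane, collision).
weights = (0.5, 0.3, 0.2)
episode = [(1.0, 2.0, 0.0), (0.5, 2.0, 0.0), (0.0, 1.0, -1.0)]

scalar_rewards = [scalarize(r, weights) for r in episode]
returns = returns_to_go(scalar_rewards)
print(reinforce_logit_grad([0.0, 0.0], action=0, g=returns[0]))
```

Each weight vector yields one scalar learning problem and therefore one policy, so approximating a front requires repeating this training loop once per preference.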

The quantitative comparison is summarized in Table 2, where performance is assessed using the hypervolume (HV) indicator computed over the learned Pareto sets. NSGA-II achieves a median HV of Inline graphic with a DARs of Inline graphic, corresponding to a coefficient of variation (CV) of approximately 0.15, which indicates relatively stable convergence behavior with respect to its performance scale. In contrast, MORL-PG attains a substantially lower Median HV of Inline graphic, with a DARs of Inline graphic, yielding a higher CV of approximately 0.25. The individual MORL-PG HV values exhibit noticeable variability, ranging from approximately Inline graphic to Inline graphic, reflecting sensitivity to preference selection and stochastic gradient updates. This higher relative dispersion highlights the difficulty of maintaining stable and consistent Pareto coverage when using preference-based scalarization and gradient-driven optimization in high-variance environments.
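As a quick arithmetic check of the reported coefficient of variation for NSGA-II, the sketch below divides the dispersion across runs by the median HV. It assumes that the NSGA-II entries in Table 2 are the same as those in Table 1 (median HV of 1.1201e+5 and DARs of 1.65e+4) and that the CV is computed as this ratio; both are assumptions, since the values are not legible in this section.

```python
def coefficient_of_variation(dispersion, median_hv):
    """Relative dispersion: dispersion across runs divided by the median HV."""
    return dispersion / median_hv


# NSGA-II values taken from Table 1 (assumed to match Table 2).
cv_nsga2 = coefficient_of_variation(1.65e4, 1.1201e5)
print(round(cv_nsga2, 2))  # 0.15
```

The result agrees with the CV of approximately 0.15 quoted for NSGA-II.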

Table 2.

Comparison between NSGA-II and MORL-PG in terms of hypervolume (HV). Results are reported as median and dispersion across runs, over 6 independent runs.

Algorithm Median HV DARs
NSGA-II Inline graphic Inline graphic
MORL-PG Inline graphic Inline graphic

These results indicate that while MORL-PG provides a meaningful and well-established gradient-based baseline for multi-objective reinforcement learning, it is substantially outperformed by the proposed neuroevolutionary approach in terms of both Pareto-front quality and robustness across runs. The observed performance gap can be attributed to fundamental differences between gradient-based preference learning and population-based neuroevolutionary search. In particular, MORL-PG relies on preference-based scalarization and stochastic policy-gradient updates, which can lead to sensitivity to weight selection and higher variability across runs, whereas the evolutionary framework directly maintains and evolves a diverse population of solutions, enabling more consistent exploration and coverage of the Pareto front under the same evaluation protocol. From a conceptual algorithmic perspective, preference-based MORL methods such as MORL-PG approximate the Pareto front by training multiple independent policies under different scalarization weights, each involving its own gradient-based optimization process. Due to this structure, MORL-PG may incur a higher training burden compared to population-based neuroevolutionary methods, which evolve a single population and explore multiple trade-offs jointly within one unified optimization process. We emphasize that this observation is intuitive based on the underlying optimization structure, rather than a quantitative claim about computational cost, and that the primary focus of this work is on comparative solution quality and Pareto-front approximation behavior in the multi-objective setting rather than empirical runtime efficiency.

In practical deployment, the learned Pareto front provides actionable policy choices that extend beyond purely algorithmic comparison. In particular, a knee-point solution can be selected to achieve a balanced trade-off between driving speed and lane preference while maintaining collision avoidance, making it suitable for general-purpose driving. Beyond knee-point selection, policies can be adaptively chosen based on driving context or user preference. For example, in leisure-oriented scenarios such as sightseeing, policies that prioritize right-lane driving may be preferred, whereas time-critical situations (e.g., catching a train) may favor speed-dominant policies. Importantly, all such task- or preference-driven policy selections can be made directly from the learned Pareto set without retraining, highlighting a key practical advantage of population-based Pareto front approximation for multi-objective autonomous driving.
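Selecting a policy from a stored Pareto set requires no retraining, only a lookup. The sketch below shows one possible way to do this, under several assumptions: objectives are min-max normalized across the front, a preference is expressed as a weight vector, and the knee point is approximated as the solution closest to the normalized ideal point (one of several knee definitions in the literature). The first two vectors are the reward vectors from Figs. 4 and 5; the third is hypothetical.

```python
def normalize(front):
    """Min-max scale each objective to [0, 1]; constant objectives map to 1.0."""
    lo = [min(col) for col in zip(*front)]
    hi = [max(col) for col in zip(*front)]
    return [[(x - l) / (h - l) if h > l else 1.0 for x, l, h in zip(v, lo, hi)]
            for v in front]


def select_by_preference(front, weights):
    """Index of the policy with the highest weighted normalized reward."""
    scores = [sum(w * x for w, x in zip(weights, v)) for v in normalize(front)]
    return max(range(len(front)), key=scores.__getitem__)


def knee_index(front):
    """Index of the policy closest to the ideal point (1, ..., 1) after scaling."""
    dists = [sum((1.0 - x) ** 2 for x in v) for v in normalize(front)]
    return min(range(len(front)), key=dists.__getitem__)


front = [
    [7.60, 487.0, 0.0],    # Fig. 4 policy: favors the right lane
    [52.80, 463.50, 0.0],  # Fig. 5 policy: favors speed
    [30.0, 480.0, 0.0],    # hypothetical balanced policy
]

print(select_by_preference(front, (0.1, 0.9, 0.0)))  # 0: sightseeing-style preference
print(select_by_preference(front, (0.9, 0.1, 0.0)))  # 1: time-critical preference
print(knee_index(front))                             # 2: balanced choice
```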

Limitations of the current study

While the present study focuses on the well-established Highway-env benchmark widely used in the autonomous driving and deep RL literature, we acknowledge that evaluation in higher-fidelity simulators would strengthen practical relevance. As future work, we plan to integrate NeuroAction with realistic driving platforms such as CARLA and SUMO, enabling assessment under complex sensor models, continuous-control dynamics, and more diverse traffic conditions. Additionally, in the current problem formulation, collision avoidance is treated as one of the objectives in the multi-objective optimization. In real-world autonomous driving scenarios, it is often more appropriate to model collision avoidance as a hard constraint rather than an objective, ensuring that all vehicles in the environment satisfy basic safety requirements. Reformulating collisions as a constraint would allow the framework to explicitly enforce safety while focusing the objectives on other performance metrics, such as efficiency or comfort. We plan to explore this constraint-based formulation in future work to improve the practical applicability and safety guarantees of NeuroAction.
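One common way to treat collisions as a hard constraint in evolutionary multi-objective optimization is a feasibility-first comparison, in which any collision-free policy is preferred over any colliding one and Pareto dominance is applied only among feasible policies. The sketch below illustrates that rule; it is not the formulation planned by the authors, and the reward values and collision counts are hypothetical.

```python
def pareto_dominates(a, b):
    """Standard Pareto dominance for reward vectors that are maximized."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def constrained_dominates(a, b):
    """Feasibility-first dominance; a and b are (rewards, collision_count) pairs."""
    rewards_a, collisions_a = a
    rewards_b, collisions_b = b
    if collisions_a == 0 and collisions_b > 0:
        return True   # a feasible policy beats an infeasible one
    if collisions_a > 0 and collisions_b == 0:
        return False
    if collisions_a > 0 and collisions_b > 0:
        return collisions_a < collisions_b  # smaller violation wins
    return pareto_dominates(rewards_a, rewards_b)  # both feasible


# Hypothetical policies: (speed reward, right-lane reward), collision count.
safe_slow = ([20.0, 400.0], 0)
fast_unsafe = ([60.0, 480.0], 1)

print(constrained_dominates(safe_slow, fast_unsafe))  # True
print(constrained_dominates(fast_unsafe, safe_slow))  # False
```

Under this rule, a policy that collides cannot remain on the final front once any collision-free policy exists, regardless of how well it scores on the remaining objectives.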

Conclusion and future works

This work presents a neuroevolutionary approach to reinforcement learning-based autonomous vehicles capable of optimizing multiple driving goals simultaneously. In the approach, policy search or optimization in DRL is transformed into a multi-objective black-box optimization problem. Consequently, multi-objective evolutionary algorithms can be used to optimize the parameters of the black box (the policy network) to realize different policies or agents that provide a trade-off between the different driving objectives. Such trade-off policies are important because they can be leveraged to facilitate user-based or scenario-based autonomous driving. Results based on five different multi-objective algorithms demonstrate the suitability of the approach and also show that performance depends on the selected algorithm.

In the future, designing dedicated multi-objective large-scale expensive optimization algorithms could be of interest, as the resulting optimization problems are computationally challenging. Additionally, incorporating other meaningful objectives not currently captured in the DRL framework, such as those related to lane changes, could enhance the diversity and practical relevance of the resulting policies. Finally, exploring hybrid approaches that combine evolutionary search with gradient-based refinement may further improve the quality and efficiency of MO-DRL policy generation.

Acknowledgements

This research was supported by the Core Research Institute Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Korea (RS-2021-NR060127).

Author contributions

Esther Aboyeji: Conceptualization, Methodology, Software, Writing - Original draft preparation, Writing - Review & Editing. Oladayo S. Ajani: Conceptualization, Methodology, Software, Writing - Original draft preparation, Writing - Review & Editing. Ivan Fenyom: Software, Writing - Review & Editing. Rammohan Mallipeddi: Supervision, Resources, Validation, Writing - Review & Editing.

Funding

This work was supported by the Core Research Institute Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Korea, under grant number RS-2021-NR060127.

Data availability

No datasets were generated or analysed during the current study.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Oladayo S. Ajani, Email: oladayosolomon@gmail.com

Rammohan Mallipeddi, Email: mallipeddi.ram@gmail.com.

References

  • 1. Ravi Kiran, B. et al. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 23(6), 4909–4926 (2022).
  • 2. Jingda, W. et al. Recent advances in reinforcement learning-based autonomous driving behavior planning: A survey. Transp. Res. C Emerg. Technol. 164, 104654 (2024).
  • 3. Lin-Chi, W., Zhang, Z., Haesaert, S., Ma, Z. & Sun, Z. Risk-aware reward shaping of reinforcement learning agents for autonomous driving. In IECON 2023 - 49th Annual Conference of the IEEE Industrial Electronics Society 1–6 (2023).
  • 4. Ahmed, A., Jonas, M. & Johann Marius, Z. A review of reward functions for reinforcement learning in the context of autonomous driving. In 2024 IEEE Intelligent Vehicles Symposium (IV) 156–163 (2024).
  • 5. George, P., Dawn Xiaodong, S. & Jaime G. C. The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions. arXiv: Learning (2017).
  • 6. Tim, S., Jonathan, H., Xi, C. & Ilya, S. Evolution strategies as a scalable alternative to reinforcement learning (2017).
  • 7. Majid, A. Y., Saaybi, S., Vincent Francois-Lavet, R., Prasad, V. & Verhoeven, C. Deep reinforcement learning versus evolution strategies: A comparative survey. IEEE Trans. Neural Netw. Learn. Syst. 35(9), 11939–11957 (2024).
  • 8. Felipe Petroski, S., Vashisht, M., Edoardo, C., Joel, L., Kenneth O. S. & Jeff, C. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv:1712.06567 (2017).
  • 9. Ajani, O. S., Kumar, A. & Mallipeddi, R. Covariance matrix adaptation evolution strategy based on correlated evolution paths with application to reinforcement learning. Expert Syst. Appl. 123289 (2024).
  • 10. Ajani, O. S. & Mallipeddi, R. Adaptive evolution strategy with ensemble of mutations for reinforcement learning. Knowl.-Based Syst. 245, 108624 (2022).
  • 11. Bai, H., Cheng, R. & Jin, Y. Evolutionary reinforcement learning: A survey. Intell. Computing 2, 0025 (2023).
  • 12. Abdel-Aty, M. & Ding, S. A matched case-control analysis of autonomous vs human-driven vehicle accidents. Nat. Commun. 15(1), 1–12 (2024).
  • 13. Changjian, L. & Krzysztof, C. Urban driving with multi-objective deep reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19 359–367 (International Foundation for Autonomous Agents and Multiagent Systems, 2019).
  • 14. Kalyanmoy, D. Multi-Objective Optimization Using Evolutionary Algorithms (John Wiley & Sons Inc, 2001).
  • 15. Wang, T., Peng, X., Wang, T., Liu, T. & Demin, X. Automated design of action advising trigger conditions for multiagent reinforcement learning: A genetic programming-based approach. Swarm Evol. Comput. 85, 101475 (2024).
  • 16. Ma, X. et al. Multiobjectivization of single-objective optimization in evolutionary computation: A survey. IEEE Trans. Cybern. 53(6), 3702–3715 (2023).
  • 17. Conor F. H., Roxana, R., Eugenio, B., Johan, K., Matthew, M., Mathieu, R., Timothy, V., Luisa M. Z., Richard, D., & Fredrik H., et al. A practical guide to multi-objective reinforcement learning and planning. arXiv:2103.09568 (2021).
  • 18. Zhou, A. et al. Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm Evol. Comput. 1(1), 32–49 (2011).
  • 19. Hui, B., Ran, C. & Yaochu, J. Evolutionary reinforcement learning: A survey. https://arXiv.org/abs/2303.04150 (2023).
  • 20. Edouard, L. An environment for autonomous driving decision-making. https://github.com/eleurent/highway-env (2018).
  • 21. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. OpenAI Gym (2016).
  • 22. Raffin, A. et al. Stable-baselines3: Reliable reinforcement learning implementations. J. Mach. Learn. Res. 22(268), 1–8 (2021).
  • 23. Philip, P., Florent, A., Brigitte d’Andréa, N. & Arnaud de La F. The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles? In 2017 IEEE Intelligent Vehicles Symposium (IV) 812–818 (2017).
  • 24. Treiber, Hennecke, & Helbing. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdisc. Top. 62(2 Pt A), 1805–24 (2000).
  • 25. Kesting, A., Treiber, M. & Helbing, D. General lane-changing model MOBIL for car-following models. Transp. Res. Rec. 1999(1), 86–94 (2007).
  • 26. Jurecki, R. S. An analysis of collision avoidance manoeuvres in emergency traffic situations. Arch. Auton. Eng.-Archiwum Motoryzacji 72, 73–93 (2016).
  • 27. Hwasoo, Yeo., Kitae, Jang., & Alexander, Skabardonis. Impact of traffic states on freeway collision frequency. In Transportation Research Board (TRB) 89th Annual Meeting (2010).
  • 28. Li, Z., Wang, W., Chen, R., Liu, P. & Cheng-Xian, X. Evaluation of the impacts of speed variation on freeway traffic collisions in various traffic states. Traffic Inj. Prev. 14, 861–866 (2013).
  • 29. Xiaochang, C., Jieqiang, W., Xiaoqiang, R., Karl H. J. & Xiaofan W. Automatic overtaking on two-way roads with vehicle interactions based on proximal policy optimization. In 2021 IEEE Intelligent Vehicles Symposium (IV) 1057–1064 (2021).
  • 30. Edouard, L., Denis, E., Tarek, R. & Wilfrid, P. Interval prediction for continuous-time systems with parametric uncertainties. 7049–7054 (2019).
  • 31. Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS’95 1038–1044 (MIT Press, 1995).
  • 32. Sharma, S. & Kumar, V. A comprehensive review on multi-objective optimization techniques: Past, present and future. Arch. Computat. Methods Eng. 29, 5605–5633 (2022).
  • 33. Bingdong, L., Jinlong, L., Ke, T. & Xin, Y. Many-objective evolutionary algorithms: A survey. ACM Comput. Surv. 48(1) (2015).
  • 34. Deb, K., Pratap, A., Agarwal, S. & Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002).
  • 35. Zhang, Q. & Li, H. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007).
  • 36. Bader, J. & Zitzler, E. HypE: An algorithm for fast hypervolume-based many-objective optimization. Evol. Comput. 19(1), 45–76 (2011).
  • 37. Guerreiro, A. P., Fonseca, C. M. & Paquete, L. The hypervolume indicator: Computational problems and algorithms. ACM Comput. Surv. 54(6) (2021).
  • 38. Stephanie C. Y. Chan., Sam, Fishman., John F. Canny., Anoop Korattikara, Balan., & Sergio, Guadarrama. Measuring the reliability of reinforcement learning algorithms. https://arXiv.org/abs/1912.05663 (2019).
  • 39. Axel, Abels., Diederik, Roijers., Tom, Lenaerts., Ann, Nowé., & Denis, Steckelmacher. Dynamic weights in multi-objective deep reinforcement learning. In International Conference on Machine Learning 11–20 (PMLR, 2019).
  • 40. Diederik, M. R., Peter, V., Shimon, W. & Richard, D. A survey of multi-objective sequential decision-making. J. Artif. Intell. Res. 48, 67–113 (2013).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
