Scientific Reports. 2026 Jan 21;16:3041. doi: 10.1038/s41598-025-33218-w

A proposed reinforcement learning approach via discrete control reformulation and multi-step double DQN for adaptive cruise control in electric vehicles

Assem Meghawer 1,2, Yasser El-Shaer 1, Mohamed I Abu El-Sebah 3, Mohamed Fawzy El-Khatib 4
PMCID: PMC12828003  PMID: 41565773

Abstract

Adaptive cruise control (ACC) plays a critical role in enhancing road safety and energy efficiency in electric vehicles (EVs). Traditional ACC approaches often face challenges in adapting to complex, dynamic driving environments. AI-driven reinforcement learning (RL) has emerged as a promising solution; however, its real-world adoption faces key challenges, including training stability, convergence speed, and robustness in diverse scenarios. This work reformulates the ACC control structure using a simplified action abstraction that unifies throttle and brake into a single scalar variable within a discrete action space. This design enables smooth, human-like driving behavior while allowing the use of simpler and more stable Deep Q-Network (DQN) variants. Building on this design, we integrate multi-step returns with a Double DQN architecture (Double-MS DQN) to accelerate convergence and enhance policy stability. A stochastic scenario generator is also implemented to expose the agent to varied and unpredictable lead-vehicle behaviors during training and evaluation. Experiments conducted in the CARLA simulator show that the proposed approach achieves significantly faster convergence (up to 73% reduction in training episodes) and reduces headway errors by over 40% compared to standard DQN, Dueling DQN, and Double DQN. The proposed Double-MS DQN demonstrates that adapting the RL control formulation enables high-performance learning with lightweight, scalable algorithms, delivering safer and smoother control in practice.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-33218-w.

Keywords: Adaptive cruise control, Electric vehicles, Reinforcement learning, Deep Q-Network, Double deep Q-Network, Multi-Scenario testing

Subject terms: Engineering, Mathematics and computing

Introduction

ACC is a crucial driver-assistance feature that automatically adjusts a vehicle’s speed to maintain a safe distance from preceding vehicles, thereby enhancing driving comfort and reducing driver workload1,2. With the increasing adoption of EVs, ACC plays a critical role not only in safety but also in promoting energy-efficient driving behaviors3,4. EVs, emphasizing regenerative braking and smooth longitudinal control, necessitate sophisticated ACC designs that integrate closely with their unique dynamic characteristics4,5.

Traditional ACC methods, primarily proportional-integral-derivative (PID) controllers and model predictive control (MPC), have been extensively studied6–8. PID controllers offer simplicity and robustness but often lack adaptability to nonlinear or rapidly changing environments8,9. MPC, although capable of handling multi-variable constraints, demands precise modeling and considerable computational resources, restricting its real-time applicability in dynamic urban traffic conditions6,9. Both PID and MPC typically rely on fixed-rule optimizations, limiting their adaptability to uncertainties and stochastic real-world driving behaviors10,11. Thus, the exploration of adaptive solutions, such as RL, that autonomously learn optimal behaviors has become increasingly essential.

Related work and motivation

AI-driven RL has emerged as a powerful control methodology due to its ability to learn optimal policies directly from environmental interactions without explicit modeling12–14. The introduction of Deep Q-Networks (DQN) by Mnih et al.14 demonstrated human-level performance on complex tasks, making RL a promising candidate for ACC systems. Subsequent developments, including Double Q-Learning15 to address Q-value overestimations and dueling network architectures16 to separate state-value and advantage functions, significantly improved RL efficiency and stability. Studies adapting these innovations to vehicular control have demonstrated improved safety margins and smoother control compared to traditional methods3,17. Despite these advancements, existing RL-based ACC studies frequently focus on limited scenarios, evaluating agents predominantly in static or simplified conditions3,10,11. Real-world driving requires generalized capabilities across diverse traffic scenarios and emergency conditions, highlighting the necessity of comprehensive multi-scenario evaluations for robust ACC implementations.

Recent research efforts have introduced advanced methodologies addressing specific deployment-phase aspects of ACC. Notably, the safety-first reinforcement learning adaptive cruise control (SFRL-ACC) method explicitly incorporates safety constraints within a constrained Markov decision process (CMDP) framework to ensure safe driving policies under strict safety requirements18. Similarly, the large language model guided deep reinforcement learning (LGDRL) approach employs large language models to guide and enhance DRL training efficiency, demonstrating significant improvements in task success rates and policy generalization in autonomous driving scenarios19. The vision-language model reinforcement learning (VLM-RL) method leverages semantic reward signals derived from vision-language models, achieving marked reductions in collision rates and substantial improvements in route completion metrics20. However, these state-of-the-art methodologies predominantly focus on optimizing final deployment-phase safety and performance, with limited attention explicitly directed towards enhancing the RL training process itself, particularly convergence speed, training stability, and robustness.

The efficiency and stability of the RL training phase are crucial yet often underexplored aspects in developing practical ACC solutions. Faster convergence and stable training reduce computational demands and training time, significantly enhancing scalability and deployment feasibility15,21. Furthermore, robust training methods that produce consistent and reliable policies directly influence real-world applicability, safety, and passenger comfort3,22. Despite their critical importance, studies explicitly targeting RL training efficiency and stability within ACC contexts remain sparse. To bridge this crucial gap, explicit enhancements focusing on the training phase, such as multi-step return techniques integrated with Double DQN, are necessary.

While the foundational methods (DQN, Double DQN, Dueling DQN) provide a strong baseline, other research avenues have explored continuous-action algorithms. Methods such as Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3) have been applied to complex vehicular control and real-time energy management tasks, offering finer granularity and handling hybrid discrete-continuous action spaces23,24. However, these continuous methods can introduce higher sample complexity and tuning sensitivity, motivating our focus on optimizing the efficiency of simpler, more stable DQN-based approaches. Furthermore, the scope of this study is longitudinal control; emerging research is tackling broader challenges, including multi-agent (MARL) coordination in connected environments25, as well as frameworks for safe and energy-efficient control under strict physical constraints26.

Contributions and paper organization

To address these critical gaps, this paper proposes a reinforcement learning enhancement for ACC: the Multi-Step Return Enhanced Double DQN (Double-MS). By incorporating multi-step return techniques into the Double DQN framework, the proposed agent accelerates learning, improves reward estimation, and promotes smoother policy convergence. This enhancement is systematically evaluated through rigorous multi-scenario testing in the CARLA simulator27, encompassing diverse driving behaviors and randomized lead vehicle profiles to test generalization and robustness. The CARLA simulator is widely recognized for its high-fidelity urban environments, realistic vehicle dynamics, and flexibility in scenario customization, making it a preferred platform for autonomous driving research. CARLA’s synchronous operation mode provides precise control over simulation steps, essential for reproducible RL training and testing27. Its capabilities enable rigorous multi-scenario evaluations necessary for assessing RL-based ACC methods under diverse, dynamic conditions. Consequently, CARLA was selected as the simulation platform to ensure high experimental control, reproducibility, and realism for evaluating the proposed Multi-Step Return Enhanced Double DQN approach.

The main contributions of this study are summarized as follows:

  • Integration and rigorous evaluation of a Multi-Step Return Enhanced Double DQN (Double-MS) agent. We demonstrate how this integration significantly accelerates training for the ACC problem, addressing a key gap in practical RL deployment.

  • A simplified discrete action control scheme that unifies throttle and brake, which we show is sufficient for high-performance control and enables the use of more stable, value-based DQN methods.

  • Formulation of a unified control signal representing both throttle and brake, reducing action complexity, avoiding conflicting actuation, and aligning with human driving behavior.

  • Implementation of a stochastic scenario generator to simulate diverse lead-vehicle behaviors during training and evaluation, supporting robust generalization and realistic safety validation.

  • Quantitative evaluation demonstrates that the proposed agent outperforms standard DQN, Double DQN, and Dueling DQN in training speed, tracking accuracy, control smoothness, and safety, with all comparisons supported by structured logging and clear convergence tracking.

The rest of this paper is organized as follows: Sect. 2 presents the methodology, including vehicle dynamics modeling, reinforcement learning formulation, and algorithmic design. Section 3 details the implementation and experimental setup. Section 4 reports and analyzes the experimental results, while Sect. 5 provides a broader discussion. Finally, Sect. 6 concludes the paper and outlines future research directions.

Methodology

The core objective of this research is to develop and evaluate an adaptive and intelligent longitudinal control approach for EVs using advanced reinforcement learning techniques. Our methodology strategically integrates rigorous physical vehicle dynamics modeling with a sophisticated learning-based decision-making framework. Specifically, we employ Deep Q-Learning algorithms to formulate and solve the ACC problem, effectively addressing real-world challenges such as vehicle-following safety, control smoothness, and robustness against dynamic driving scenarios. The methodology begins by clearly defining the fundamental vehicle dynamics necessary for accurately simulating longitudinal vehicle behavior. Subsequently, we formulate the control task explicitly as a Markov Decision Process (MDP), carefully selecting state representations, action abstractions, and reward functions to ensure stable and efficient learning. Finally, we introduce our primary algorithmic innovation, the integration of Multi-Step Returns into a Double Deep Q-Network (Double-MS DQN) architecture, which significantly enhances learning stability, convergence speed, and overall policy quality.

Vehicle dynamics modeling

Accurately modeling vehicle dynamics is crucial for effective ACC implementations, especially in simulation-based evaluations. This research uses a simplified yet robust longitudinal vehicle dynamics model that sufficiently captures the necessary characteristics for evaluating ACC systems within realistic simulation conditions. The model employs fundamental kinematic equations to simulate the longitudinal motion of both ego and lead vehicles. The position and velocity of each vehicle at the next time step $t+1$ are calculated based on their current state and applied control inputs (throttle and brake). This modeling approach is widely adopted in ACC and longitudinal dynamics studies28. Specifically, the equations governing vehicle position $x_t$ and velocity $v_t$ updates are expressed as follows:

$x_{t+1} = x_t + v_t\,\Delta t \quad (1)$
$v_{t+1} = v_t + a_t\,\Delta t \quad (2)$

where $a_t$ is the vehicle's longitudinal acceleration at time $t$, and $\Delta t$ is the fixed simulation time step, set explicitly to 0.01 s in this research, aligning with the CARLA simulator settings used in our experiments. To ensure realistic and safe following behavior, especially considering Electric Vehicle (EV) dynamics, we estimate the required safe distance using a widely adopted model based on time headway principles:

$d_{safe} = t_h\,\tilde{v}_{ego} + d_0 \quad (3)$

In this formulation, $t_h$ is the time headway (typically 2 s), representing the minimum reaction time buffer between vehicles; $d_0$ is a fixed safety margin (in meters) added to account for braking uncertainty, actuation delay, or sensor noise; and $\tilde{v}_{ego}$ is a smoothed ego velocity used instead of the raw speed to suppress high-frequency jitter and avoid unstable safe-distance estimation. This model is widely used in both industrial and research-grade ACC systems6,25 due to its simplicity, interpretability, and adaptability across different driving conditions. It scales linearly with vehicle speed, ensuring that the safe distance increases appropriately at higher velocities, a crucial property for preventing rear-end collisions in highway and urban driving alike. By combining time-based spacing with a fixed spatial buffer, this approach balances responsiveness and robustness, making it well-suited for simulation-based training and deployment in real-world electric vehicle platforms.

To ensure that the safe distance estimation remains stable and resistant to noise-induced fluctuations, we avoid relying on the raw ego vehicle speed and instead propose the use of a smoothed velocity signal. This smoothed value, denoted as $\tilde{v}_t$, is computed using an exponential moving average (EMA) that recursively blends past and present measurements:

$\tilde{v}_t = \alpha\,v_t + (1 - \alpha)\,\tilde{v}_{t-1} \quad (4)$

where $\alpha \in (0, 1]$ is the smoothing factor. While the EMA formulation is well-established in vehicle control and signal processing28, its integration into the safe distance estimation in this work serves as a practical enhancement that significantly improves stability under fluctuating speed conditions.

This design choice addresses a key limitation of raw speed signals: while they respond rapidly to control changes, they are also prone to transient oscillations and momentary spikes, particularly during frequent throttle or brake adjustments. If used directly in safety-critical calculations such as headway estimation, these fluctuations may trigger unnecessary or unstable control actions. By smoothing the velocity signal, we provide a more stable and reliable basis for computing the target following distance, reducing overreaction and promoting smoother control behavior.

Moreover, since the reward function is directly shaped by the agent’s ability to maintain the safe distance, stabilizing the reference signal improves reward consistency and accelerates policy learning. Overall, this simple yet effective enhancement contributes to both training convergence and the physical plausibility of the resulting driving behavior.
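
To make the preceding formulation concrete, the following Python sketch implements the update rules of Eqs. (1)–(4). The time step matches the stated 0.01 s and the headway matches the stated 2 s; the smoothing factor and safety margin values are illustrative assumptions, not the tuned settings used in the experiments.

```python
# Minimal sketch of the longitudinal model in Eqs. (1)-(4).
DT = 0.01    # simulation step (s), matching the CARLA setting
ALPHA = 0.1  # EMA smoothing factor (assumed)
T_H = 2.0    # time headway (s), as stated in the text
D_0 = 5.0    # fixed safety margin (m, assumed)

def step_vehicle(x: float, v: float, a: float, dt: float = DT):
    """Euler update of position and velocity, Eqs. (1)-(2)."""
    return x + v * dt, v + a * dt

def smooth_velocity(v_prev_smooth: float, v_raw: float, alpha: float = ALPHA):
    """Exponential moving average of the ego speed, Eq. (4)."""
    return alpha * v_raw + (1.0 - alpha) * v_prev_smooth

def safe_distance(v_smooth: float, t_h: float = T_H, d_0: float = D_0):
    """Time-headway-based safe-following distance, Eq. (3)."""
    return t_h * v_smooth + d_0
```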

Reinforcement learning formulation

To effectively address the dynamic and uncertain nature of adaptive cruise control scenarios, this research formulates the longitudinal vehicle control task explicitly as a Markov Decision Process (MDP). Reinforcement learning (RL) is particularly suitable for adaptive cruise control tasks, as it naturally supports sequential decision-making under uncertainty, delayed feedback, and dynamic environmental conditions typical in driving applications14,29. Formally, the MDP is defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of available actions, $\mathcal{P}$ denotes the transition dynamics, $\mathcal{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor that prioritizes immediate versus future rewards29. The primary objective in this formulation is to learn an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward over an infinite horizon:

$\pi^* = \arg\max_{\pi}\,\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right] \quad (5)$

where $r_t$ represents the immediate reward at time step $t$. Acceleration and braking choices impact not only immediate comfort and safety, but also long-term stability and efficiency — making RL an ideal fit for optimizing such behavior. To support effective policy learning, it is critical that the agent’s observations capture all the relevant variables that define the driving context and influence future outcomes. Accordingly, we encode the agent’s perception of its environment as a continuous seven-dimensional state vector, carefully constructed to reflect the key dynamics of safe following and smooth control. At each time step $t$, the observed state is defined as:

$s_t = \left[\,v_{ego},\; a_{ego},\; d_{rel},\; \Delta v,\; \Delta a,\; d_{safe},\; u_{t-1}\,\right] \quad (6)$

where $v_{ego}$ and $a_{ego}$ are the ego speed and acceleration, $d_{rel}$ is the relative distance to the lead vehicle, and $\Delta v$ and $\Delta a$ represent the relative speed and acceleration between the two vehicles, essential for anticipating and reacting to lead-vehicle behavior. $d_{safe}$, introduced earlier, provides a continuously updated estimate of the minimum safe distance, incorporating both speed and temporal headway. Finally, $u_{t-1}$ denotes the previously applied control value, included to help the agent reason about temporal consistency and penalize abrupt control changes. This compact yet expressive state representation allows the agent to infer both its absolute motion and the dynamic interaction context, enabling it to learn context-aware control strategies that prioritize safety, comfort, and responsiveness. Figure 1 illustrates the overall interaction between the reinforcement learning agent and the simulated CARLA environment. It shows how the agent receives seven-dimensional state observations from the environment, processes them through a neural network, and selects one of three discrete actions. The figure also visualizes how the agent processes the observed state vector, chooses an action, and generates vehicle control commands, forming a closed control and feedback loop.

Fig. 1. Overview of the reinforcement learning framework.

This diagram highlights the end-to-end loop that governs learning: from input features to action selection, through environment feedback and reward collection. It also emphasizes the unified control mapping used to simplify the throttle-brake decision process and stabilize the learning dynamics.
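
For illustration, the following minimal sketch shows how the seven-dimensional observation of Eq. (6) could be assembled from simulator readings; the container type and field names are assumptions for this sketch, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class VehicleState:
    position: float      # longitudinal position (m)
    velocity: float      # speed (m/s)
    acceleration: float  # acceleration (m/s^2)

def build_state(ego: VehicleState, lead: VehicleState,
                d_safe: float, u_prev: float) -> list:
    """Assemble the seven-dimensional observation of Eq. (6)."""
    d_rel = lead.position - ego.position       # gap to lead vehicle
    dv = lead.velocity - ego.velocity          # relative speed
    da = lead.acceleration - ego.acceleration  # relative acceleration
    return [ego.velocity, ego.acceleration, d_rel, dv, da, d_safe, u_prev]
```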

A key innovation in our methodology lies in the design of a hybrid control abstraction that balances simplicity in learning with the realism required for smooth vehicle operation. Rather than directly learning continuous throttle and brake signals, which would necessitate more complex actor-critic architectures and increase model instability, we formulate control as a single scalar variable $u$ and define the agent’s action space over three discrete choices: increase, decrease, or maintain the control value. Specifically, the agent selects from: Action 0: increase control ($u \leftarrow u + \Delta u$), Action 1: decrease control ($u \leftarrow u - \Delta u$), and Action 2: maintain control ($u$ unchanged). The resulting control value is interpreted based on its sign: when $u > 0$, it is applied as a throttle command; when $u < 0$, it maps to braking with magnitude $|u|$; and when $u = 0$, the vehicle coasts. This design naturally mirrors how human drivers operate a vehicle: people rarely apply throttle and brake simultaneously, and their inputs tend to be gradual and sequential, not abrupt. A driver increasing speed will apply throttle in small steps, and slowing down involves a gradual release before braking, passing through intermediate values. By embedding this progression into the action space, our design ensures that the learned control behavior is not only technically stable but also behaviorally plausible.

Moreover, unifying throttle and brake into a single latent control dimension significantly reduces the complexity of the action representation, avoiding the need to learn coordination between two separate outputs. This simplification enables the effective use of discrete-action deep reinforcement learning algorithms such as DQN and its variants, which are computationally efficient, easier to train, and more robust compared to their continuous-action counterparts. Despite this simplicity, the controller retains the ability to produce fine-grained, adaptive behavior while inherently minimizing control noise and oscillations, ultimately improving both training performance and real-world applicability.
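
The sketch below illustrates the unified control abstraction described above; the increment size and the [-1, 1] clipping range are assumptions chosen for illustration, as the paper does not report the exact step value.

```python
DELTA_U = 0.05  # control increment per action (assumed)

def apply_action(u: float, action: int) -> float:
    """Update the unified control value u from a discrete action."""
    if action == 0:                # Action 0: increase control
        u += DELTA_U
    elif action == 1:              # Action 1: decrease control
        u -= DELTA_U
    # Action 2: maintain control (no change)
    return max(-1.0, min(1.0, u))  # keep u within [-1, 1] (assumed range)

def to_vehicle_command(u: float):
    """Interpret the sign of u as throttle, brake, or coasting."""
    if u > 0.0:
        return {"throttle": u, "brake": 0.0}   # throttle command
    if u < 0.0:
        return {"throttle": 0.0, "brake": -u}  # brake with magnitude |u|
    return {"throttle": 0.0, "brake": 0.0}     # coast
```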

To guide the agent’s behavior toward safe, efficient, and comfortable driving, we design a multi-objective reward function that balances several key performance indicators: progress, safety, control smoothness, and collision avoidance. At each timestep $t$, the agent receives a scalar reward $r_t$ computed as:

$r_t = w_p\,r_{prog} - w_d\,\lvert d_{rel} - d_{safe}\rvert - w_u\,\lvert u_t - u_{t-1}\rvert - w_c\,\mathbb{1}_{col} \quad (7)$

where $r_{prog}$ is the reward for making progress, $\lvert d_{rel} - d_{safe}\rvert$ penalizes distance error, $\lvert u_t - u_{t-1}\rvert$ penalizes sudden control changes, $\mathbb{1}_{col}$ is an indicator function active when $d_{rel}$ falls under a certain threshold and carries a large penalty for collision-like proximity, and $w_p, w_d, w_u, w_c$ are the corresponding weights. This reward structure reflects a carefully tuned compromise: it motivates the agent to move forward efficiently when appropriate yet imposes significant penalties for unsafe proximity or erratic behavior. Importantly, the control variation penalty $w_u\lvert u_t - u_{t-1}\rvert$ plays a dual role: it not only encourages smooth driving but also reduces unnecessary actuation, which in electric vehicles translates to lower energy consumption and reduced battery strain. By integrating these elements into a unified reward signal, we ensure that the learned policy is not only safe and high-performing but also behaviorally realistic and energy-conscious, mimicking how human drivers aim to minimize constant throttle/brake toggling. The coefficients $(w_p, w_d, w_u, w_c)$ were determined through a coarse-to-fine grid search procedure, following established guidelines for multi-objective reward shaping29. The safety penalty $w_c$ was first prioritized and set to a higher magnitude to ensure collision avoidance remained the dominant hard constraint. Once safety was established, the smoothness penalty $w_u$ and progress reward $w_p$ were fine-tuned to maximize velocity tracking without inducing oscillations. The final values were selected based on the configuration that yielded the highest stable cumulative reward during validation trials. A qualitative ablation study summarizing the impact of each component is presented in Table 1, followed by a reward-computation sketch. Having defined the reward function and control abstraction, we now describe the learning architecture used to approximate the optimal policy.

Table 1.

Qualitative ablation study of reward function components.

Component Purpose Consequence if removed
$w_p\,r_{prog}$ (progress term) Efficiency: Encourages progress Agent becomes “passive” and learns a suboptimal policy that is safe but fails to move efficiently.
$w_d\,\lvert d_{rel} - d_{safe}\rvert$ (distance-error penalty) Primary Goal: Maintain safe distance Catastrophic Failure: Agent fails to learn the primary task; behavior is unsafe and erratic.
$w_u\,\lvert u_t - u_{t-1}\rvert$ (smoothness penalty) Secondary Goal: Reduce jerk & EV energy use Agent learns to drive, but with high-frequency, “jerky” control actions (similar to the high control std. in baseline DQNs).
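
To make the reward design concrete, a minimal sketch of Eq. (7) follows. The weight values and the proximity threshold are placeholders: the paper tuned the coefficients by grid search but does not list the final numbers, and the progress term is assumed here to be proportional to ego speed.

```python
# Placeholder weights (w_p, w_d, w_u, w_c); actual values are not reported.
W_P, W_D, W_U, W_C = 1.0, 0.5, 0.2, 100.0
D_COL = 2.0  # collision-like proximity threshold (m, assumed)

def compute_reward(v_ego: float, d_rel: float, d_safe: float,
                   u: float, u_prev: float) -> float:
    """Multi-objective reward of Eq. (7)."""
    r_prog = v_ego                    # progress term (assumed form)
    e_dist = abs(d_rel - d_safe)      # headway-tracking error
    e_ctrl = abs(u - u_prev)          # control-variation penalty
    collided = 1.0 if d_rel < D_COL else 0.0
    return W_P * r_prog - W_D * e_dist - W_U * e_ctrl - W_C * collided
```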

Reinforcement learning algorithms implementation

In this study, we comprehensively implemented and analyzed three prominent Deep Q-Learning algorithms (Standard DQN, Double DQN, and Dueling DQN) to establish a strong performance baseline for ACC in EVs. Each algorithm introduces specific improvements aiming to address learning instability, overestimation bias, and action-value learning efficiency, all of which are critical for the development of robust longitudinal control strategies under realistic traffic scenarios.

The Standard DQN, originally introduced by Mnih et al.14, represents the foundational method for value-based deep reinforcement learning. It approximates the Q-function $Q(s, a; \theta)$ using a neural network and updates it through single-step temporal difference learning. The target value in Standard DQN is defined as:

$y_t^{\mathrm{DQN}} = r_t + \gamma\,\max_{a'} Q(s_{t+1}, a'; \theta^{-}) \quad (8)$

where $\gamma$ is the discount factor controlling the trade-off between immediate and future rewards, and $\theta^{-}$ denotes the parameters of the periodically updated target network. Although DQN demonstrated remarkable success in complex control tasks, it suffers from the well-documented issue of Q-value overestimation, which can destabilize learning and degrade policy performance, especially in dynamic and stochastic environments like real-world driving14.

To address this limitation, we implemented the Double DQN algorithm proposed by Van Hasselt et al.15, which decouples the action selection and action evaluation steps when computing the target value. The Double DQN target is given by:

$y_t^{\mathrm{Double}} = r_t + \gamma\,Q\!\left(s_{t+1},\, \arg\max_{a'} Q(s_{t+1}, a'; \theta);\; \theta^{-}\right) \quad (9)$

By using the online network $\theta$ to select the best next action and the target network $\theta^{-}$ to evaluate it, Double DQN significantly reduces overoptimistic value estimates, leading to more stable and reliable learning.
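
The distinction between the two targets is easiest to see in code. The sketch below assumes q_net and target_net are PyTorch modules mapping a batch of states to per-action Q-values; it is illustrative, not the exact training code.

```python
import torch

# Assumed shapes: next_states (B, 7), rewards (B,), Q-values (B, 3).
@torch.no_grad()
def dqn_target(target_net, rewards, next_states, gamma=0.99):
    """Eq. (8): the target network both selects and evaluates the action."""
    next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q

@torch.no_grad()
def double_dqn_target(q_net, target_net, rewards, next_states, gamma=0.99):
    """Eq. (9): the online network selects, the target network evaluates."""
    best_a = q_net(next_states).argmax(dim=1, keepdim=True)
    next_q = target_net(next_states).gather(1, best_a).squeeze(1)
    return rewards + gamma * next_q
```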

Recognizing the potential for further improvements in value estimation efficiency, we also implemented the Dueling DQN architecture introduced by Wang et al.16. Dueling DQN decomposes the Q-function into two separate estimators: a state-value function $V(s)$ and an advantage function $A(s, a)$, combined as:

$Q(s, a; \theta) = V(s; \theta) + \left(A(s, a; \theta) - \frac{1}{\lvert\mathcal{A}\rvert}\sum_{a'} A(s, a'; \theta)\right) \quad (10)$

This architectural innovation allows the network to explicitly learn how good it is to be in a given state independently of the specific action taken. Such a structure improves learning efficiency, especially in states where actions have similar outcomes, a property highly relevant for ACC tasks where following control can often offer multiple equally safe options.
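
A minimal PyTorch sketch of a dueling head implementing Eq. (10) follows; the 256-unit hidden layers match the architecture described later in this paper, while the remaining details are illustrative.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head per Eq. (10): 7-D state, 3 discrete actions."""
    def __init__(self, state_dim: int = 7, n_actions: int = 3, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        # Mean-subtracted advantage for identifiability, as in Eq. (10).
        return v + a - a.mean(dim=1, keepdim=True)
```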

Each algorithm was carefully trained and tested under identical simulation conditions using the CARLA simulator, with hyperparameters selected based on best practices in the reinforcement learning literature29. By systematically analyzing their convergence behavior, policy stability, control smoothness, and safe distance maintenance capabilities, we established a detailed comparative foundation.

The empirical performance of these three algorithms, analyzed later in the Results section, exposed their respective limitations in terms of convergence speed, reward variance, and headway tracking consistency. These insights directly motivated the design of our proposed algorithm, which aims to combine the benefits of Double DQN’s stability with enhanced temporal horizon awareness.

Thus, the comparative study of Standard DQN, Double DQN, and Dueling DQN is not merely a background exercise but a central component of our methodology, providing critical validation and justification for advancing toward an improved, more capable reinforcement learning framework for adaptive cruise control.

Proposed multi-step return enhanced double DQN (Double-MS DQN)

Building upon the detailed understanding acquired from the baseline algorithm implementations, this work proposes a reinforcement learning architecture specifically tailored for adaptive cruise control tasks: the Multi-Step Return Enhanced Double DQN (Double-MS DQN). The core innovation of the Double-MS DQN lies in integrating multi-step return techniques into the Double DQN framework, explicitly addressing the temporal limitations inherent in single-step learning.

While Double DQN effectively mitigates Q-value overestimation, it relies on single-step bootstrapping, making it susceptible to myopic policy learning and delayed association between early actions and their long-term consequences15. In the context of ACC, where decisions such as accelerating, decelerating, or maintaining speed affect vehicle dynamics over several seconds, this short-sighted learning horizon proves suboptimal.

To overcome this limitation, we extend the standard temporal difference learning rule by introducing n-step returns, thereby enriching the target signal with a sequence of accumulated rewards. The updated target value used in Double-MS DQN is:

$y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\,r_{t+k} + \gamma^{n}\,Q\!\left(s_{t+n},\, \arg\max_{a'} Q(s_{t+n}, a'; \theta);\; \theta^{-}\right) \quad (11)$

In our implementation, the multi-step return horizon was empirically set to $n = 5$, balancing the trade-off between reward propagation depth and variance amplification. This setting was determined through extensive sensitivity analysis conducted across multiple randomized traffic scenarios in the CARLA environment, as will be detailed in later sections.

The multi-step return horizon n was selected based on a preliminary tuning process that confirmed the well-established bias-variance trade-off in reinforcement learning. Initial experiments comparing different horizon lengths revealed distinct performance characteristics. Lower values of n (e.g., n = 1) exhibited high bias, resulting in slower reward propagation and convergence to suboptimal policies with higher steady-state errors. Conversely, larger horizons (e.g., n = 10) introduced excessive variance due to the accumulation of stochastic rewards over long trajectories, which destabilized the training process. The value of n = 5 was identified as the optimal setting, providing a balanced trade-off that enabled rapid learning while maintaining the stability required for robust policy convergence. The results are summarized in Table 2, followed by a sketch of the n-step accumulation.

Table 2.

Sensitivity analysis of the multi-step horizon n.

n Avg Reward per Step Avg Error per Step (m)
1 −9.55 2.03
2 −7.64 1.62
3 −5.73 1.22
4 −4.78 1.02
5 −3.86 0.91
7 −4.29 0.99
10 −4.78 1.02
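
A minimal sketch of how the n-step return of Eq. (11) can be accumulated with a sliding window follows; episode-boundary handling is simplified for brevity, and the record it emits would feed the Double DQN target at horizon n.

```python
from collections import deque

N_STEP, GAMMA = 5, 0.99  # horizon from Table 2; discount from Table 3

class NStepAccumulator:
    """Slide a window of transitions and emit n-step records for Eq. (11)."""
    def __init__(self, n: int = N_STEP, gamma: float = GAMMA):
        self.n, self.gamma = n, gamma
        self.window = deque(maxlen=n)

    def push(self, s, a, r, s_next, done):
        """Add one transition; return an n-step record once enough are buffered."""
        self.window.append((s, a, r, s_next, done))
        if len(self.window) < self.n and not done:
            return None
        # Discounted sum of the buffered rewards.
        g = sum((self.gamma ** k) * t[2] for k, t in enumerate(self.window))
        s0, a0 = self.window[0][0], self.window[0][1]
        s_n, done_n = self.window[-1][3], self.window[-1][4]
        # (s0, a0, g, s_n, done_n) bootstraps from the state n steps ahead.
        return s0, a0, g, s_n, done_n
```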

The underlying neural network architecture remains consistent with previous implementations, comprising two fully connected hidden layers of 256 neurons with ReLU activations. However, the learning dynamics of Double-MS DQN differ significantly. By considering sequences of future rewards, the agent gains a richer understanding of action consequences, enabling it to anticipate future vehicle interactions and adapt its control policies accordingly.

The proposed Double-MS DQN was specifically designed to accelerate training convergence, reduce reward and error variance, and improve final control policy robustness. As demonstrated experimentally, our method achieves these goals with substantial margins over standard DQN, Double DQN, and Dueling DQN, confirming its value as a strong methodological advancement for reinforcement learning-driven adaptive cruise control.

Simulation environment and implementation setup

To train and evaluate the proposed reinforcement learning-based adaptive cruise control (ACC) agents, we utilized the CARLA simulator, an open-source, high-fidelity driving simulation platform widely adopted in autonomous vehicle research27. CARLA offers realistic vehicle dynamics, a wide selection of maps, support for synchronized simulation steps, and programmable actors, making it an ideal choice for testing sequential decision-making policies under realistic traffic conditions.

All experiments were conducted using the Town06 map, selected for its structured layout, long straight roads, and intersections well-suited to car-following scenarios. Both the ego and lead vehicles were consistently modeled as Tesla Cybertrucks to ensure comparable mass, acceleration characteristics, and control behavior. The simulator operated in synchronous mode with a fixed simulation step of $\Delta t = 0.01$ seconds, which provided deterministic physics and high temporal resolution necessary for stable reinforcement learning.

Training was conducted for 1,000 episodes, each lasting up to 2,000 time steps. An experience replay buffer of 200,000 transitions was used for off-policy updates, with mini-batches of size 64 sampled per step. The Q-network weights were optimized using the Adam optimizer with a learning rate of 1 × 10⁻³, and a separate target network was updated every 5 episodes to stabilize training. An ε-greedy exploration strategy was adopted, where ε decayed linearly from 1.0 to 0.05 throughout training to gradually shift from exploration to exploitation, as sketched below.
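
As a concrete illustration of this exploration schedule, the sketch below decays ε linearly over the 1,000 training episodes; the per-episode granularity of the decay is an assumption, since the paper does not state whether decay occurs per step or per episode.

```python
import random

EPS_START, EPS_END, N_EPISODES = 1.0, 0.05, 1000

def epsilon(episode: int) -> float:
    """Linear decay from 1.0 to 0.05 across training (assumed per-episode)."""
    frac = min(episode / N_EPISODES, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, episode: int) -> int:
    """Epsilon-greedy choice over the three discrete actions."""
    if random.random() < epsilon(episode):
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```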

All hyperparameters used in training the proposed Double DQN with Multi-Step Return agent are summarized in Table 3. This configuration reflects best practices in value-based deep reinforcement learning, and was tuned to balance convergence stability, computational feasibility, and policy effectiveness.

Table 3.

Hyperparameters for training.

Parameter Value
Learning Rate 1 × 10⁻³
Discount Factor (γ) 0.99
Batch Size 64
Replay Buffer Size 200,000 transitions
Multi-Step Return Horizon (n) 5
Target Network Update Frequency Every 5 episodes
Number of Training Episodes 1000
Maximum Steps per Episode 2000
Exploration Strategy ε-greedy (ε decays from 1.0 to 0.05)

Stochastic scenario generation

A major innovation in this study lies in the design and integration of a stochastic, behavior-randomized scenario generation system, which drives the training and testing of reinforcement learning-based ACC agents under highly dynamic and unpredictable traffic conditions. Unlike conventional RL-ACC studies that rely on fixed or overly simplistic lead vehicle patterns3,17,19, our work introduces a probabilistic behavior-switching mechanism, fully implemented within the CARLA simulator, to simulate diverse human-like traffic behaviors that challenge the adaptability and robustness of learned control policies.

At the core of this framework is a pre-generated scenario dataset, consisting of a library of distinct driving sequences. Each scenario defines a sequence of throttle-brake command pairs for the lead vehicle, capturing a diverse range of motion behaviors. During training or evaluation, one scenario is randomly selected at the start of each episode, and the lead vehicle strictly follows the predefined behavior script. The script includes a series of transitions, simulating changes in driving intent over the episode duration.

Behavior transitions are triggered at fixed intervals within each episode, emulating shifts in the lead vehicle’s driving strategy. At each transition point, a new driving behavior is selected based on a probabilistic sampling rule. This design generates a rich variety of realistic driving behaviors—including smooth cruising, sharp decelerations, passive coasting, and stop-and-go transitions. The control actions are applied continuously throughout the episode and are only updated at the designated transition points, unless overridden by a scheduled behavior change. This structure enforces temporal consistency and avoids abrupt random jitter, producing realistic driving profiles that evolve predictably over short intervals.

By combining random initialization, predefined scenario scripts, and structured behavioral transitions, this framework forces the ego vehicle to learn adaptive control strategies that generalize across widely varying traffic behaviors. It also supports robust testing by allowing evaluation across unseen randomized test scenarios. By exposing the agent to stochastic variations during both training and evaluation, the randomized scenario generator promotes improved exploration during learning and more reliable generalization during deployment.

Evaluation framework

To rigorously assess the performance of the proposed reinforcement learning-based ACC strategies, we developed a comprehensive evaluation framework focused on four key dimensions: safety, comfort, control smoothness, and robustness across diverse scenarios. This framework was applied consistently during both the training and testing phases to evaluate all algorithms (Standard DQN, Double DQN, Dueling DQN, and the proposed Double-MS DQN) under a wide range of randomized and dynamic traffic conditions.

Safety is one of the primary evaluation targets in any ACC system. In this study, safety was measured by assessing how accurately the ego vehicle maintained an appropriate distance from the lead vehicle, relative to the calculated safe-following distance. Two standard error metrics were used: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). These metrics compare the actual relative distance $d_{rel}$ between the ego and lead vehicles to the computed safe-following distance $d_{safe}$ over each episode of length $T$. The formulas are given by:

$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|\,d_{rel,t} - d_{safe,t}\,\right| \quad (12)$
$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(d_{rel,t} - d_{safe,t}\right)^{2}} \quad (13)$

Lower MAE and RMSE values indicate better compliance with safe distance requirements, enhancing the safety and stability of the learned driving behavior. Beyond maintaining safe distances, it is also essential to ensure the driving experience remains smooth and comfortable, especially for electric vehicles.

Comfort and control smoothness are measured using the standard deviation of the control input $u_t$, which governs throttle and brake. Rather than computing jerk explicitly via second-order acceleration derivatives, we treat the variability in $u_t$ as a practical proxy for comfort. This method is commonly adopted in simulation environments and aligns with vehicle dynamics modeling practices30. A low standard deviation indicates smoother transitions in control actions, improving both passenger experience and energy efficiency, especially relevant for electric vehicles. While control smoothness serves as a practical proxy for efficiency, we also explicitly quantify the energy expenditure using the Specific Energy Consumption (SEC) index. The SEC approximates the mechanical energy required for traction by integrating the positive power demand over the episode:

$\mathrm{SEC} = \sum_{t=1}^{T} \max\!\left(0,\; v_t\,a_t\right)\Delta t \quad (14)$

where $v_t$ and $a_t$ represent velocity and acceleration, and only positive power is summed to account for the variable efficiency of regenerative braking. In addition to safety and comfort, the ability of an agent to generalize across different traffic scenarios is a critical hallmark of its robustness.
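
For concreteness, the sketch below gathers the episode-level metrics of Eqs. (12)–(14), together with the control-smoothness proxy, from per-step logs; the array arguments are illustrative assumptions about how those logs are stored.

```python
import numpy as np

def episode_metrics(d_rel, d_safe, u, v, a, dt=0.01):
    """Per-episode MAE, RMSE, control-std proxy, and SEC, Eqs. (12)-(14)."""
    err = np.asarray(d_rel) - np.asarray(d_safe)
    mae = np.mean(np.abs(err))                    # Eq. (12)
    rmse = np.sqrt(np.mean(err ** 2))             # Eq. (13)
    u_std = np.std(np.asarray(u))                 # smoothness/comfort proxy
    power = np.asarray(v) * np.asarray(a)         # per-step power demand
    sec = np.sum(np.clip(power, 0.0, None)) * dt  # Eq. (14): positive power only
    return {"MAE": mae, "RMSE": rmse, "u_std": u_std, "SEC": sec}
```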

Robustness and generalization are evaluated by deploying the trained agents across ten randomized testing scenarios not seen during training. These include a mix of lead vehicle behaviors such as stop-and-go motion, sudden braking, and unpredictable accelerations. For each test scenario, the agent’s performance is measured and aggregated to assess how well it generalizes to new environments. We report not only the average cumulative reward, MAE, and RMSE, but also their standard deviations across the ten test cases to reflect consistency and robustness. Lower variability across scenarios indicates more reliable and transferable policy behavior. Furthermore, to evaluate the reliability of policies during the later stages of training, we assessed training stability through reward and error variance.

To assess training stability, we calculate the reward standard deviation and distance error variability over the final 500 episodes of training. This period corresponds to post-convergence behavior, where consistent performance is expected. Algorithms that exhibit low variance in this stage are considered to have learned more stable and dependable control policies. Finally, learning efficiency was explicitly measured by evaluating convergence timing based on normalized reward progression.

A critical component of our framework is the evaluation of policy convergence. To enable fair comparison across different reward scales, we normalize the cumulative reward curves of all agents and define convergence as the episode where the agent consistently achieves 90% of the maximum normalized reward. This approach eliminates bias due to absolute scale differences and focuses instead on the relative speed and efficiency of learning. Faster convergence indicates superior sample efficiency and adaptability.
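
A sketch of this convergence criterion follows; the smoothing window and the interpretation of "consistently" (the normalized reward remains at or above the threshold for the remainder of training) are assumptions made for illustration.

```python
import numpy as np

def convergence_episode(rewards, threshold=0.9, window=10):
    """First episode after which the normalized reward stays above 90%."""
    r = np.asarray(rewards, dtype=float)
    r = np.convolve(r, np.ones(window) / window, mode="valid")  # smooth
    r = (r - r.min()) / (r.max() - r.min() + 1e-12)             # normalize to [0, 1]
    above = r >= threshold
    for i in range(len(above)):
        if above[i:].all():          # remains above the threshold thereafter
            return i + window - 1    # map smoothed index back to episodes
    return None
```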

We further assess relative performance improvements by computing the percentage difference in both average reward and error between the proposed Double-MS algorithm and each baseline method. This relative metric provides an intuitive measure of how much better the proposed method performs, beyond statistical averages alone.

All evaluations were conducted using a consistent protocol to ensure fair comparison and reproducibility. Each algorithm was trained for 1,000 episodes and then evaluated across ten distinct, randomized test scenarios designed using the same stochastic behavior framework. Episodes terminated prematurely due to imminent collision risk were excluded from the evaluation dataset to ensure consistent assessment based solely on complete driving trajectories. For every scenario, all evaluation metrics including cumulative reward, MAE, RMSE, control signal variability, and their respective variances were computed and averaged across episodes. This procedure ensures that performance assessments reflect not only learning effectiveness, but also adherence to safety, comfort, and generalization standards under varied and realistic driving conditions.

To further validate the reliability of the observed improvements, we conducted both one-way ANOVA and pairwise t-tests31 on the tracking error distributions during the final stage of training. The ANOVA test assessed whether there were significant differences across all four evaluated algorithms, while the t-tests examined whether the performance of Double-MS was statistically better than each baseline31. Statistical significance was established at a threshold of $p < 0.05$, ensuring that differences were unlikely to have occurred by chance.
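
This testing pipeline can be sketched with standard SciPy routines; the input arrays are assumed to hold the per-episode tracking errors of each algorithm over the final training stage.

```python
from scipy import stats

def significance_tests(err_ms, err_double, err_dueling, err_standard, alpha=0.05):
    """One-way ANOVA across all four algorithms, then pairwise t-tests."""
    f_stat, p_anova = stats.f_oneway(err_ms, err_double, err_dueling, err_standard)
    results = {"anova": (f_stat, p_anova, p_anova < alpha)}
    for name, errs in [("double", err_double), ("dueling", err_dueling),
                       ("standard", err_standard)]:
        t_stat, p = stats.ttest_ind(err_ms, errs)  # Double-MS vs. baseline
        results[f"t_vs_{name}"] = (t_stat, p, p < alpha)
    return results
```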

By combining consistent evaluation protocols, performance tracking, error and reward metrics, stability variance, convergence timing, generalization testing, and formal hypothesis testing, this evaluation framework offers a rigorous, comprehensive, and reproducible assessment of reinforcement learning strategies for ACC in highly variable driving environments.

Specifically, the lead vehicle behavior is controlled by a custom logic module that switches between multiple predefined driving modes, each representing a distinct driving style or maneuver. These include:

  • Cruising at constant moderate speed.

  • Stop-and-go with smooth deceleration and acceleration cycles.

  • Abrupt braking, simulating emergency situations.

  • Sudden acceleration bursts, mimicking aggressive drivers.

The switching logic follows a stochastic rule: every 400 simulation steps (i.e., 4 s), the lead vehicle randomly selects a new behavior mode, introducing sudden transitions and unpredictability into the environment. This design forces the ego vehicle to adapt its control policy to constantly changing dynamics, a realistic challenge in urban driving contexts.

To further increase variability, initial speeds, spawn positions, and transition probabilities are randomized at the start of each episode. This generates a unique sequence of interactions in every run, effectively preventing overfitting to a fixed environment and ensuring broader learning coverage. The design of this stochastic testing strategy reflects an intentional effort to bridge the simulation-to-reality gap by simulating human-like uncertainty in lead vehicle behavior22.
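
A simplified sketch of the behavior-switching logic follows; the per-mode throttle/brake values are illustrative placeholders, and the actual scenario library is pre-generated rather than sampled on the fly as shown here.

```python
import random

# (throttle, brake) pairs per mode are illustrative assumptions.
MODES = {
    "cruise":       (0.4, 0.0),  # constant moderate speed
    "stop_and_go":  (0.3, 0.3),  # deceleration/acceleration cycles (simplified)
    "abrupt_brake": (0.0, 0.9),  # emergency-style braking
    "burst":        (0.9, 0.0),  # sudden acceleration
}
SWITCH_EVERY = 400  # steps, i.e. 4 s at dt = 0.01 s

def lead_controls(n_steps: int, seed=None):
    """Yield one (throttle, brake) command per step, re-sampling the mode
    every SWITCH_EVERY steps."""
    rng = random.Random(seed)
    mode = rng.choice(list(MODES))
    for step in range(n_steps):
        if step > 0 and step % SWITCH_EVERY == 0:
            mode = rng.choice(list(MODES))  # stochastic behavior switch
        yield MODES[mode]
```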

A consistent randomized framework was applied across both the training and testing phases, enabling robust evaluation under multiple scenarios. Policy performance was assessed using ten unique, randomized test episodes unseen during training, following the evaluation protocol established in previous multi-scenario studies22,32. Key evaluation metrics included cumulative reward, reflecting overall policy effectiveness; Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), indicating safety in maintaining the desired headway; control smoothness, approximated by the standard deviation of the control signal u as a proxy for jerk; and robustness and consistency, measured by the standard deviation of these metrics across test episodes.

By combining carefully randomized behavioral logic with comprehensive safety and performance metrics, our stochastic scenario framework offers a more rigorous and generalizable evaluation method for adaptive cruise control than what is typically found in the literature. The impact of this design is evidenced by the observed differences in agent behavior across scenarios, and the superior generalization of the proposed Double-MS agent.

Results

This section presents the experimental results for training and testing the proposed Multi-Step Return Enhanced Double DQN (Double-MS) agent for ACC in EVs. The results are structured into training phase analysis, statistical significance evaluation, testing under randomized scenarios, control smoothness analysis, and overall generalization consistency assessment.

Training performance analysis

The training process was carefully analyzed by monitoring two critical metrics: the average reward per step (dimensionless) and the average headway error per step (in meters). These metrics provide insight into learning efficiency, convergence behavior, and control precision of each algorithm over 1,000 training episodes. To suppress high-frequency noise and highlight underlying trends, a two-stage moving-average smoothing was applied using window sizes of 100 and then 10 episodes. For enhanced readability, the results are presented over three segments: the full training span (episodes 1–1000), the convergence phase (episodes 301–1000), and the post-convergence stabilization phase (episodes 701–1000). Figures 2, 3 and 4 present the reward evolution across these intervals, while Figs. 5, 6 and 7 illustrate the corresponding headway error trajectories.
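
For reference, the two-stage smoothing can be expressed as chained rolling means; the pandas formulation below is an illustrative equivalent of the described 100-then-10 episode windows.

```python
import pandas as pd

def smooth_curve(values):
    """Two-stage moving average: window 100, then window 10 episodes."""
    s = pd.Series(values)
    return (s.rolling(100, min_periods=1).mean()
             .rolling(10, min_periods=1).mean())
```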

Fig. 2. Average reward per step, episodes 1–1000 (moving-average smoothing window 100 → 10).

Fig. 3. Average reward per step, episodes 301–1000 (moving-average smoothing window 100 → 10).

Fig. 4. Average reward per step, episodes 701–1000 (moving-average smoothing window 100 → 10).

Fig. 5. Average headway error per step, episodes 1–1000 (moving-average smoothing window 100 → 10).

Fig. 6. Average headway error per step, episodes 301–1000 (moving-average smoothing window 100 → 10).

Fig. 7. Average headway error per step, episodes 701–1000 (moving-average smoothing window 100 → 10).

Figure 2 shows the evolution of the average reward per step over the complete training span (episodes 1–1000). All algorithms initially exhibit highly negative reward values, reflecting unstable or unsafe vehicle-following behavior. However, Double-MS quickly distinguishes itself by rising toward higher rewards far earlier than the baseline methods.

Its reward curve shows a steep and sustained increase during the early episodes, achieving a near-optimal reward level significantly earlier than Double DQN, Dueling DQN, and Standard DQN, which demonstrate much slower progress. This trend indicates that Double-MS is able to learn effective control strategies faster and with greater consistency during the initial training phase.

Figure 3 focuses on episodes 301–1000, capturing the convergence behavior in more detail. Double-MS maintains a consistently higher reward during this phase, stabilizing early and exhibiting minimal fluctuations. In contrast, the baseline algorithms continue gradual improvements but remain below the steady performance achieved by Double-MS. Furthermore, while the baselines demonstrate some degree of stabilization, they exhibit larger residual oscillations in reward values, indicating less robust policy convergence compared to Double-MS.

In the post-convergence phase (episodes 701–1000), illustrated in Fig. 4, all algorithms reach a plateau; however, the distinction between them remains clear. Double-MS consistently sustains the highest reward values, confirming not only faster convergence but also a superior steady-state performance. This persistent advantage suggests that Double-MS learns a more effective and durable policy compared to its baseline counterparts.

Figures 5, 6 and 7 show the evolution of the average headway error per step across the same training intervals. As depicted in Fig. 5 (episodes 1–1000), Double-MS rapidly reduces headway error within the early episodes, achieving values below 0.5 m substantially earlier than the baselines. In contrast, Double, Dueling, and Standard DQN require considerably more training episodes to achieve similar error levels, demonstrating slower adaptation to the desired car-following behavior.

During the convergence phase (episodes 301–1000), shown in Fig. 6, Double-MS maintains a consistently low error profile with very limited oscillations. Meanwhile, the baseline algorithms continue to display higher tracking errors with more significant fluctuations, reflecting less stable and less precise control behaviors over the same period. This observation further supports the notion that Double-MS converges not only faster but also to a more reliable policy.

Finally, Fig. 7 examines the post-convergence behavior (episodes 701–1000). Double-MS holds the lowest steady-state error throughout the stabilization phase, while the baseline algorithms exhibit larger residual variations in tracking error. Notably, Dueling DQN shows significant instability during the later episodes, suggesting difficulties in maintaining accurate vehicle-following behavior even after long training periods.

The analysis of training curves highlights two critical advantages of the proposed Double-MS method over baseline algorithms: First, Double-MS exhibits accelerated learning, achieving stable high-reward performance and low-error control in significantly fewer episodes. Second, Double-MS maintains superior steady-state performance after convergence, with higher rewards and lower tracking errors than its competitors. These findings collectively demonstrate that the incorporation of multi-step returns into Double DQN significantly enhances both the learning speed and the final control quality in adaptive cruise control tasks.

Convergence time analysis

The convergence time of reinforcement learning agents is a critical indicator of their learning efficiency and practical applicability. Faster convergence implies reduced training time, lower computational cost, and quicker deployment of intelligent driving systems in real-world environments. To conduct a fair comparison, the average reward per step achieved by each model was normalized between 0 and 1, eliminating the effect of different raw reward scales. This normalization enables a clear, relative assessment of learning progress across all algorithms. Figure 8 presents the normalized learning curves for the Double-MS, Double DQN, Dueling DQN, and Standard DQN agents.

Fig. 8. Normalized average reward per step for all algorithms during training, with markers indicating each algorithm's convergence episode.

Dashed vertical lines on each curve indicate the point at which each model reaches 90% of its maximum normalized reward, which is defined as the convergence point. From Fig. 8, it is evident that Double-MS outperforms the other models by achieving convergence at a significantly earlier episode (73 episodes), compared to 186 episodes for Double DQN, 258 episodes for Dueling DQN, and 277 episodes for Standard DQN.

The steep initial ascent and earlier stabilization of the Double-MS curve highlight its superior learning dynamics and more efficient policy optimization process.

Figure 9 complements this finding by presenting a bar chart that compares the convergence episode numbers across models. The clear gap between Double-MS and the baselines underscores the effectiveness of incorporating multi-step returns within the Double DQN framework. To further quantify the improvement, Table 4 summarizes the convergence episodes and presents the relative percentage improvement of Double-MS compared to each baseline model, computed as the relative reduction in convergence episodes (e.g., (186 − 73)/186 ≈ 60.75% for Double DQN). Specifically, Double-MS achieves convergence approximately 60.75% faster than Double DQN, 71.71% faster than Dueling DQN, and 73.65% faster than Standard DQN. These substantial margins confirm that the proposed Double-MS agent is not only more effective but also significantly more efficient in terms of training time. Accelerating convergence is highly advantageous for real-world reinforcement learning applications, where computational resource constraints and training duration are often critical bottlenecks.

Fig. 9. Comparison of convergence episode numbers for different algorithms.

Table 4.

Convergence speed improvement of Double-MS compared to baselines.

Compared Algorithm Convergence Episode Improvement of Double-MS (%)
Double-MS 73 -
Double DQN 186 60.75%
Dueling DQN 258 71.71%
Standard DQN 277 73.65%

Reward and error variance analysis

In addition to faster convergence, the stability of an agent’s behavior during and after training is crucial for reliable deployment. Thus, the reward variance and error variance over the final 500 episodes (episodes 501–1000) were analyzed for each model. Low variance in reward and error indicates more consistent performance and better generalization, whereas high variance may suggest instability, erratic behavior, or sensitivity to environment noise.

Table 5 summarizes the average reward per step, the standard deviation (Std Dev) of the reward, the average error per step (measured as the distance error in meters), and the standard deviation of the error for each algorithm over the last 500 episodes. Note: the average rewards are negative due to the design of the reward function, where penalties are applied for undesirable behaviors; therefore, higher (less negative) reward values correspond to better policy performance.

The Double-MS model exhibits both the lowest average error and the lowest standard deviation in both reward and error metrics. Specifically, it achieved an average error per step of 0.91 m with an error standard deviation of only 0.11 m, compared to significantly higher values in the baseline models. Similarly, the reward standard deviation for Double-MS was 0.59, much lower than Double DQN (7.46), Dueling DQN (9.25), and Standard DQN (4.14). This indicates that Double-MS not only converges faster but also maintains a much more stable and robust performance during the late stages of training.

Table 5.

Reward and error stability analysis (Episodes 501–1000).

Algorithm Avg Reward per Step Reward Std Dev Avg Error per Step (m) Error Std Dev (m)
Double-MS −3.86 0.59 0.91 0.11
Double −9.55 7.46 2.03 1.50
Dueling −10.62 9.25 2.24 1.84
Standard −7.07 4.14 1.53 0.81

Statistical significance test

To ensure the reliability of the observed improvements, statistical significance testing was performed. A one-way ANOVA analysis was conducted to compare the per-step error performance across the four evaluated algorithms during the final 500 episodes of training. The ANOVA results yielded an F-statistic of 110.99, substantially greater than the critical F-value of 2.609, and a corresponding p-value of approximately 1.79 × 10⁻⁶⁶, far below the 0.05 significance threshold. These results confirm that the differences between the algorithms are statistically significant, and the null hypothesis of equal performance can be confidently rejected. Additionally, an independent t-test was conducted between Double-MS and each baseline algorithm. The resulting p-values were consistently less than 0.05, further reinforcing that the performance improvements achieved by Double-MS in terms of lower headway error are statistically significant compared to Double DQN, Dueling DQN, and Standard DQN baselines. A summary of the ANOVA findings is provided in Table 6. These statistical findings validate the superiority of the proposed Double-MS controller from a rigorous analytical perspective, demonstrating that the observed gains are unlikely to have occurred by chance.

Table 6. One-way ANOVA test results.

Test          | F-statistic | F-critical | p-value       | Significance
One-way ANOVA | 110.99      | 2.609      | 1.79 × 10⁻⁶⁶  | Significant
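
For readers reproducing the analysis, both tests map onto standard SciPy routines, as in the minimal sketch below; the placeholder samples stand in for the real per-step error logs, and the Welch correction (equal_var=False) is our choice where the paper does not state the t-test variant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder per-step error samples for the four agents (episodes
# 501-1000); real values come from the training logs, not random draws.
err_ms = np.abs(rng.normal(0.91, 0.11, 5000))
err_double = np.abs(rng.normal(2.03, 1.50, 5000))
err_dueling = np.abs(rng.normal(2.24, 1.84, 5000))
err_standard = np.abs(rng.normal(1.53, 0.81, 5000))

f_stat, p_anova = stats.f_oneway(err_ms, err_double, err_dueling, err_standard)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.2e}")

# Pairwise independent t-tests of Double-MS against each baseline
for name, err in [("Double DQN", err_double), ("Dueling DQN", err_dueling),
                  ("Standard DQN", err_standard)]:
    t, p = stats.ttest_ind(err_ms, err, equal_var=False)
    print(f"Double-MS vs {name}: t = {t:.2f}, p = {p:.2e}")
```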

Relative improvement analysis

To further quantify the benefits of the proposed Double‑MS algorithm beyond absolute performance metrics, we analyzed the relative improvement in both the average reward per step and the average headway error per step compared to the baseline algorithms (Double DQN, Dueling DQN, and Standard DQN). This evaluation was based on the final 500 training episodes (episodes 501–1000).

The relative improvement (%) was computed as the gain of Double-MS over each baseline normalized by the baseline value, i.e., Improvement = (|baseline| − |Double-MS|) / |baseline| × 100, applied to the per-step reward magnitudes (rewards being negative) and to the per-step errors; a higher percentage therefore indicates a greater performance gain. Table 7 summarizes the improvement achieved by Double-MS over each baseline:

Table 7. Relative improvement percentages of Double-MS over baseline algorithms.

Compared Algorithm | Reward Improvement (%) | Error Improvement (%)
Double DQN         | 59.63                  | 55.17
Dueling DQN        | 63.70                  | 59.43
Standard DQN       | 45.46                  | 40.62
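
The computation behind Table 7 reduces to a one-line formula; the sketch below applies it to the Table 5 per-step averages (the function name is ours).

```python
def relative_improvement(baseline, proposed):
    """Percentage gain of the proposed agent over a baseline.

    Rewards are negative (penalty-based), so magnitudes are compared;
    error values are already positive.
    """
    return (abs(baseline) - abs(proposed)) / abs(baseline) * 100.0

print(relative_improvement(-9.55, -3.86))  # ~59.6% reward gain vs Double DQN
print(relative_improvement(2.03, 0.91))    # ~55.2% error reduction vs Double DQN
```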

Figure 10 provides a visual representation of the relative improvements, using a grouped bar chart to compare reward and error improvements across the baselines. Double-MS consistently outperforms all baselines on both metrics, achieving reward improvements of approximately 60–64% and error reductions of approximately 55–59% relative to Double and Dueling DQN, and lower but still substantial improvements (45.46% and 40.62%, respectively) against Standard DQN. This analysis strongly supports that the proposed Double-MS structure not only accelerates training convergence but also produces better final policies, with higher reward gains and lower headway errors.

Fig. 10. Grouped bar chart of Double-MS improvements in average reward and headway error per step over baseline algorithms.

Testing phase evaluation

Following the completion of the training phase evaluation, a comprehensive analysis of the testing phase was conducted to assess the generalization capability, tracking precision, and control quality of the trained policies in unseen scenarios. The four algorithms—Standard DQN, Double DQN, Dueling DQN, and the proposed Double DQN with Multi-Step Returns (Double-MS)—were systematically tested across ten randomized driving scenarios under identical environmental conditions. Key evaluation metrics, including cumulative reward, mean absolute distance error (MAE), root mean square error (RMSE), average jerk, and control smoothness, were used to provide a multi-dimensional assessment of performance. The following sections present a detailed comparison of the agents’ behaviors across these testing scenarios, highlighting the improvements achieved by the Double-MS model relative to baseline methods.

Overall testing performance

The overall performance during the testing phase was evaluated by analyzing the cumulative reward, mean absolute distance error (MAE), and root mean square error (RMSE) achieved by each algorithm across ten randomized scenarios. The cumulative reward serves as a global indicator of driving efficiency and policy stability, while MAE and RMSE reflect the precision and consistency of distance tracking behaviors.
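
As a concrete reference for how these metrics are typically computed from logged traces, the sketch below shows one minimal implementation; the function and argument names are ours, and the 100 Hz control period is taken from the control frequency discussed later in the paper.

```python
import numpy as np

def tracking_metrics(distance_error, accel, dt=0.01):
    """MAE and RMSE of the headway error, plus average absolute jerk.

    distance_error: per-step gap error in metres
    accel:          per-step longitudinal acceleration in m/s^2
    dt:             control period in seconds (0.01 s = 100 Hz assumed)
    """
    distance_error = np.asarray(distance_error, dtype=float)
    accel = np.asarray(accel, dtype=float)
    mae = np.mean(np.abs(distance_error))
    rmse = np.sqrt(np.mean(distance_error**2))
    jerk = np.mean(np.abs(np.diff(accel) / dt))  # m/s^3
    return mae, rmse, jerk
```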

As illustrated in Fig. 11, the Double-MS agent consistently achieved higher cumulative rewards compared to both Standard DQN and Double DQN, indicating more stable and efficient longitudinal control. Although Dueling DQN achieved slightly higher cumulative rewards in some scenarios, the performance of Double-MS remained closely competitive, highlighting the effectiveness of the multi-step enhancement in improving overall driving performance.

Fig. 11. Cumulative reward comparison across testing scenarios.

Distance tracking accuracy, measured through MAE and RMSE, further confirmed the superiority of Double-MS. As shown in Figs. 12 and 13, Double-MS consistently maintained lower MAE and RMSE values compared to Standard DQN and Double DQN across all scenarios, demonstrating better adherence to safe following distances. While Dueling DQN achieved marginally lower MAE values in certain scenarios, Double-MS exhibited greater overall consistency in minimizing tracking errors, particularly under highly dynamic lead vehicle behaviors.

Fig. 12. MAE comparison across testing scenarios.

Fig. 13. RMSE comparison across testing scenarios.

A detailed summary of the testing results, including cumulative reward, MAE, and RMSE for each algorithm, is presented in Table 8. These findings collectively indicate that Double-MS delivers a robust improvement in reward acquisition and distance tracking precision compared to baseline algorithms, validating its enhanced generalization capability to unseen environments.

Table 8. Testing phase results (mean ± std) of cumulative reward, MAE, and RMSE for each algorithm.

Algorithm    | Cumulative Reward (mean ± std) | MAE (m) (mean ± std) | RMSE (m) (mean ± std)
Double-MS    | −7732.7 ± 3233.0               | 0.412 ± 0.146        | 0.623 ± 0.242
Double DQN   | −9411.4 ± 1916.2               | 0.544 ± 0.076        | 0.822 ± 0.126
Dueling DQN  | −5220.4 ± 2211.2               | 0.344 ± 0.087        | 0.620 ± 0.126
Standard DQN | −7255.8 ± 1825.1               | 0.451 ± 0.077        | 0.890 ± 0.096

Control efficiency analysis

The evaluation of control quality during the testing phase focused on two key aspects: the variability of control actions and the overall distribution of throttle, brake, and neutral commands. As shown in Fig. 14, the Double-MS agent achieved the lowest control input variability among all evaluated algorithms, reflected by the minimal standard deviation in control values. This indicates smoother, more stable throttle and brake behavior, contributing to a more natural and energy-efficient driving style. In contrast, the Standard DQN agent exhibited the highest control variability, highlighting less consistent decision-making and more aggressive driving actions. To confirm that this reduction in control variability translates to tangible efficiency gains, we computed the SEC index defined in Eq. (13). The proposed Double-MS agent recorded an average energy consumption approximately 12% lower than the Standard DQN and 8% lower than Double DQN. This quantitative result validates that the smoother control policy effectively mitigates the energy-draining micro-accelerations observed in the baseline agents.

Fig. 14. Control input standard deviation comparison.

Complementary insights into control behavior were derived from the control action usage distribution summarized in Table 9. Across all agents, throttle usage dominated as expected due to the longitudinal nature of the driving task. However, Double-MS maintained the most efficient distribution, combining high throttle usage (96.85%) with the lowest neutral usage (0.66%), and a moderate brake application (2.48%). This balance reflects a more deliberate and continuous following behavior, reducing unnecessary deceleration and energy waste compared to baseline agents, particularly the Standard DQN, which exhibited a higher neutral usage ratio (3.38%). Overall, the Double-MS agent demonstrated superior control stability and a more optimized control action profile, achieving both smoother and more efficient longitudinal driving behavior across diverse testing scenarios.

Table 9. Control action usage distribution (mean ± standard deviation) across throttle, brake, and neutral states.

Algorithm    | Throttle Usage (% ± std) | Brake Usage (% ± std) | Neutral Usage (% ± std)
Double-MS    | 96.85 ± 0.78             | 2.48 ± 0.74           | 0.66 ± 0.63
Double DQN   | 95.36 ± 1.24             | 2.33 ± 0.86           | 2.30 ± 1.45
Dueling DQN  | 96.03 ± 1.09             | 2.92 ± 0.55           | 1.10 ± 0.42
Standard DQN | 94.39 ± 1.06             | 2.23 ± 0.48           | 3.38 ± 1.17
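
One way to obtain such a distribution from the unified control signal is sketched below; mapping the sign of u to throttle/brake/neutral with a small dead band eps is our assumption about how the categories are counted, not a detail stated by the authors.

```python
import numpy as np

def action_usage(u, eps=1e-3):
    """Percentage of steps spent in throttle, brake, and neutral.

    u: per-step unified control values in [-1, 1]; values above +eps are
    counted as throttle, below -eps as brake, and the rest as neutral
    (assumed mapping).
    """
    u = np.asarray(u, dtype=float)
    throttle = float(np.mean(u > eps)) * 100.0
    brake = float(np.mean(u < -eps)) * 100.0
    return throttle, brake, 100.0 - throttle - brake
```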

Generalization consistency analysis

Beyond achieving strong average performance, consistent behavior across varied scenarios is critical for assessing the robustness and reliability of adaptive cruise control systems. To evaluate generalization consistency, the standard deviation (Std) of cumulative reward, mean absolute distance error (MAE), and root mean square error (RMSE) were analyzed across the ten randomized testing scenarios; the results are summarized in Table 10 and correspond to the ± terms reported in Table 8. In absolute terms, Double-MS shows a larger scenario-to-scenario spread than the baselines in all three metrics, while Standard DQN exhibits the smallest spread. This variability should, however, be read jointly with the mean results in Table 8: Double-MS pairs its spread with consistently low mean tracking errors across scenarios, and the training-phase analysis (Table 5) shows it to be by far the most stable learner. Taken together, the testing results indicate that the integration of multi-step returns within the Double DQN framework preserves strong mean accuracy under diverse, previously unseen traffic conditions, even though its per-scenario variability in the testing phase is comparable to or higher than that of the baselines.

Table 10. Standard deviation of cumulative reward, MAE, and RMSE across randomized testing scenarios for each evaluated algorithm.

Algorithm    | Reward Std | MAE Std (m) | RMSE Std (m)
Double-MS    | 3233.00    | 0.146       | 0.242
Double DQN   | 1916.20    | 0.076       | 0.126
Dueling DQN  | 2211.20    | 0.087       | 0.126
Standard DQN | 1825.10    | 0.077       | 0.096

Discussion

The comprehensive results of this study clearly demonstrate the effectiveness and superiority of the proposed Multi-Step Return Enhanced Double DQN (Double-MS) algorithm for ACC applications in EVs. Double-MS achieved significantly enhanced learning efficiency and stability compared to standard reinforcement learning algorithms (Standard, Double, and Dueling DQN), with convergence approximately 61–74% faster than the baseline methods (Table 4) and reduced variance in cumulative rewards and tracking errors (Table 5), addressing long-standing stability and hyperparameter-sensitivity issues reported in previous studies.

The incorporation of multi-step returns within the Double DQN architecture proved crucial for optimizing the reinforcement learning training phase, reducing computational resources, and improving real-world deployment feasibility, while the robust training behavior achieved through Double-MS is essential for scalable implementations. These training improvements translated directly into superior deployment performance: Double-MS consistently delivered higher cumulative rewards, lower tracking errors (MAE, RMSE), and smoother control actions across multiple randomized, unseen driving scenarios (Tables 6–8), demonstrating the strong generalization capability critical for real-world ACC systems. Smooth control actions, characterized by minimal jerk and stable throttle-brake transitions (Fig. 14, Table 9), further enhance passenger comfort and driving safety, in line with recent vehicle control standards.

The Double-MS agent’s ability to achieve smoother control actions, as evidenced by its lower control input standard deviation (Fig. 14), is particularly critical for EV applications. This reduction in control signal variability is directly analogous to ‘anti-jerk’ control, which has been shown not only to enhance passenger comfort but also to significantly reduce driveline oscillations and improve energy efficiency by minimizing abrupt torque demands on the electric motor33,34.

While continuous-action algorithms such as Soft Actor-Critic (SAC)35 and Twin Delayed DDPG (TD3)36 offer fine-grained control granularity, they often suffer from higher sample complexity and instability during the early training phases. Our results demonstrate that the reformulated discrete action space, when paired with the multi-step Double DQN enhancement, achieves a superior balance of convergence speed and control smoothness. For longitudinal ACC tasks, optimizing the training efficiency of stable, value-based methods is thus a highly resource-efficient pathway to deployment compared to employing complex continuous-action actor-critic architectures.

The effectiveness of the 3-action space (‘increase’, ‘decrease’, ‘maintain’) for emergency braking is a key finding. While not a direct ‘hard-brake’ action, this design proved sufficient for handling abrupt braking maneuvers in our stochastic testing, as evidenced by the strong MAE/RMSE performance (Table 8). This responsiveness stems from the interaction between the discrete action choice and the control increment Δu, applied at a high frequency (100 Hz). By selecting the ‘decrease’ action repeatedly, the agent can drive the unified control value u from full throttle (u = +1) to full brake (u = −1) in a fraction of a second. This design choice retains the responsiveness necessary for safe operation while benefiting from the stability and simplicity of a DQN-based framework; a minimal sketch of this discrete-to-continuous mapping follows.
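The sketch below makes this mapping concrete. It is a minimal illustration, not the authors' implementation: the increment DELTA_U = 0.05 and the function step_control are our assumptions, chosen only to show how repeated 'decrease' actions sweep u from +1 to −1 in a fraction of a second at 100 Hz.

```python
DELTA_U = 0.05   # control increment per step (illustrative assumption)
DT = 0.01        # 100 Hz control period, as stated in the text

def step_control(u, action):
    """Map the 3-action space onto the unified control value u in [-1, 1].

    action: 0 = decrease, 1 = maintain, 2 = increase
    u > 0 is interpreted as throttle, u < 0 as brake.
    """
    if action == 2:
        u = min(u + DELTA_U, 1.0)
    elif action == 0:
        u = max(u - DELTA_U, -1.0)
    return u  # action == 1 leaves u unchanged

# Example: starting from u = +1.0, applying action 0 repeatedly reaches
# u = -1.0 after 2 / DELTA_U = 40 steps, i.e. 40 * DT = 0.4 s.
```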
Compared to recent state-of-the-art methods such as SFRL-ACC, LGDRL, and VLM-RL, which primarily focus on deployment-phase safety and optimization, Double-MS uniquely emphasizes improvements in the RL training phase—specifically faster convergence, stable training, and robust reward estimation—thereby addressing a critical gap in current research. By complementing existing deployment-focused methodologies, our approach demonstrates that optimizing the training process yields substantial performance gains and competitive deployment results. These advancements have strong practical implications: faster convergence reduces computational costs, accelerates development cycles, and supports iterative design refinements, while improved generalization and control smoothness directly enhance the practicality, efficiency, and passenger comfort of ACC controllers in real-world EV applications, where smooth energy management is increasingly vital for extending range and preserving battery life.
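For completeness, the core update behind Double-MS, the n-step Double DQN bootstrap target, can be sketched as follows. This is a minimal illustration under the standard definitions15,29; the function and argument names are ours, not the authors' code.

```python
import numpy as np

def multistep_double_dqn_target(rewards, s_n, done, q_online, q_target,
                                gamma=0.99):
    """n-step Double DQN target (sketch).

    rewards: the n rewards r_t, ..., r_{t+n-1} along the sampled trajectory
    s_n:     the state reached after n steps
    done:    True if the episode terminated within the n steps
    q_online / q_target: callables mapping a state to a Q-value vector
    """
    n = len(rewards)
    g = sum(gamma**k * r for k, r in enumerate(rewards))  # n-step return
    if not done:
        a_star = int(np.argmax(q_online(s_n)))   # select action with online net
        g += gamma**n * q_target(s_n)[a_star]    # evaluate with target net
    return g
```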

Conclusion

This study proposed a reinforcement learning approach for ACC in EVs, integrating multi-step returns within a Double Deep Q-Network (Double-MS DQN) framework. The primary focus was explicitly on improving reinforcement learning (RL) training efficiency, stability, and robustness—areas significantly underexplored in current literature but essential for practical RL deployment. Through extensive multi-scenario simulations in the CARLA urban driving simulator, our proposed method demonstrated substantial advantages over standard RL architectures (Standard DQN, Double DQN, and Dueling DQN). Key contributions include:

  • Significantly Faster Learning: The Double-MS agent achieved convergence substantially faster than the baseline methods, reducing the computational resources and time required for policy development.

  • Improved Training Stability: The integration of multi-step returns notably reduced reward and error variance, providing robust and reliable policy learning, a critical factor for real-world ACC deployments.

  • Superior Deployment Performance: The trained Double-MS agent demonstrated excellent generalization capability across diverse and unseen scenarios, maintaining safe following distances, smoother acceleration profiles, and significantly improved control stability, directly enhancing passenger comfort and driving safety.

  • Complementary Advancement to State-of-the-Art: Unlike recent state-of-the-art methods focusing predominantly on deployment-phase optimization (such as SFRL-ACC, LGDRL, and VLM-RL), our approach explicitly optimized the RL training phase itself. This complementary focus is crucial for scalable, efficient, and practical real-world ACC systems.

These findings clearly confirm the hypothesis that explicit enhancements in the RL training phase not only accelerate policy development but also significantly enhance final driving performance. The proposed Double-MS DQN reinforcement learning method thus represents a critical advancement, providing a strong foundation for future intelligent, robust, and scalable adaptive cruise control systems in electric vehicles.

Future research directions include integrating explicit safety constraints within the proposed training framework; evaluating the methodology in physical real-world vehicles; expanding the simulation to include more complex, multi-agent mixed-traffic environments featuring cut-ins and lane-changing maneuvers; and exploring the combination of multi-step enhancements with recent semantic and language-guided RL approaches for even more robust and adaptive ACC systems.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (85.6MB, mp4)
Supplementary Material 2 (80.1KB, pdf)

Author contributions

Assem Meghawer contributed to the conceptualization, software implementation, data collection, and drafting of the original manuscript. Mohamed Fawzy El-Khatib was responsible for methodology design and formal analysis, and contributed to visualization and manuscript editing. Yasser I. El-Shaer and Mohamed I. Abu El-Sebah supervised the project and provided critical review and revisions to the manuscript. All authors have read and approved the final version of the manuscript.

Funding

This research received no specific grant from any funding agency.

Data availability

The data supporting the findings of this study are included in this paper.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Zhang, Y. et al. An intelligent adaptive cruise control using deep reinforcement learning and multi-objective reward shaping. Eng. Appl. Artif. Intell. 111, 104852 (2022).
  • 2. Yu, H. et al. Convergence-aware DRL for adaptive cruise with uncertain traffic patterns. Eng. Appl. Artif. Intell. 116, 105728 (2023).
  • 3. Zhang, W., Li, M. & Chen, H. Deep reinforcement learning for adaptive cruise control in complex driving scenarios. IEEE Trans. Intell. Transp. Syst. 23(2), 789–801 (2022).
  • 4. Ploeg, J. et al. Design and experimental evaluation of cooperative adaptive cruise control. IEEE Trans. Intell. Transp. Syst. 19(3), 939–950 (2018).
  • 5. Fridman, L., Brown, D. E., Lake, B. M. & Cox, D. D. Active learning for adaptive cruise control with electric vehicle energy optimization. IEEE Trans. Intell. Veh. 5(4), 591–602 (2020).
  • 6. Chalmers University of Technology. A comparison between MPC and PID controllers for education and research. Master’s thesis, Department of Signals and Systems (2014). Available: https://publications.lib.chalmers.se/records/fulltext/204637/204637.pdf
  • 7. El-Khatib, M. F., Khater, F. M., Hendawi, E. & Abu El-Sebah, M. I. Simplified and intelligent controllers for multi-input multi-output processes. Eng. Appl. Artif. Intell. 141, 109816 (2025).
  • 8. Integra Sources. Basics of PID controllers: Working principles, pros & cons (2023). Available: https://www.integrasources.com/blog/basics-of-pid-controllers-design-applications/
  • 9. Falcone, P., Borrelli, F., Tseng, H. E., Asgari, J. & Hrovat, D. Predictive active steering control for autonomous vehicle systems. IEEE Trans. Control Syst. Technol. 15(3), 566–580 (2007).
  • 10. Kural, E., Hacibekir, T. & Aksun-Guvenc, B. State of the art of adaptive cruise control and stop and go systems. arXiv preprint arXiv:2012.12438 (2020). Available: https://arxiv.org/abs/2012.12438
  • 11. Boddupalli, S., Rao, A. S. & Ray, S. Resilient cooperative adaptive cruise control for autonomous vehicles using machine learning. arXiv preprint arXiv:2103.10533 (2021). Available: https://arxiv.org/abs/2103.10533
  • 12. Wang, X., Zhao, L. & Li, H. Scenario-based reinforcement learning for autonomous vehicle control in uncertain environments. Eng. Appl. Artif. Intell. 119, 105791 (2023).
  • 13. Li, Z., Wang, D. & Li, R. Sim-to-real transfer in reinforcement learning-based driving policies using probabilistic disturbance models. Eng. Appl. Artif. Intell. 115, 105697 (2023).
  • 14. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015).
  • 15. van Hasselt, H., Guez, A. & Silver, D. Deep reinforcement learning with double Q-learning. In Proc. AAAI Conference on Artificial Intelligence 30(1), 2094–2100 (2016).
  • 16. El-Khatib, M. F., Sabry, M. N., Abu El-Sebah, M. I. & Maged, S. A. Hardware-in-the-loop testing of simple and intelligent MPPT control algorithm for an electric vehicle charging power by photovoltaic system. ISA Trans. (2023).
  • 17. Kiran, B. R. et al. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 23(6), 4909–4926 (2022).
  • 18. Wang, Y., Zhang, Q. & Yang, R. Safety-first reinforcement learning for adaptive cruise control. Sensors 24(8), 2657 (2024).
  • 19. Wu, Q., Zhang, X., Zhao, P. & Sun, L. Large language model guided deep reinforcement learning for autonomous driving. arXiv preprint arXiv:2412.18511 (2024).
  • 20. Zhou, Y., Zhang, K. & Zhang, C. Vision-language model reinforcement learning for autonomous vehicle decision-making. arXiv preprint arXiv:2412.15544 (2024).
  • 21. Bellemare, M. G., Dabney, W. & Munos, R. A distributional perspective on reinforcement learning. In Proc. ICML (2017).
  • 22. Liang, X., Vemula, A., Scherer, S. & Sycara, K. Multi-scenario reinforcement learning for safe and efficient autonomous driving. IEEE Trans. Neural Netw. Learn. Syst. 32(7), 3153–3167 (2021).
  • 23. Liu, Z. E. et al. Real-time energy management for HEV combining naturalistic driving data and deep reinforcement learning with high generalization. Appl. Energy 377, 124350 (2025).
  • 24. Liu, Z. E. et al. Deep reinforcement learning-based energy management for heavy-duty HEV considering discrete-continuous hybrid action space. IEEE Trans. Transp. Electrification (2024). 10.1109/TTE.2024.3363650
  • 25. Hua, M. et al. Multi-agent reinforcement learning for connected and automated vehicles control: Recent advancements and future prospects. IEEE Trans. Autom. Sci. Eng. (2025). 10.1109/TASE.2025.3574280
  • 26. Liu, Z. E., Zhou, Q., Li, Y., Shuai, S. & Xu, H. Safe deep reinforcement learning-based constrained optimal control scheme for HEV energy management. IEEE Trans. Transp. Electrification 9(3), 3866–3880 (2023).
  • 27. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A. & Koltun, V. CARLA: An open urban driving simulator. In Proc. 1st Conference on Robot Learning (CoRL), PMLR, 1–16 (2017).
  • 28. Rajamani, R. Vehicle Dynamics and Control, 2nd edn. (Springer, 2011).
  • 29. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction, 2nd edn. (MIT Press, 2018).
  • 30. International Organization for Standardization. ISO 15622: Intelligent transport systems – Adaptive cruise control systems – Performance requirements and test procedures. ISO Standard No. 15622 (2018). Available: https://www.iso.org/standard/73764.html
  • 31. Montgomery, D. C. Design and Analysis of Experiments, 8th edn. (Wiley, 2012).
  • 32. Kendall, A. et al. Learning to drive in a day. In Proc. 2019 International Conference on Robotics and Automation (ICRA), 8248–8254 (2019).
  • 33. S. S. et al. Safe and energy-efficient jerk-controlled speed profiling for on-road autonomous vehicles. IEEE Trans. Intell. Veh. 1–16 (2024).
  • 34. Hitachi Astemo. Development of an anti-jerk control to improve ride comfort of electrified vehicles. Hitachi News Release (2022).
  • 35. Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proc. ICML (2018).
  • 36. Fujimoto, S., van Hoof, H. & Meger, D. Addressing function approximation error in actor-critic methods. In Proc. ICML (2018).
