Summary
This review examines reinforcement learning (RL) methods for dynamic speed control in connected and autonomous vehicle (CAV) environments, covering variable speed limits, platooning, and speed harmonization. Focusing on studies from 2017 to 2025, it analyzes algorithmic choices (value-based, policy-gradient, actor-critic, and multi-agent RL), state-action design, and reward engineering, as well as deployment assumptions on communication, penetration rates, and mixed traffic. Simulation results generally indicate improvements in safety (≈8%–50%), traffic efficiency (≈7%–57%), fuel consumption (≈6%–20%), and throughput (≈12%–30%), with multi-agent approaches performing more robustly at moderate CAV penetration (30%–50%). However, benefits are highly scenario dependent and often rely on idealized communication, limited fleet sizes, and non-standardized evaluation. Real-world tests remain scarce and consistently underperform their simulated counterparts, highlighting a significant sim-to-real gap. The review identifies key research priorities in scalable multi-agent RL (MARL) architectures, safety-constrained learning, robust sim-to-real transfer, and standardized benchmarking to support deployment-oriented adoption of RL-based speed control in future CAV-enabled traffic systems.
Subject areas: applied sciences, engineering
Introduction
Traffic congestion on highways continues to be a serious and costly problem, especially in large cities. In the United States alone, delays and related inefficiencies were estimated to cost more than $151 billion in 2021.1,2 Similar problems appear in many other parts of the world, although the scale and causes differ. As summarized in Table 1, cities such as Addis Ababa and Kolkata experience long delays and increased emissions, but often for different reasons, including infrastructure limitations, traffic composition, and enforcement practices. These differences suggest that a single control strategy is unlikely to work equally well across all regions.
Table 1.
Annual delay hours, Carbon Dioxide (CO2) emission increase, and economic losses due to highway congestion in selected urban regions
| City/Region | Annual delay hours | CO2 emission increase | Economic losses | Comments |
|---|---|---|---|---|
| Addis Ababa, Ethiopia3 | 212 vehicle-hours (selected segments) | not specified | travel time = 74%; vehicle operating = 6%; unreliability = 20% of total cost | delay hours calculated for selected road segments; breakdown of economic loss components is available. |
| Kolkata, India4 | 4,160 h per commuter | not specified | not specified | high individual commuter delays reported; no CO2 or economic loss data provided. |
| Nairobi, Kenya5 | not specified | 25.3 million grams of carbon dioxide equivalent (gCO2e) annually; 73% from private cars | several million CFA francs per day | significant CO2 emissions from private vehicles; overall economic losses expressed in daily financial terms. |
| Ningbo, China6 | not specified | 15.5% reduction under optimal traffic distribution | not specified | focus on impact of traffic distribution on emissions; no detailed delay or cost data. |
| Grand Lome, Togo7 | up to 49.5 min lost daily (approx. 300 h/year) | not specified | several million CFA francs per day; 20%–42.94% of road segments frequently congested due to lack of alternative routes | daily delays translated to annual figure; economic losses due to congestion quantified by monetary loss and infrastructure limitations. |
The uneven impact across cities points to a need for traffic control approaches that can react to local conditions, rather than relying on fixed or heavily tuned rules. Connected and autonomous vehicle (CAV) technologies have started to make this kind of responsiveness more realistic. With real-time sensing and vehicle-to-vehicle and vehicle-to-infrastructure communications, vehicles can exchange information that was previously unavailable to traditional control systems. Recent advances in communication networks and edge computing have made such interactions faster and more reliable.8
Several control techniques have been explored within this context. Dynamic speed harmonization (DSH), for example, adjusts vehicle speeds to smooth traffic flow and reduce stop-and-go behavior, particularly in congested corridors.8 Cooperative adaptive cruise control (CACC) uses communication between vehicles to maintain spacing and improve stability within platoons.9 Some studies report that reinforcement learning (RL)-based CACC controllers can perform comparably to classical linear-quadratic approaches, but only when reward structures and training settings are carefully chosen.10 These results are encouraging, though they are often limited to controlled simulation setups.
RL has, therefore, been increasingly used as an added decision-making component in CAV-based traffic systems. Researchers have applied RL to problems such as speed limit adjustment, routing decisions, intersection control, and handling traffic disturbances.11,12,13,14 Multi-agent RL has also been used to coordinate actions across multiple vehicles and infrastructure elements, as conceptually shown in Figure 1.15 At the same time, reported performance gains differ substantially across studies, largely due to differences in modeling choices, training procedures, and evaluation environments.
Figure 1.
Conceptual RL-CAV framework for dynamic speed control on highways
The system integrates real-time sensor and V2X data, a reinforcement learning decision engine, vehicle-level control actions, and feedback from traffic flow conditions to optimize performance across safety, efficiency, and environmental dimensions. RL-CAV, reinforcement learning-connected and autonomous vehicle; V2X, vehicle-to-everything.
Even with recent advances, using RL at scale in CAV-based traffic systems remains difficult. From a technical standpoint, many RL approaches struggle when traffic becomes dense or when autonomous and human-driven vehicles share the road.12,16 A common issue is that experiments assume near-perfect communication or unrealistically high CAV participation rates. These assumptions simplify training and evaluation but do not align well with current traffic conditions.
Beyond technical concerns, non-technical barriers also play a role. Issues related to cybersecurity, including vulnerabilities in connected vehicle communication, remain largely unresolved.17 At the same time, vehicle-to-everything (V2X) standards are still evolving, and legal responsibility in automated driving scenarios is often unclear.18 These factors make it difficult to move RL-based control methods beyond experimental testing. Another recurring concern is the heavy reliance on simulation. Although simulation studies are useful for early development, the lack of real-world validation continues to limit confidence in how these methods would perform outside controlled environments.19,20
Rather than attempting to resolve these issues, this review focuses on examining how RL has been used so far for CAV speed control and where current approaches fall short. The discussion groups existing work according to the type of RL method applied and the traffic settings considered, and it examines how these choices affect performance, stability, and robustness. Attention is also given to how RL controllers interact with vehicle coordination mechanisms such as speed harmonization and cooperative motion. Although several surveys exist on CAV-based traffic control and, more broadly, on RL in transportation, they typically focus on general AV control, intersection management, or variable speed limits (VSLs) in isolation. In contrast, this review is restricted to RL for dynamic speed control in CAV settings (VSL, platooning, and speed harmonization) and places explicit emphasis on deployment readiness, including scalability, safety assurance, and sim-to-real transfer. The rapid growth of literature since 2020, particularly on multi-agent coordination and mixed-traffic scenarios, creates a timely need for this more focused consolidation.
While Table 1 summarizes reported economic and environmental costs of traffic congestion, these impacts vary considerably across cities due to differences in infrastructure capacity, vehicle composition, and traffic management practices. Cities with higher public transport usage, denser road networks, and more mature traffic management systems tend to exhibit lower congestion costs per vehicle, even under high demand conditions.1,2 In contrast, urban regions dominated by private vehicle use, limited lane management, and static control policies experience disproportionately higher economic losses and emissions associated with stop-and-go traffic.
Vehicle fleet composition further influences congestion impacts, as higher shares of heavy-duty vehicles and heterogeneous traffic flows amplify fuel consumption and emissions during congestion episodes.3,5 These differences indicate that congestion costs are not uniform and depend strongly on local infrastructure maturity and operational strategies. This distinction is important for policymakers when interpreting congestion cost estimates and assessing where adaptive and learning-based traffic control approaches may yield the greatest marginal benefits. Across the cities summarized in Table 1, differences in congestion costs can be traced to variations in roadway capacity and public transport provision, the relative share of private vehicles and freight traffic, and the extent to which active traffic management strategies are deployed.
The studies discussed in this review are mainly peer-reviewed publications between 2017 and 2025, drawn from sources such as IEEE Xplore, Scopus, and Web of Science. Selected examples are used to highlight recurring design choices, common evaluation practices, and unresolved challenges, rather than to provide an exhaustive survey of the field.
Background
This section explains the basic ideas behind using RL for speed control in CAV environments. It first describes how vehicles communicate and coordinate, then introduces the learning concepts typically used in traffic control, and finally discusses how these elements are applied in speed control systems.
CAV technologies
CAVs rely on a combination of sensing, communication, and automation to achieve coordination that conventional vehicles cannot support. Using vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications, vehicles are able to share information such as speed, position, and planned actions in real time. Broader V2X communication further extends this exchange to include infrastructure systems, other road users, and network services.21,22,23 V2V communication mainly supports coordination between nearby vehicles, while V2I links vehicles to traffic signals and roadside units. V2X communication expands the scope of information available to vehicles and supports more informed control decisions.23
How effective these capabilities are depends largely on how many CAVs are present in the traffic stream. When adoption is low, CAVs tend to influence traffic only locally by smoothing small speed variations. At higher penetration levels, more coordinated behaviors, such as platooning, begin to appear. When most vehicles are connected and automated, traffic flow can become more stable across a wider area.24,25 CAVs also enable control strategies such as “speed warning systems” for CAVs and CACC, which are intended to reduce stop-and-go driving and improve overall stability.21,26 Compared with traditional VSL systems, which often depend on fixed infrastructure and driver compliance, CAV-based methods can adjust speeds more quickly and at a finer spatial scale, with less reliance on roadside equipment.23,27
RL fundamentals for traffic control
RL provides a framework for learning control policies through interaction with dynamic and uncertain environments.28,29 In traffic applications, problems are commonly modeled as Markov decision processes (MDPs). States may include traffic density, vehicle speeds, or communication status, while actions typically involve speed changes, spacing control, or lane selection. Reward functions are designed to reflect safety, efficiency, comfort, and energy-related objectives.30
RL approaches in CAV applications can be broadly categorized as centralized or decentralized. Centralized methods aim to optimize system-wide performance but often face scalability and robustness issues. Decentralized methods allow individual vehicles to operate more independently, though coordination among agents becomes more challenging. Multi-agent reinforcement learning (MARL) has been used to balance these trade-offs by allowing agents to learn locally while sharing information or operating within structured coordination schemes. Deep RL techniques, including Deep Q-Network (DQN), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Multi-Agent Deep Deterministic Policy Gradient (MADDPG), are frequently applied to handle the non-linear and high-dimensional nature of traffic systems.31,32
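To make the value-based case concrete, the following minimal sketch (PyTorch) shows a single DQN temporal-difference update over a discrete set of speed-limit actions. The state layout, action menu, network sizes, and hyperparameters are illustrative assumptions, not settings drawn from any study reviewed here.

```python
# Minimal DQN update sketch for discrete speed-limit selection.
# All dimensions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM = 4   # e.g., [mean segment speed, density, downstream speed, occupancy]
N_ACTIONS = 5   # e.g., speed limits {60, 70, 80, 90, 100} km/h
GAMMA = 0.95    # discount factor

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())  # periodically synchronized copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One temporal-difference step on a batch of transitions (s, a, r, s', done)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for taken actions
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values      # max_a' Q_target(s', a')
        target = r + GAMMA * q_next * (1.0 - done)         # bootstrap unless terminal
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```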
Integrated RL-CAV framework for speed control
When combined, CAV technologies and RL methods form a closed-loop control system that adapts continuously to changing traffic conditions. As shown conceptually in Figure 1, sensor data and V2X information are processed by RL-based controllers, which generate control actions such as acceleration, deceleration, or spacing adjustments. These actions are applied in real time with the goal of improving safety, traffic flow, and environmental performance. In some implementations, RL is used alongside techniques such as model predictive control or safety potential fields to account for future states and collision risk.33
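The closed loop described above can be summarized in a few lines of Python; `env` and `policy` are hypothetical stand-ins for a traffic simulator interface and a trained RL controller, not components of any reviewed system.

```python
# Minimal sketch of the sense-decide-act loop in Figure 1. The `env` and
# `policy` interfaces are assumptions for exposition only.
def control_loop(env, policy, horizon=3600):
    """Run one episode: observe the V2X-informed traffic state, issue a
    speed/acceleration command, and receive feedback reflecting safety
    and flow objectives."""
    state = env.reset()                          # initial observation
    for _ in range(horizon):
        action = policy(state)                   # e.g., acceleration command
        state, reward, done = env.step(action)   # traffic responds to the action
        if done:                                 # episode ends (horizon or incident)
            break
```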
Despite these capabilities, deployment outside simulation remains limited. Challenges include maintaining stable behavior in safety-critical situations, meeting real-time computational constraints, and ensuring consistent performance under varying environmental and traffic conditions. Human-related factors, such as user trust and interactions with human-driven vehicles, also affect system behavior. In addition, regulatory requirements related to communication standards, controller certification, and cybersecurity continue to influence the pace of adoption.
Performance evaluation framework
Performance evaluation of RL-based CAV speed control relies on multiple metrics reflecting different traffic management objectives. Safety is commonly assessed using indicators such as time-to-collision, post-encroachment time, and braking intensity. Efficiency is measured through throughput, travel time, and capacity utilization, while environmental effects are evaluated using fuel consumption, emissions, or energy use.
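As an illustration of how one widely used safety indicator is computed, the sketch below implements the time-to-collision (TTC) surrogate for a leader-follower pair; the critical threshold mentioned in the comment reflects common practice (roughly 1.5–3 s) rather than a value fixed by the studies reviewed.

```python
# Minimal sketch of the time-to-collision (TTC) surrogate safety metric.
# Variable names and the threshold convention are illustrative assumptions.
def time_to_collision(gap_m: float, v_follower: float, v_leader: float) -> float:
    """TTC in seconds: the gap divided by the closing speed.
    Defined only when the follower approaches the leader (closing speed > 0);
    returns infinity otherwise, i.e., no projected collision."""
    closing_speed = v_follower - v_leader        # m/s
    if closing_speed <= 0.0:
        return float("inf")
    return gap_m / closing_speed

# Example: a 25 m gap closed at 5 m/s gives a TTC of 5 s; values below a
# chosen threshold (often 1.5 to 3 s) are counted as safety-critical events.
assert time_to_collision(25.0, 30.0, 25.0) == 5.0
```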
Current deployment levels indicate that these technologies are still at an early stage. Although V2X protocols have been defined, large-scale implementation remains limited, and most commercially available systems operate at Society of Automotive Engineers (SAE) level 2 or 3 automation. As a result, RL-based control strategies are still validated primarily through simulation studies. Broader deployment requires extensive field testing, regulatory approval, and stronger safeguards against communication and cybersecurity risks.
RL problem formulation
Building on the foundational concepts discussed in the previous subsections, this subsection formalizes the application of RL to CAV speed control. It presents the mathematical structure, algorithmic design elements, and implementation principles that constitute the basis for the approaches evaluated in later sections.
Mathematical framework for CAV speed control
In many studies, CAV speed control using RL is framed as an MDP.34 This formulation is mainly used to describe how control decisions are made repeatedly as traffic conditions change over time. The MDP is usually expressed as a tuple (S,A,P,R,γ), although the exact definitions differ across implementations.
The state space S generally includes information about traffic and vehicle conditions, such as vehicle speeds, positions, acceleration, local density, and whether communication is available. The action space A corresponds to the control inputs applied by the vehicle, most often acceleration or deceleration, but, in some cases, it also corresponds to lane changes or spacing adjustments. State transitions P describe how traffic conditions evolve once these actions are applied. The reward function R is used to assess outcomes based on factors such as safety, traffic flow, fuel or energy use, comfort, and stability. The discount factor γ determines how much future outcomes influence current decisions.
Using this structure allows control policies to be learned through repeated interaction with traffic dynamics. However, the effectiveness of the resulting policy depends largely on how states, actions, and rewards are defined in practice.
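To make this concrete, the objective implied by the tuple can be written explicitly; the notation below is the standard MDP formulation rather than that of any single reviewed study.

```latex
% Discounted return accumulated from time t
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, R(s_{t+k}, a_{t+k})

% The learning goal: a policy maximizing expected return
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[ G_t \right]

% Bellman optimality condition exploited by value-based methods
Q^{*}(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ \max_{a'} Q^{*}(s', a') \right]
```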
Algorithmic approaches and selection criteria
A range of RL algorithms have been used for CAV speed regulation, each suited to different task formulations and state-action characteristics. Value-based methods, including Q-learning and DQN, compute estimates of expected returns for states or actions.35 Policy-based approaches, such as REINFORCE and PPO, directly parameterize decision policies and update them through gradient feedback.36 Actor-critic algorithms, such as Asynchronous Advantage Actor-Critic (A3C) and DDPG, combine value estimation with direct policy optimization to improve stability and performance.37 Model-based RL complements these strategies by incorporating explicit traffic models to guide planning.
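For reference, the two update rules that anchor these families can be sketched as follows; both are textbook forms rather than implementations from a specific reviewed paper.

```latex
% Tabular Q-learning (value-based), with learning rate \alpha
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

% REINFORCE (policy-based): gradient ascent on the expected return J(\theta)
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\pi_{\theta}}\!\left[ G_t\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right]
```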
For continuous, high-dimensional control tasks typical in CAV scenarios, policy-gradient and actor-critic approaches (e.g., PPO and DDPG) are frequently adopted. MARL supports cooperative driving behaviors, including platooning, merging, and conflict mitigation, by enabling coordinated decision-making among vehicles.38 Table 2 provides a concise overview of common RL approaches and their conceptual characteristics within CAV speed control.
Table 2.
Comparison of reinforcement learning algorithms for CAV speed control
| Algorithm | Control type | Typical role | Strengths | Limitations |
|---|---|---|---|---|
| Q-Learning | discrete | value-based estimation of returns (e.g., speed limit selection) | simple, interpretable value updates | not directly suited to continuous control |
| DQN | discrete | value estimation over high-dimensional state spaces | supports large state spaces | convergence can be unstable in highly dynamic settings |
| REINFORCE | discrete | direct policy learning | conceptual simplicity | high gradient variance |
| PPO | continuous | stable learning in dynamic settings | training stability; real-time viability | sensitive to hyperparameter tuning |
| DDPG | continuous | high-dimensional, continuous control (e.g., acceleration) | common in CAV applications | requires careful reward shaping |
| A3C | mixed | combined value-policy optimization | multi-objective handling; fast parallel training | on-policy sample inefficiency |
| MARL (e.g., MADDPG) | continuous, multi-agent | multi-CAV coordination (platooning, network-level control) | enables cooperative behaviors | scalability and communication overhead |
State and action space design consideration
How states and actions are defined has a large effect on learning performance. State representations typically include kinematic variables and spacing information, while more complex setups may add congestion indicators or signal timing. Including too many variables can slow learning, while overly simple states may miss important interactions.
Actions may be defined as discrete levels or continuous signals. Discrete actions simplify learning but limit control precision. Continuous actions allow smoother behaviors but increase sensitivity to training settings. Timing also matters because some decisions require fast updates and others operate at a slower scale.
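A minimal sketch of the two formulations, using the Gymnasium space types as stand-ins; the bounds, units, and discretization levels are illustrative assumptions rather than values from a reviewed study.

```python
# Contrasting discrete and continuous action definitions for longitudinal
# speed control; all bounds and granularities below are assumptions.
import numpy as np
from gymnasium import spaces

# Discrete formulation: a fixed menu of acceleration levels (m/s^2).
ACCEL_LEVELS = np.array([-3.0, -1.5, 0.0, 1.5, 3.0])
discrete_actions = spaces.Discrete(len(ACCEL_LEVELS))

# Continuous formulation: any acceleration within physical limits.
continuous_actions = spaces.Box(low=-3.0, high=3.0, shape=(1,), dtype=np.float32)

# A state vector combining kinematics and spacing, as described above:
# [ego speed (m/s), gap to leader (m), leader speed (m/s), local density (veh/km)]
observation = spaces.Box(
    low=np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float32),
    high=np.array([40.0, 200.0, 40.0, 150.0], dtype=np.float32),
)
```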
Multi-objective reward function engineering
Reward functions are used to guide learning toward desired behavior, but they are also a common source of problems. Typical objectives include improving traffic flow, maintaining safety, and avoiding uncomfortable driving behavior. These goals often conflict, which makes reward weighting difficult.
Some studies use reward shaping to speed up training, but poorly designed rewards can lead to unintended strategies. As a result, reward design remains one of the least standardized aspects of RL-based speed control.
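As an illustration of the weighting problem, the sketch below combines efficiency, safety, and comfort terms into a scalar reward; the component definitions, thresholds, and weights are assumptions for exposition, not values from a particular study.

```python
# Minimal sketch of a weighted multi-objective reward for speed control.
# Component forms, normalizations, and weights are illustrative assumptions.
def speed_control_reward(speed, target_speed, ttc, jerk,
                         w_eff=1.0, w_safe=2.0, w_comfort=0.5):
    """Combine efficiency, safety, and comfort terms into one scalar."""
    # Efficiency: penalize deviation from the desired (harmonized) speed.
    r_eff = -abs(speed - target_speed) / max(target_speed, 1e-6)
    # Safety: penalize time-to-collision below an assumed 3 s threshold.
    r_safe = -max(0.0, 3.0 - min(ttc, 3.0)) / 3.0
    # Comfort: penalize large jerk (rate of change of acceleration).
    r_comfort = -abs(jerk) / 10.0
    return w_eff * r_eff + w_safe * r_safe + w_comfort * r_comfort
```

Because the weights trade safety against efficiency and comfort, small changes in them can produce qualitatively different learned behaviors, which is one reason reward design remains poorly standardized.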
Learning strategies and deployment considerations
Most policies are trained offline using simulation to avoid safety risks. Online learning is less common and usually constrained by safety rules. To address differences between simulation and real traffic, researchers often use transfer learning or sim-to-real techniques, though these methods are still limited in scope.
Integration framework for CAV systems
RL controllers may operate at different levels within a CAV system, ranging from individual vehicle control to coordination among groups of vehicles or network-level regulation. Integration requires attention to communication delays, computational limits, and interactions with existing control layers.
Results
This section examines how RL has been applied to dynamic speed control in CAV environments. Rather than summarizing studies one by one, it focuses on how methods have changed over time, where they have been applied, and what kinds of limitations repeatedly appear. Attention is given to patterns that affect performance, as well as areas where current approaches remain only partially tested.
Methodological trends in RL-based speed control
Earlier work on RL-based CAV speed control relied heavily on value-based methods such as Q-learning and DQNs, particularly in VSL settings. Studies such as those of Vrbanić et al. and Ko et al.39,40 have primarily focused on demonstrating the feasibility of applying RL to speed control problems, rather than on optimizing performance under realistic conditions. These early efforts treated traffic control as a discrete decision-making task, where actions such as speed limit selection or lane control were chosen from a fixed set of options. As a result, evaluations were commonly conducted under simplified assumptions, including homogeneous traffic composition, stable demand patterns, and ideal communication between vehicles and infrastructure. While this line of work played an important role in establishing baseline performance and validating RL as a viable control framework, it provided limited insight into how such methods would scale or adapt under more complex, dynamic, and uncertain traffic environments.
More recent studies have shifted toward policy-based and actor-critic methods, including DDPG and PPO, which are better suited to continuous control problems such as acceleration and spacing. Since around 2020, MARL has become increasingly common, especially in scenarios that require coordination across multiple vehicles. Some recent work incorporates social or cooperative reward structures, reporting substantial reductions in collision rates under simulated urban conditions.41 While these results are promising, they also rely on assumptions that may not hold in real traffic, particularly with respect to communication reliability.
Algorithm choice is closely tied to the application domain. VSL studies continue to favor value-based approaches, largely because control actions are discrete and driven by infrastructure, such as selecting speed limits for predefined road segments.42 These settings align well with value-based formulations that operate over finite action spaces and centralized decision structures. In contrast, platooning and motion coordination problems typically rely on policy-based or actor-critic methods, where vehicles must continuously adjust acceleration, spacing, and relative speed in response to surrounding traffic. In these cases, continuous control is unavoidable, and policy-based frameworks offer greater flexibility in handling coupled vehicle dynamics and real-time interactions.
Compared to rule-based or model-driven controllers, RL allows speed control strategies to adapt to changing traffic conditions without requiring explicit system models. At the same time, this flexibility introduces sensitivity to training conditions and reward design. It is useful to separate the benefits that come from CAV technology itself from those that can be attributed to RL. Connectivity and automation already provide advantages such as real-time information exchange and precise control execution, which can improve traffic performance, even when conventional control methods are used. RL adds value mainly through its ability to learn control strategies from interaction, cope with delayed and competing objectives, and adapt over time without manual retuning. Evidence from studies that compare RL-based controllers with model-based approaches under the same CAV assumptions suggests that the advantages of RL are context dependent.35,43 In scenarios with higher uncertainty or changing conditions, RL often shows more robust performance. In contrast, when system dynamics are well understood and remain stable, classical controllers can achieve similar or sometimes better results.
The literature shows a clear methodological progression in RL-based speed control research. Early work largely relied on single-agent, value-based formulations designed for localized or infrastructure-centric control problems. As computational capacity increased and coordination became more central, research shifted toward policy-based and actor-critic methods capable of continuous control. More recently, MARL has emerged as the dominant paradigm, reflecting the growing recognition that speed control is inherently a coordination problem involving interacting vehicles rather than isolated decision-makers.6,44 This evolution has been driven not only by algorithmic advances but also by increasing attention to mixed-traffic conditions, scalability constraints, and deployment realism.45 Despite this progress, gaps remain in understanding how these complex frameworks perform under low penetration rates, imperfect communication, and heterogeneous driver behavior. Recent survey work further reinforces this shift toward coordinated control. For example, Yao et al.46 provide a comprehensive review of cooperative intersection management, covering control strategies from vehicle-level trajectory planning to network-level coordination. Their analysis highlights that future performance gains increasingly depend on integration across control layers, rather than isolated algorithmic improvements. Other recent work has begun addressing scalability and road heterogeneity in VSL control through hierarchical and heterogeneous learning structures. For example, Li et al.47 proposed a heterogeneous-agent MARL framework with curriculum learning, showing substantial reductions in training time and total travel time, along with improved safety, in large-scale, multi-bottleneck freeway scenarios. Complementary results have been reported for RL-based headway control of autonomous vehicles across multiple consecutive freeway bottlenecks, where dynamically assigning section-specific headways improved traffic performance and reduced system delay by roughly 19%–22% under high-demand conditions.48
Application domains
Three application areas dominate the literature: VSLs, cooperative platooning, and speed harmonization. In the VSL domain, RL-driven strategies show improvements in network throughput and travel time, with significant scaling effects observed at higher CAV penetration levels.49 Importantly, studies consistently show that RL-based VSL integrated with CAV actuation outperforms infrastructure-only VSL deployments, especially in responsiveness and compliance.
Platooning applications demonstrate greater variability. Reported gains range from modest improvements to substantial coordination benefits. Performance differences correlate with communication assumptions, and systems assuming perfect V2V connectivity consistently show stronger results than those modeling latency or packet loss.50 This reveals a methodological gap: most published studies evaluate platooning performance under optimized communication conditions.
Speed harmonization is comparatively less mature, though findings consistently report safety benefits. Most studies have evaluated localized bottleneck relief, leaving network-wide speed coordination largely unexplored.51,52 Limited standardization in test scenarios further constrains cross-study comparisons.
Evaluation framework
Evaluation practices vary substantially. Although safety (time-to-collision and collision rates) and efficiency measures (throughput and travel time) are widely reported, environmental metrics including emissions, fuel consumption, and energy usage appear less frequently, suggesting that environmental objectives are under-weighted in performance assessment.
Simulation environments lack standardization, with SUMO and VISSIM being common platforms, and numerous studies employing custom traffic simulators.53,54 Reproducibility remains a key limitation: few studies provide publicly available code or full experimental configurations. This limits comparative validity and makes replication difficult.
Real-world validation remains rare. The few studies that have compared simulation performance against real-traffic outcomes consistently report performance degradation when transitioning to physical environments, reinforcing concerns around sim-to-real transfer.55 This gap remains one of the most critical barriers to deployment. Recent work has also examined how autonomous vehicle behavior itself influences mixed traffic dynamics. Using a cellular automata framework, Wu et al.56 showed that selfish car-following and lane-changing behaviors of AVs can improve average speed and throughput at higher penetration rates, while having limited impact under dense traffic conditions. These findings highlight that behavioral assumptions about AV aggressiveness can materially affect reported performance outcomes in mixed traffic simulations.
Beyond performance degradation, many simulation-only studies rely on idealized assumptions that further limit real-world transferability. Common simplifications include perfect sensing, deterministic driver responses, and unrealistically high CAV penetration rates, which mask failure modes that emerge under real traffic variability. As a result, reported gains often reflect simulator fidelity rather than controller robustness. The lack of standardized stress testing across communication noise, partial observability, and heterogeneous driver behavior makes it difficult to assess whether proposed RL controllers are deployment ready, reinforcing the need for more critical interpretation of simulation-based results. Consequently, many reported performance gains should be interpreted as upper-bound estimates rather than deployment-ready outcomes, particularly in studies that assume ideal sensing, perfect communication, or uniform compliance.
Research trends and maturation trajectory
Three developmental phases can be identified. Early research (2017–2019) served as validation that RL-based approaches were technically viable, commonly using simplified network conditions with fully autonomous traffic. The expansion phase (2020–2022) emphasized MARL coordination and richer state spaces, supported by rapid advances in deep RL and increasing CAV system maturity. Current research (2023–present) focuses on deployment readiness, particularly under mixed-traffic conditions, communication uncertainty, and safety assurance constraints.
Across phases, priorities have shifted from maximizing efficiency toward improving robustness, adaptability, and compliance with real-world constraints. Newer studies model partial penetration, irregular traffic conditions, and stochastic disturbances, reflecting progress toward realistic deployment conditions.
Research is geographically concentrated: a large share of published work originates from a limited number of regions and research communities.57 Given cultural variations in driving patterns and heterogeneous regulatory environments, this concentration poses external validity limitations for global deployment.
Critical gaps and limitations
Several gaps persist. Scalability validation is limited: many studies employ fewer than 50 agents, despite highway-scale deployment involving thousands of vehicles. This represents a fundamental bottleneck in establishing reliability at practical scales.
Safety assurance frameworks remain underdeveloped. Although nearly all studies claim safety improvements, very few include formal safety verification, stability analysis, or evaluation under adversarial failure conditions. The lack of structured fail-safe analysis presents challenges for regulatory certification of learned controllers.
Human-in-the-loop dynamics also remain largely absent. Most studies assume idealized car-following models, and almost none have examined human acceptance, behavioral adaptation, or psychological drivers of compliance. These factors will significantly influence deployment phases dominated by mixed autonomy. Related RL work in social robotics explicitly models emotion and memory to shape user-perceived behavior quality, illustrating how affective and cognitive factors can be integrated into policy design.58
Finally, integration with existing traffic infrastructures is limited. Most RL deployments assume new infrastructure deployment, rather than interfacing with legacy signaling systems, enforcement protocols, or heterogeneous roadside communication layers. The misalignment between research environments and existing operational ecosystems constrains real-world adoption prospects.
To synthesize the methodological patterns and performance trends identified across the reviewed literature, Table 3 summarizes representative RL-based studies on speed control in connected and CAVs. The table highlights variations in algorithm selection, deployment scope, evaluation environments, and reported performance improvements, while also revealing the underlying assumptions that may limit real-world applicability. By comparing studies across different penetration rates, coordination strategies, and evaluation metrics, this synthesis provides an integrated perspective on how RL-based approaches perform under diverse traffic conditions and technical constraints. The comparative analysis further examines selected implementations in greater detail, illustrating their design choices, observed outcomes, and practical limitations for real-world deployment.
Table 3.
Comparative summary of RL-based CAV speed control and related traffic control studies
| Study | RL type used | Application domain | CAV penetration level | Environment used | Metrics evaluated | Reported outcomes | Key assumptions |
|---|---|---|---|---|---|---|---|
| Vrbanić et al.49 | Q-learning | VSL regulation | mixed traffic | traffic simulation (e.g., SUMO-type) | speed variance, flow, compliance | improved speed homogeneity and better VSL compliance vs. fixed rules | idealized sensing and actuation; limited disturbance modeling |
| Ko et al.40 | DQN | VSL for freeway safety | mixed scenarios | microscopic simulation | crash risk, lane speed variance | reduced crash surrogates and smoother speed profiles | fixed network topology; simplified driver behavior |
| Kang et al.59 | deep RL | safety-oriented VSL | 30%–70% CAV | VISSIM | crash likelihood, disturbance propagation | >50% crash-risk reduction in studied scenarios | perfect V2I execution; no communication failures |
| Rhanizar et al.42 | Q-learning variants | VSL control | multiple demand levels | simulation | travel time, compliance | reduced delay and improved VSL responsiveness | no packet loss; limited heterogeneity |
| Narasimhan et al.13 | RL (deep) | dynamic speed limit control | mixed | simulation | throughput, congestion levels | throughput gains in congested segments | stationary demand assumptions; simplified incident patterns |
| Dong et al.35 | DQN | headway/speed control | mixed autonomy | SUMO-like environment | acceleration variance, headway stability | reduced speed oscillations compared with rule-based control | no explicit noise in sensing/actuation |
| Wang et al.60 | RL with MDP formulation | highway speed regulation | mixed | simulation | time loss, travel time, stability | robust adaptive response to changing conditions | model structure known; approximated transition dynamics |
| Kusari et al.53 | deep RL | V2I-driven speed control | moderate | custom simulation platform | flow, density, delay | improved coordination at network bottlenecks | infrastructure always available and reliable |
| Al-Msari et al.54 | RL | CAV-based control (signal/speed related) | mixed | custom/commercial simulator | queues, delay, throughput | network-level performance improvements over fixed-timing or static policies | limited modeling of communication noise |
| Guo et al.24 | RL | platooning/control under increasing CAV penetration | low to high | simulation | throughput, stability | efficiency gains, particularly at high CAV penetration | simple car-following models for human vehicles |
| Ghiasi et al.25 | RL | speed advisory | 10%–100% CAV | SUMO | delay, travel time, flow | noticeable gains at medium and high penetration | perfect compliance with advisory speeds |
| Fu et al.26 | RL-assisted control | SWSCAV/speed warning | mixed | simulation | safety indicators, speed distributions | reduced unsafe speed deviations | assumes reliable warning reception |
| Kušić et al.27 | RL + VSL | infrastructure-based speed control | N/A or low CAV | microscopic simulation | travel time, congestion index | improvements over static VSL strategies | no explicit CAV/human interaction modeled |
| Taghavifar et al.41 | socially aware MARL | urban speed harmonization | mixed urban fleets | custom microsimulation | TTC, collision rates, interaction stability | ∼55% reduction in collisions in dense urban scenarios | fixed social behavior models; synchronized agent updates |
| Balador et al.50 | MARL | platooning speed coordination | full CAV | simulation-based evaluation | fuel use, spacing, TTC | higher cohesion and improved fuel efficiency | ideal V2V connectivity; negligible latency |
| Ha et al.51 | RL | bottleneck speed harmonization | low-medium CAV | SUMO | queue length, throughput | reduced queues and improved throughput at merge areas | limited variation in demand patterns and driver behavior |
| Yang et al.52 | RL | speed harmonization near bottlenecks | low | microsimulation | delay, flow, speed variance | throughput improvement and reduced stop-and-go waves | narrow corridor scenarios; no large-scale network effects |
| Hussain et al.55 | deep RL | real-world traffic control/validation | field deployment | real-world trials | travel time, disturbance spread | performance degradation relative to simulation; shows sim-to-real gap | high traffic randomness; partially observed human behavior |
| Irshayyid et al.32 | deep RL | real-time coordination for CAV | moderate | simulation | delay, stability, queue metrics | more consistent lane movement vs. traditional control | limited modeling of extreme disturbances |
| Shi et al.61 | MADDPG (MARL) | multi-agent platooning | full CAV | simulation | TTC, spacing, fuel | stable platoon formations under tested conditions | no communication dropouts; ideal sensing |
| Han et al.38 | multi-agent RL | merging and platoon coordination | mixed autonomy | simulation | TTC, lane-change conflicts, fuel | safer merging and smoother integration into mainline | uniform communication reliability; stylized driver behavior |
| Fang et al.37 | actor–critic | cooperative CAV motion | multi-lane corridors | simulation | throughput, speed smoothness | smoother acceleration and fewer oscillations | reward shaping tuned manually; limited scenario diversity |
| Sharma et al.36 | PPO (policy-based) | individual CAV speed control | mixed | simulation | comfort (jerk), speed stability | improved comfort and smoother trajectories | fully observable state assumed; no sensor faults |
| Peng et al.15 | MARL | CAV–infrastructure cooperation | mixed | simulation | network throughput, delays | improved network-wide coordination vs. decentralized baselines | centralized training; assumes shared information is accurate |
| Tayab et al.14 | RL | disturbance-aware velocity tracking | mixed | simulation | tracking error, recovery time | better disturbance rejection than classical controllers | simplified disturbance models; ideal actuation |
| Ma & He62 | RL (route/traffic choice) | route optimization with speed component | mixed | simulation | travel time, path efficiency | reduced travel times via adaptive routing | static network; limited incident modeling |
| Zhou et al.63 | RL | intersection/speed coordination | mixed | simulation | delay, queue length | lower intersection delays compared to fixed-time control | – |
This pattern points to a trade-off between algorithm complexity and experimental realism. Studies that use more advanced learning architectures often depend on stronger assumptions, particularly regarding reliable communication and uniform traffic behavior. As a result, improvements are frequently demonstrated under conditions that are easier than those expected in real deployments. Reducing this gap will likely require shifting evaluation practices toward scenarios that include degraded communication, heterogeneous traffic, and other non-ideal conditions, even if reported performance gains are small.
Discussion
This section examines a set of representative implementations to illustrate how RL-based speed control is actually designed and tested in CAV environments. The cases are used to show different modeling choices, control assumptions, and evaluation practices, rather than to suggest a single best-performing approach. Together, they provide concrete examples of how RL controllers behave under different traffic conditions and deployment settings.
Case selection criteria
The case studies discussed here were chosen to reflect a range of implementation strategies and evaluation setups that are relevant to deployment-oriented research. Selection emphasized studies that introduced non-trivial control designs, reported results using multiple performance metrics, and considered traffic scenarios beyond fully autonomous conditions. Preference was also given to work that documented communication assumptions and provided sufficient methodological detail to allow interpretation of results. While the cases are not exhaustive, they capture the main application areas currently explored in RL-based speed control and span a range of CAV penetration levels.
VSL control
Case study 1: Multi-agent dynamic zone placement
Vrbanić et al.64 proposed a VSL strategy in which control zones are not fixed in space but instead shift in response to observed traffic conditions. Their approach relies on speed gradients estimated from CAV data, treating vehicles as mobile sensors, rather than depending on predefined roadside locations. A Q-learning controller is used to decide both where speed limit zones should be placed and what speeds should be assigned.
Unlike traditional VSL systems that operate at a single location, this formulation allows multiple zones to be active simultaneously. Simulation results showed consistent reductions in total time spent (TTS) across all tested penetration rates, with a notable 7.6% reduction at 30% CAV penetration; the benefit remained observable even at low penetration (10%), indicating resilience under early-stage deployment conditions. Because CAVs serve as both sensing and actuation agents, the approach eliminates reliance on physical signage, making it scalable to long, uninstrumented highway stretches with limited infrastructure.
Case study 2: Differential multi-agent VSL control
Han et al.65 developed a multi-agent VSL strategy in which each lane is controlled independently. Rather than applying uniform speed limits across all lanes, the system adjusts speeds based on lane-level traffic conditions. Training is centralized to capture interactions between lanes, while execution is decentralized.
Across mixed-traffic experiments, MARL-DVSLC (differential variable speed limit control) provided significant improvements over prior approaches. Relative to DDPG-based DVSLC, the system reduced TTS by 12.88% under stable demand; compared with traditional feedback-based uniform VSLC, improvements reached 21.34%. Moreover, congestion propagation duration decreased by 64.2%, and the spatial congestion footprint reduced by 54.7%, demonstrating meaningful operational benefits.
These results suggest that lane-level differentiation can help reduce the buildup of disturbances, although the approach assumes reliable communication and detailed lane-specific measurements.
Platooning and cooperative control
Case study 3: Federated learning for multi-vehicle coordination
Several recent studies have explored federated RL as a way to coordinate multiple vehicles without centralized data collection.66,67 In these setups, vehicles train local policies using onboard data and periodically share model parameters. Zeng et al.66 reported faster convergence compared with earlier federated methods, as well as stable speed tracking under changing demand.
Within platooning scenarios, Ameur et al.67 showed that federated learning can support coordinated behavior without requiring full data sharing. Related work also demonstrated stabilization effects using decentralized learning with limited or no explicit V2V communication. These approaches appear promising in situations where communication quality varies, although they remain largely evaluated in simulation.
Taken together, these studies indicate that federated RL may offer a practical option for scaling control strategies while limiting data sharing. This appears particularly relevant in early deployment stages, where vehicle participation is uneven, and reliable direct communication between all agents cannot be assumed.
Case study 4: Stop-and-go wave damping with a single CAV
Jiang et al.68 investigated whether a single CAV could reduce stop-and-go waves in mixed traffic. The controller was trained using real trajectory data and tested with one automated vehicle embedded among human-driven vehicles. The results demonstrated substantial dampening of oscillatory patterns, including a 54% reduction in speed fluctuations for the controlled CAV and an 8%–28% improvement among following vehicles. Fuel-consumption outcomes followed similar improvement trends. These findings established that a strategically positioned CAV can serve as a traffic stabilizer, even at limited market penetration rates, providing a compelling early-deployment use case.
Similar findings have been reported in related studies: model-free controllers achieved near-total wave dissipation with only ∼10% CAV adoption,69 and modular deep RL architectures yielded up to 57% improvement in system-wide velocities with approximately 4%–7% adoption.70 Together, these studies position wave dampening as a comparatively mature and immediately deployable application domain for RL-enabled CAV systems.
Speed harmonization
Case study 5: Lane-differentiated speed harmonization
Hua and Fan8 examined a speed harmonization approach based on DDPG that allows different speed targets to be applied across lanes, rather than enforcing a single uniform limit. The controller was tested under varying traffic densities and a range of conditions, including adverse weather. The results showed that the approach improved both safety and mobility when a moderate share of vehicles were connected and automated. At a penetration rate of 50%, reported outcomes included a 17.1% reduction in collision probability, a 9.62% decrease in average travel time, and a 15.16% reduction in overall time loss compared to scenarios without control.
A key feature of the method is its lane-specific treatment of traffic. Higher speeds were maintained in overtaking lanes, typically in the range of 65–70 mph, while neighboring lanes were slowed as needed. This asymmetric adjustment helped limit the formation of stop-and-go waves near merge areas and prevented speeds from dropping too sharply, even under heavy congestion. The results suggest that treating lanes differently can be more effective than uniform harmonization, particularly in mixed-traffic settings.
Case study 6: Integrated merge control and harmonization
Ko et al.71 proposed a dual-network DQN setup in which two separate controllers were used: one handled lane-merging decisions, while the other adjusted vehicle speeds upstream of bottlenecks. Rather than treating merging and speed control as a single task, the two components operated in parallel and influenced traffic behavior in different ways. In simulation, the approach led to higher throughput and lower fuel consumption than conventional late-merge strategies, with reported improvements of about 30% and 20%, respectively.
The reported benefits were strongest when traffic demand changed quickly or when only a moderate fraction of vehicles followed the learned policies. This suggests that the method can be effective, even when CAV penetration is limited. The results also indicate that noticeable network-level improvements can be achieved by actively controlling only a subset of vehicles and do not require full participation across the traffic stream.
Real-world implementations
Case study 7: Online RL for trajectory tracking
Köpf et al.71 implemented an online RL controller for longitudinal speed tracking in a production vehicle environment. The controller adapted its policy in real time and compensated for partial observability using signal reconstruction techniques. Compared with fixed controllers, the RL-based approach showed more consistent tracking behavior.
This study is notable because it moved beyond simulation and demonstrated feasibility in an operational setting, although the scope of control was limited.
Case study 8: Model-based policy search for real-vehicle control
Puccetti et al.72 introduced an ARX-based model-learning layer wherein RL controllers operated on learned system proxies, rather than direct state histories. This structure allowed stable adaptation across a wide range of exploration noise profiles and demonstrated competitive performance, even at low speed ranges typical of complex urban environments.
Relative to a model-free baseline, the ARX-based formulation improved resilience to partial observability and system delays, representing a structured pathway for bridging model-based safety assurances with RL-based autonomy.
Observations across cases
Several patterns appear repeatedly across these implementations. Discrete decision problems, such as zone placement or merge control, tend to use Q-learning variants, while continuous control problems rely more on actor-critic methods. Applications such as wave damping and merge coordination appear closer to practical use because they require relatively low CAV penetration. Federated and large-scale coordination strategies remain at an early stage. More specifically, wave damping and single-CAV stabilization (4%–10% penetration, no infrastructure changes) are nearest to deployment; VSL implementations require V2I coordination and moderate penetration; and large-scale MARL and federated platooning depend on conditions unlikely within the next decade.
Safety benefits are commonly reported, but few studies include explicit safety verification or certification-oriented analysis. Finally, many results suggest that substantial benefits can be achieved at moderate penetration levels, often around 30%–50%, indicating that full automation is not a prerequisite for meaningful traffic improvements.
However, the reported performance range (7%–57%) cannot support meaningful cross-study comparison, as apparent differences may reflect scenario difficulty and baseline definitions as much as genuine algorithmic capability. Establishing standardized benchmarking frameworks with common baselines and evaluation protocols remains essential for credible progress assessment. Table 4 presents a summary of RL-based CAV speed control case studies.
Table 4.
Summary of RL-based CAV speed control case studies
| Study | Application domain | RL algorithm | Key innovation | Performance metrics | Test environment | CAV penetration |
|---|---|---|---|---|---|---|
| Tayab et al.14 | adaptive car-following | RL with EDE | dynamic EDE gain adjustment | 50% velocity error reduction; smoother acceleration | simulation (single/multi-vehicle) | single CAV scenarios |
| Kang et al.59 | active speed management | DQN | time-to-collision in reward function | 53% safety improvement; 59% traffic density improvement | mixed traffic simulation | 50% |
| Jiang et al.68 | stop-and-go wave dampening | deep RL | longitudinal control optimization | 54% speed oscillation reduction (CAV); 8%–28% (HDV), fuel savings | mixed traffic simulation | single CAV placement |
| Vrbanić et al.39 | VSL control | Q-Learning with 2-step TD | CAVs as actuators | improved total travel time (TTT) and mean travel time (MTT) | seven traffic scenarios | 10%–100% |
| Vrbanić et al.64 | dynamic VSL | congestion detection Q-Learning | gradient-based congestion detection, dynamic positioning | 7.6% TTS reduction, better than rule-based VSL | simulation: six traffic scenarios | 10%–100% (optimal at 30%) |
| Han et al.65 | multi-lane VSL | MARL with MADDPG | differential speed limits per lane | 12.88% TTS reduction vs. DDPG-DVSLC, 21.34% vs. uniform VSL | freeway bottleneck simulation | 30% penetration rate |
| Menegatti et al.79 | non-cooperative platooning | DRL | no V2V communication required | secure spacing; efficient flow | real-world scenarios | not specified |
| Zhang et al.73 | large-scale VSL | multi-agent RL | parameter sharing, scalability | reduced spatial speed variations | large corridors | varying |
| Borneo et al.10 | platoon control | RL-based CACC | adaptive cruise control | improved stability and adaptability | not specified | mixed traffic |
| Ameur et al.67 | distributed platooning | federated DRL | privacy-preserving learning | enhanced adaptability and performance | platoon simulation | not specified |
| Liu et al.45 | speed planning | multi-light DRL | traffic light integration | 6.79% fuel savings vs. single-light | not specified | not specified |
| Ko et al.40 | speed harmonization and merge | dual DQN | integrated merge control | 30% throughput increase; 20% fuel reduction | highway lane closures | varying CAV rates |
| Taghavifar et al.41 | behaviorally aware navigation | SARSA with SVO | social preferences integration | 55.6% collision risk reduction; 2.1× reward improvement | not specified | not specified |
| Muzahid et al.80 | chain collision prevention | DRL (actor-critic) | perception network | improved safety efficiency | Unity 3D simulation | not specified |
| Köpf et al.71 | real vehicle speed tracking | online RL | time-varying trajectory, FIR filter | outperformed traditional controllers | real vehicles | N/A |
| Puccetti et al.72 | real vehicle control | model-based RL with ARX | ARX model for partial observability | comparable to model-free RL, wider noise tolerance | real vehicles | N/A |
| Gregurić et al.81 | VSL strategies | DDPG | connected vehicle optimization | higher throughput; minimal braking; increased headway | connected vehicle env. | not specified |
| Lin et al.43 | adaptive cruise control | DDPG vs. MPC | comparative analysis | DRL 5.8% above optimal; 17.2% OOD | not specified | not specified |
| Wei et al.82 | network-level control | PPO | mixed-autonomy optimization | 1.3× better than baseline | network-level | mixed |
| Hua & Fan8 | dynamic speed harmonization | DDPG | safety-oriented DSH with differential lane control | 17.1% collision reduction; 9.62% travel time improvement | mixed freeway with incidents | 10%–90% (optimal at 50%) |
| Zeng et al.66 | CAV control | dynamic federated proximal (DFP) | federated learning for CAVs | 40% faster convergence; accurate speed tracking | real vehicular data traces | varying participation |
| Kreidieh et al.69 | wave dissipation | model-free RL | open road network control | near complete wave dissipation at 10% penetration | ring road simulation | 2.5%–10% |
| Wu et al.70 | multi-scenario traffic control | modular deep RL | modular framework for diverse scenarios | up to 57% velocity improvement | various network topologies | 4%–7% |
Conclusion
Building on the literature reviewed, this section synthesizes recurring challenges and proposes grounded research directions that reflect both technical realities and practical deployment constraints. Several of these directions are foundational to maturing RL-based CAV speed control from experimental prototypes into deployable mobility infrastructure; they span technical scalability, methodological refinement, real-world validation, and regulatory alignment, reflecting both current research trajectories and the persistent gaps that limit deployment readiness. The need for coordination across these layers aligns with recent reviews of cooperative intersection management, which emphasize integration across design, control, and deployment considerations.46
Critical technical research priorities
Scalability of MARL remains the most pressing challenge. Existing approaches, including those reported by Han et al. and Zhang et al.,65,73 demonstrate strong performance in controlled multi-agent settings; however, their computational and interaction overheads grow disproportionately with fleet size. The parameter-sharing mechanism proposed by Zhang et al.73 is an important step toward scalable coordination, yet fundamental questions remain regarding convergence guarantees, localized adaptability, and performance under heterogeneous agent behavior. Future research should focus on hierarchical architectures in which agents coordinate locally while preserving globally traffic-optimal strategies, supported by execution architectures designed for real-time operation.
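To make the parameter-sharing idea concrete, the minimal Python sketch below shows a single policy network evaluated by every agent on its own local observation, so the parameter count stays constant as the fleet grows. The observation layout and network sizes are illustrative assumptions, not the architecture of Zhang et al.73

```python
import torch
import torch.nn as nn

# One policy network shared by all agents: parameter count is constant in
# the fleet size, so adding vehicles adds inference cost but no new weights.
class SharedPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # e.g., normalized acceleration
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Each CAV evaluates the same network on its local observation (assumed
# layout: ego speed, gap to leader, relative speed to leader).
policy = SharedPolicy(obs_dim=3, act_dim=1)
local_obs = torch.rand(50, 3)   # 50 agents, batched in one forward pass
actions = policy(local_obs)     # (50, 1) accelerations in [-1, 1]
print(actions.shape)
```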
Robust learning under uncertainty represents a second critical priority. The domain mismatch between simulated training environments and deployment scenarios, identified in prior studies,74,75 continues to cause sharp performance degradation during field trials. Emerging evolutionary and diversity-preserving strategies76 demonstrate promising resilience but lack structured integration within adaptive traffic control architectures. Early physics-infused learning strategies61 likewise illustrate the value of embedding system-dynamics knowledge into the learning structure, suggesting a broader opportunity to integrate traffic theory, human-driver behavior modeling, and roadway-geometry constraints into policy learning.
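One common ingredient of such robustness is domain randomization: perturbing simulator parameters every training episode so the learned policy cannot overfit a single idealized configuration. The sketch below is a minimal illustration; the parameter names and ranges are assumptions, not values taken from the cited studies.

```python
import random

# Illustrative simulator parameters randomized per training episode so the
# policy never sees a single idealized environment configuration.
def sample_episode_params(rng: random.Random) -> dict:
    return {
        "driver_reaction_time_s": rng.uniform(0.5, 2.0),    # human-driver variability
        "comm_delay_ms":          rng.uniform(0.0, 300.0),  # V2V/V2I latency
        "packet_loss_prob":       rng.uniform(0.0, 0.2),    # degraded connectivity
        "sensor_noise_std":       rng.uniform(0.0, 0.5),    # m/s speed-measurement noise
        "demand_scale":           rng.uniform(0.7, 1.3),    # traffic-demand multiplier
    }

rng = random.Random(42)
for episode in range(3):
    params = sample_episode_params(rng)
    # env = make_env(**params)  # hypothetical environment factory
    print(episode, params)
```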
Safety-constrained learning forms the third major priority area. As highlighted by Wang et al.,60 existing RL policies cannot be certified through conventional safety verification processes because their decision boundaries are dynamic and opaque. Lyapunov-based critic networks proposed by Jiang et al.68 offer preliminary stability-oriented guarantees, but broader frameworks are not yet operationalizable at scale. Deployment will require integration of constraint-based learning layers, falsification-oriented validation frameworks, and runtime monitors capable of overruling unsafe decisions.
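As a concrete illustration of such a runtime monitor, the sketch below wraps a learned acceleration command in a time-to-collision (TTC) check and falls back to firm braking when the bound is violated. The TTC threshold and braking value are illustrative assumptions, not a certified design.

```python
def time_to_collision(gap_m: float, ego_speed: float, lead_speed: float) -> float:
    """Seconds until collision if both speeds stay constant; inf if the gap opens."""
    closing = ego_speed - lead_speed
    return gap_m / closing if closing > 1e-6 else float("inf")

def shielded_action(rl_accel: float, gap_m: float, ego_speed: float,
                    lead_speed: float, ttc_min: float = 2.0,
                    max_brake: float = -3.0) -> float:
    """Pass through the RL command unless the TTC constraint is violated,
    in which case a conservative fallback (here, firm braking) overrides it."""
    if time_to_collision(gap_m, ego_speed, lead_speed) < ttc_min:
        return max_brake   # certified fallback, independent of the learned policy
    return rl_accel        # learned behavior inside the safety envelope

# The policy wants to accelerate; TTC = 20/(30-25) = 4 s is safe...
print(shielded_action(0.8, gap_m=20.0, ego_speed=30.0, lead_speed=25.0))  # 0.8
# ...while a 5 m gap gives TTC = 1 s, so the shield brakes instead.
print(shielded_action(0.8, gap_m=5.0, ego_speed=30.0, lead_speed=25.0))   # -3.0
```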
To move beyond conceptual formulations, these research directions can be interpreted in terms of incremental deployment pathways. For example, scalable MARL architectures are most likely to emerge first in geographically constrained settings such as freeway corridors, managed lanes, or bottleneck zones, where agent interactions are localized and communication requirements are bounded. Hierarchical designs, in which local coordination is handled at the corridor or segment level while higher-level objectives are managed centrally, represent a practical intermediate step toward network-wide coordination. Similarly, safety-constrained learning is more immediately applicable to advisory or speed harmonization roles where learning-based controllers operate within conservative bounds, rather than with full autonomy.
The reviewed studies show that RL-based speed control in CAV environments is constrained by several cross-cutting limitations. Multi-agent investigations indicate that coordination complexity and communication burden grow sharply with fleet size and heterogeneity, which still prevents validation at realistic highway-scale agent counts.65,73,77 At the same time, controllers trained in idealized simulation environments frequently degrade when exposed to real traffic, exhibiting overly conservative or unstable behavior and weak generalization across demand and density patterns.74,75,78 Mixed-traffic analyses that explicitly model driver stochasticity and connectivity uncertainty further suggest that human behavioral variability remains a dominant source of instability and energy loss.63 Taken together, these factors imply that current performance figures should primarily be interpreted as proof-of-feasibility upper bounds, rather than deployment-ready outcomes, reinforcing the need for scalable, safety-assured training and validation pipelines before large-scale rollout.
Methodological innovations required
Reward design continues to be a difficult issue in RL-based traffic control. Several studies, including those of Hua and Fan, Ko et al., and Liu et al.,8,40,45 have shown that results are highly sensitive to how safety, efficiency, and energy-related objectives are weighted. In some cases, policies perform well under one setting but degrade noticeably when rewards are delayed or only indirectly linked to actions.75 This sensitivity suggests that manual reward tuning is often insufficient; more systematic approaches, such as learning rewards from historical control decisions or reusing reward structures across similar road layouts, may reduce the dependence on hand-crafted design.
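The weighted-sum pattern these studies rely on can be made explicit, as in the sketch below; the individual terms and weight values are schematic assumptions chosen to show the sensitivity, not the exact formulations of the cited works.

```python
def speed_control_reward(speed: float, target_speed: float, ttc: float,
                         fuel_rate: float, jerk: float,
                         w_safe: float = 1.0, w_eff: float = 0.5,
                         w_energy: float = 0.2, w_comfort: float = 0.1) -> float:
    """Weighted sum of safety, efficiency, energy, and comfort terms.
    Small changes to the weights can flip which behavior the policy learns,
    which is exactly the sensitivity discussed above."""
    r_safety = -1.0 if ttc < 2.0 else 0.0             # penalize unsafe headways
    r_efficiency = -abs(speed - target_speed) / max(target_speed, 1e-6)
    r_energy = -fuel_rate                             # e.g., normalized fuel use
    r_comfort = -abs(jerk)                            # discourage harsh transitions
    return (w_safe * r_safety + w_eff * r_efficiency
            + w_energy * r_energy + w_comfort * r_comfort)

print(speed_control_reward(speed=27.0, target_speed=30.0, ttc=3.5,
                           fuel_rate=0.05, jerk=0.3))  # -0.09
```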
State representation is another area where current approaches appear limited. Recent work using graph-based interaction models highlights the value of capturing spatial relationships between vehicles.7 At the same time, traffic systems operate across multiple scales: local interactions at the lane level can trigger effects that propagate across an entire corridor, and most existing representations do not capture both levels well. Extending current models to reflect these multi-scale interactions may improve robustness, particularly when traffic patterns differ from those seen during training.
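As a minimal illustration of a graph-based state, the sketch below builds an interaction graph by connecting vehicles whose longitudinal gap falls below a radius; the feature layout and the 50 m radius are assumptions for illustration, not the construction used in the cited work.7

```python
import numpy as np

# Vehicle features: [position_m, speed_mps]; rows are vehicles on one lane.
features = np.array([[0.0, 28.0],
                     [35.0, 26.0],
                     [60.0, 30.0],
                     [210.0, 31.0]])

# Connect vehicles whose longitudinal gap is below an interaction radius,
# yielding the adjacency matrix a graph network would consume alongside the
# node features. The distant vehicle at 210 m stays locally disconnected
# even though it matters for corridor-level dynamics, which is the
# multi-scale gap noted above.
radius_m = 50.0
positions = features[:, 0]
gaps = np.abs(positions[:, None] - positions[None, :])
adjacency = ((gaps < radius_m) & (gaps > 0)).astype(float)
print(adjacency)
```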
Hybrid learning approaches also warrant further attention. The work presented by Lin et al.43 showed that RL controllers can achieve performance similar to model predictive control under certain conditions, but with lower online computation. This raises the possibility of combining model-based and learning-based methods, rather than treating them as alternatives. In addition, using imitation or supervised pretraining based on existing traffic controllers or expert behavior may help reduce training time and limit unsafe exploration during early deployment stages.
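A minimal sketch of the pretraining idea follows: the policy is first fit by behavior cloning to state-action pairs from an existing controller before any RL fine-tuning. The "expert" here is a synthetic proportional rule standing in for logged controller data, purely for illustration.

```python
import torch
import torch.nn as nn

# Synthetic "expert" data: a simple proportional gap-keeping rule stands in
# for logged actions from an existing controller (assumption for illustration).
states = torch.rand(1024, 3)   # assumed layout: [gap, ego speed, lead speed]
expert_actions = (0.5 * (states[:, 0] - 0.5)
                  + 0.3 * (states[:, 2] - states[:, 1])).unsqueeze(1)

policy = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Supervised pretraining: regress the expert action so that RL fine-tuning
# later starts from a sane controller instead of random exploration.
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(policy(states), expert_actions)
    loss.backward()
    optimizer.step()
print(f"final imitation loss: {loss.item():.4f}")
```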
Real-world validation and deployment framework
A structured validation pipeline remains critical to moving beyond simulation environments. Current practices vary widely across studies, complicating scientific comparison and policy adoption. The real-vehicle implementations by Köpf et al. and Puccetti et al.71,72 provide promising early precedents, yet they do not amount to a generalizable evaluation methodology. Systematic evaluation must include the following conditions (a test-matrix sketch follows the list):
- multi-density load conditions,
- degraded sensing states,
- communication delays and loss,
- adverse weather conditions, and
- mixed autonomy compositions.
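Such conditions can be enumerated as an explicit test matrix so that every controller is scored on the same scenario grid. The sketch below is one minimal way to do this; the factor names and levels are illustrative assumptions.

```python
from itertools import product

# Illustrative evaluation grid covering the conditions listed above; the
# cross product defines the scenarios every controller must be scored on.
densities   = ["low", "medium", "high"]               # multi-density load
sensing     = ["nominal", "degraded"]                 # degraded sensing states
comm        = ["ideal", "100ms_delay", "10pct_loss"]  # delays and loss
weather     = ["clear", "rain", "fog"]                # adverse weather
penetration = [0.1, 0.3, 0.5, 0.9]                    # mixed autonomy compositions

scenarios = list(product(densities, sensing, comm, weather, penetration))
print(f"{len(scenarios)} scenarios")  # 216 combinations before replication
# for scenario in scenarios:
#     result = evaluate(controller, scenario)  # hypothetical test harness
```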
Progressive deployment architectures, beginning with digital-twin-validated training, followed by controlled track evaluations, then corridor-level deployment under supervision, appear necessary to mitigate operational risk.
Infrastructure-supported deployment will further require robust integration with cloud-edge execution stacks, real-time data synchronization pipelines, and safety certification monitors. The federated coordination frameworks introduced in Ameur et al.’s work67 provide a feasible direction for scaling learning across larger fleets while preserving privacy and resilience, though deployment models need explicit mechanisms for rollback, override, and version control of models.
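The aggregation pattern underlying such federated schemes can be sketched as plain federated averaging of policy weights, shown below as a generic FedAvg step rather than the specific algorithm of Ameur et al.67; only parameters, never raw driving data, leave each vehicle.

```python
import numpy as np

def fed_avg(client_weights: list[np.ndarray],
            client_samples: list[int]) -> np.ndarray:
    """Sample-weighted average of per-vehicle policy parameters: raw driving
    data never leaves the vehicle, only the learned weights are shared."""
    total = sum(client_samples)
    return sum(w * (n / total) for w, n in zip(client_weights, client_samples))

# Three vehicles upload locally fine-tuned parameter vectors of equal shape.
rng = np.random.default_rng(0)
local = [rng.normal(size=8) for _ in range(3)]
global_weights = fed_avg(local, client_samples=[120, 80, 200])
print(global_weights.round(3))
# A deployment-grade version would wrap this aggregation step in the model
# versioning, rollback, and override mechanisms argued for above.
```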
Policy and regulatory research priorities
Regulatory frameworks for adaptive learning systems remain significantly underdeveloped. As noted in prior work,18 certification pathways were designed primarily for deterministic controllers and make no provision for policies that continue to evolve after deployment. Research must define
- safety envelope definitions for learning controllers,
- frameworks for continuous post-deployment auditing,
- accountability allocation when system behavior emerges from adaptive learning, and
- minimum evidence thresholds enabling regulatory clearance.
Beyond safety, important socio-technical considerations arise. Automated speed control directly influences fuel expenditure, travel delays, and priority allocation in congestion. At scale, such systems inherently redistribute time and cost among road users and are therefore subject to fairness and accessibility concerns. Research must examine public trust, behavioral adaptation of human drivers, decision transparency in automated interventions, and the governance of algorithm-based roadway authority.
Strategic research roadmap
These directions point toward a gradual, staged path, rather than immediate large-scale deployment. In the near term, roughly over the next few years, work is likely to focus on improving the scalability of multi-agent learning, incorporating safety constraints more explicitly into RL policies, and developing evaluation benchmarks that allow more consistent comparison across studies. Over a longer horizon, attention will need to shift toward controlled pilot deployments, validation across different regions and traffic contexts, and closer alignment with regulatory and certification processes. Broader integration at the level of urban networks, including coordination across different transport modes and infrastructure systems, remains a longer-term objective.
From this review, we believe that progress along this path will depend on sustained collaboration across multiple fields, including RL, transportation engineering, embedded control, regulatory research, and human-centered system design. Addressing these challenges incrementally, rather than through isolated technical advances, will be necessary if RL-based speed control is to move beyond experimental settings. With sufficient validation and realistic expectations, such approaches could eventually contribute to smoother traffic flow, improved safety outcomes, and more energy-efficient operation in connected vehicle networks.
From a deployment perspective, the timeline for RL-based CAV speed control is likely to be staged, rather than uniform. In the near term, applications such as stop-and-go wave damping, speed advisory systems, and localized merge control are the most feasible, as they require low penetration rates and limited infrastructure integration. Over the medium term, corridor-level coordination using MARL may become viable as communication reliability improves and mixed-traffic modeling matures. Full network-level, learning-based traffic control, particularly under heterogeneous penetration and uncertain human behavior, remains a longer-term objective that depends on advances in scalability, verification, and regulatory acceptance.
The reviewed studies show that RL can support dynamic speed control in CAV settings, but the results are not uniform. Many RL-based approaches report improvements in safety, traffic flow, and energy-related measures compared with conventional control strategies. Reported gains vary widely, with safety improvements ranging from roughly 8% to over 50%, traffic-flow measures improving by about 7%–57%, fuel consumption reductions between 6% and 20%, and throughput increases on the order of 12%–30%. These outcomes, however, are highly sensitive to factors such as CAV penetration levels, traffic conditions, and how performance is evaluated.
Across the literature, multi-agent approaches, often based on DDPG or PPO, tend to perform more consistently than single-agent methods, particularly at moderate penetration levels of around 30%–50%. This suggests that meaningful benefits may be achievable without full automation. At the same time, reported performance often depends on controlled assumptions, and results are less consistent when conditions deviate from those used during training.
Methodologically, there has been a clear shift from early value-based methods toward policy-based and multi-agent formulations. Applications such as VSL control and platooning appear relatively mature in simulation studies. In contrast, real-world validation remains limited. Only a small fraction of studies include experiments outside simulation, and studies that do generally report weaker performance than expected. This points to an ongoing gap between simulated environments and real traffic, influenced by simplified vehicle models, communication uncertainty, human driving behavior, and environmental variability.
Several issues continue to limit near-term deployment. Scalability in multi-agent learning remains largely untested at realistic traffic scales. Safety assurance is another major concern, with only a minority of studies attempting formal verification or certification-oriented analysis. Generalization across different traffic settings is also weak, as many controllers are tuned to specific environments and require retraining when conditions change. Given these limitations, gradual deployment in controlled test environments appears necessary, supported by explicit safety mechanisms and careful monitoring of performance loss.
Taken as a whole, the combination of RL, vehicle connectivity, and increasing automation offers clear potential for improving traffic robustness and efficiency. At the same time, expectations around performance transfer should remain realistic. Progress toward deployment will depend less on further gains in simulation and more on evidence gathered under real operating conditions, along with clearer reporting of assumptions and limitations. With continued methodological refinement and cautious validation, RL-based control can move closer to practical use within future CAV-enabled traffic systems.
Acknowledgments
The research was financially supported by the Deanship of Scientific Research (DSR) at the University of Tabuk, Tabuk, Saudi Arabia, under grant no. 0181-1442-S. In addition, this article is derived from a research grant funded by the Research, Development, and Innovation Authority (RDIA), Kingdom of Saudi Arabia (grant no. 13010-Tabuk-2023-UT-R-3-1-SE).
Author contributions
T.A. and F.A.N. conceived the study and led the research design and methodology; T.A. performed the primary analysis and drafted the original manuscript; S.A., M.A., and M.M. supported data collection and validation; F.A. and M.G. contributed to technical support, result verification, and critical manuscript review. All authors participated in the discussion and interpretation of results and manuscript revision and approved the final manuscript.
Declaration of interests
The authors declare no competing interests.
Contributor Information
Tareq Alhmiedat, Email: t.alhmiedat@ut.edu.sa.
Fady Alnajjar, Email: fady.alnajjar@uaeu.ac.ae.
References
- 1. Texas A&M Transportation Institute. Urban Mobility Report. 2023. https://mobility.tamu.edu/umr/
- 2. Thorne E., Chong Ling E., Phillips W. Assessment of the Economic Costs of Vehicle Traffic Congestion in the Caribbean: A Case Study of Trinidad and Tobago. United Nations Economic Commission for Latin America and the Caribbean (CEPAL); 2024.
- 3. Sokido D.L. Measuring the level of urban traffic congestion for sustainable transportation in Addis Ababa, Ethiopia, the cases of selected intersections. Front. Sustain. Cities. 2024;6.
- 4. Mukherjee A., Anwaruzzaman A.K.M. Gridlock gloom: A geographical analysis of commuters’ perceptions on traffic congestion. International Journal of Human Capital in Urban Management. 2024;9:617–636.
- 5. Sitati C. A street-level assessment of greenhouse gas emissions associated with traffic congestion in the city of Nairobi, Kenya. 2024.
- 6. Zhang S., Shi J., Huang Y., Shen H., He K., Chen H. Investigating the effect of dynamic traffic distribution on network-wide traffic emissions: An empirical study in Ningbo, China. PLoS One. 2024;19. doi: 10.1371/journal.pone.0305481.
- 7. Yang J., Wang P., Golpayegani F., Shen W. DVS-RG: Differential Variable Speed Limits Control using Deep Reinforcement Learning with Graph State Representation. arXiv. 2024. Preprint. doi: 10.48550/arXiv.2405.09163.
- 8. Hua C., Fan W.D. Safety-Oriented Dynamic Speed Harmonization of Mixed Traffic Flow in Nonrecurrent Congestion. Physica A: Statistical Mechanics and its Applications. 2024;634. doi: 10.1016/j.physa.2023.129439.
- 9. Jiang Z., Wang Y., Wang J., Fu X. Evaluation of Connected and Autonomous Vehicles for Congestion Mitigation: An Approach Based on the Congestion Patterns of Road Networks. J. Transp. Eng. Part A: Systems. 2024;150. doi: 10.1061/jtepbs.teeng-8121.
- 10. Borneo A., Zerbato L., Miretti F., Tota A., Galvagno E., Misul D.A. Platooning Cooperative Adaptive Cruise Control for Dynamic Performance and Energy Saving: A Comparative Study of Linear Quadratic and Reinforcement Learning-Based Controllers. Applied Sciences. 2023;13:10459. doi: 10.3390/app131810459.
- 11. Zhou M., Yu Y., Qu X. Development of an efficient driving strategy for connected and automated vehicles at signalized intersections: A reinforcement learning approach. IEEE Trans. Intell. Transp. Syst. 2020;21:433–443.
- 12. Ha P.Y.J., Chen S., Dong J., Du R., Li Y., Labi S. Leveraging the Capabilities of Connected and Autonomous Vehicles and Multi-Agent Reinforcement Learning to Mitigate Highway Bottleneck Congestion. arXiv. 2020. Preprint. doi: 10.48550/arXiv.2010.05436.
- 13. Narasimhan N., Mall N., Rajeev A.M., Shanmugasundaram A.K., Tumuluru V.K. Traffic Congestion Management Using Deep Reinforcement Learning and Decentralized Routing. 2024.
- 14. Tayab A., Li Y., Syed A. Reinforcement Learning-Based Approach to Reduce Velocity Error in Car-Following for Autonomous Connected Vehicles. Machines. 2024;12:861. doi: 10.3390/machines12120861.
- 15. Peng X., Gao H., Wang H., Zhang H.M. Combat Urban Congestion via Collaboration: Heterogeneous GNN-based MARL for Coordinated Platooning and Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2025.
- 16. Hasan A., Chakraborty N., Chen H., Cho J.-H., Wu C., Driggs-Campbell K. Cooperative Advisory Residual Policies for Congestion Mitigation. J. Autonom. Transport. Syst. 2024;2:1–31.
- 17. Dogo E.M., Makaba T., Afolabi O.J., Ajibo A. Combating Road Traffic Congestion with Big Data: A Bibliometric Review and Analysis of Scientific Research. Springer; Cham: 2021.
- 18. Sebastian A.M., Athulram K.R., Michael C., Sunil D.A., Preetha K.G. Enhancing Traffic Control Strategies through Dynamic Simulation and Reinforcement Learning. 2024.
- 19. Daza I.G., Izquierdo R., Martínez L.M., Benderius O., Llorca D.F. Sim-to-real transfer and reality gap modeling in model predictive control for autonomous driving. Appl. Intell. 2023;53:12719–12735.
- 20. Amini M.H., Nejati S. Bridging the Gap between Real-world and Synthetic Images for Testing Autonomous Driving Systems. In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. IEEE; 2024. pp. 732–744.
- 21. Gokasar I., Timurogullari A., Deveci M., Garg H. SWSCAV: Real-time traffic management using connected autonomous vehicles. ISA Trans. 2023;132:24–38. doi: 10.1016/j.isatra.2022.06.025.
- 22. Huang M., Jiang Z.P., Ozbay K. Learning-Based Adaptive Optimal Control for Connected Vehicles in Mixed Traffic: Robustness to Driver Reaction Time. IEEE; 2020.
- 23. Wu X., Yue W., Sha Z., Feng Y. CAV as a Mobile Control Platform: A Paradigm for Traffic Management on Highways. IEEE; 2024.
- 24. Guo Y., Ma J., Leslie E., Huang Z. Evaluating the Effectiveness of Integrated Connected Automated Vehicle Applications Applied to Freeway Managed Lanes. IEEE Trans. Intell. Transp. Syst. 2022;23:522–536. doi: 10.1109/TITS.2020.3012678.
- 25. Ghiasi A., Li X., Ma J. A mixed traffic speed harmonization model with connected autonomous vehicles. Transport. Res. C Emerg. Technol. 2019;104:210–233. doi: 10.1016/j.trc.2019.05.005.
- 26. Fu Z., Kreidieh A.R., Wang H., Lee J.W., Monache M.L.D., Bayen A.M. Cooperative Driving for Speed Harmonization in Mixed-Traffic Environments. IEEE; 2023.
- 27. Kušić K., Ivanjko E., Gregurić M., Miletić M. An overview of reinforcement learning methods for variable speed limit control. Applied Sciences. 2020;10:4917.
- 28. Gao X., Li X., Liu Q., Li Z., Yang F., Luan T. Multi-agent decision-making modes in uncertain interactive traffic scenarios via graph convolution-based deep reinforcement learning. Sensors. 2022;22:4586. doi: 10.3390/s22124586.
- 29. Khamis M.A., Gomaa W. Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Eng. Appl. Artif. Intell. 2014;29:134–151.
- 30. Rivera Cardoso A., Wang H., Xu H. Large scale Markov decision processes with changing rewards. Adv. Neural Inform. Proc. Syst. 2019;32.
- 31. Shi H., Zhou Y., Wu K., Wang X., Lin Y., Ran B. Connected automated vehicle cooperative control with a deep reinforcement learning approach in a mixed traffic environment. Transport. Res. C Emerg. Technol. 2021;133. doi: 10.1016/j.trc.2021.103421.
- 32. Irshayyid A., Chen J., Xiong G. A review on reinforcement learning-based highway autonomous vehicle control. Green Energy Intell. Transp. 2024;3.
- 33. Li L., Gan J., Qu X., Lu W., Mao P., Ran B. A Dynamic Control Method for CAVs Platoon Based on the MPC Framework and Safety Potential Field Model. KSCE J. Civ. Eng. 2021;25:1874–1886. doi: 10.1007/s12205-021-1585-5.
- 34. Wang X., Li F., Tang Y., Peng X., Ni J. A Deep Reinforcement Learning Approach for Optimized Speed Planning of Connected and Autonomous Vehicles. IEEE; 2024.
- 35. Dong J., Chen S., Ha P.Y.J., Li Y., Labi S. A DRL-based multiagent cooperative control framework for CAV networks: A graphic convolution Q network. arXiv. 2020. Preprint. doi: 10.48550/arXiv.2010.05437.
- 36. Sharma R., Garg P. Optimizing Autonomous Driving with Advanced Reinforcement Learning: Evaluating DQN and PPO. In: 2024 5th International Conference on Smart Electronics and Communication (ICOSEC). IEEE; 2024. pp. 910–914.
- 37. Fang S., Yang L., Shang W.L., Zhao X., Li F., Ochieng W. Cooperative Control Model Using Reinforcement Learning for Connected and Automated Vehicles and Traffic Signal Light at Signalized Intersections. IEEE; 2025.
- 38. Han S., Zhou S., Wang J., Pepin L., Ding C., Fu J., Miao F. A multi-agent reinforcement learning approach for safe and efficient behavior planning of connected autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2024;25:3654–3670.
- 39. Vrbanić F., Ivanjko E., Mandzuka S., Miletic M. Reinforcement Learning Based Variable Speed Limit Control for Mixed Traffic Flows. IEEE; 2021.
- 40. Ko B., Ryu S., Park B.B., Son S.H. Speed harmonisation and merge control using connected automated vehicles on a highway lane closure: a reinforcement learning approach. IET Intell. Transp. Syst. 2020;14:947–957. doi: 10.1049/iet-its.2019.0709.
- 41. Taghavifar H., Hu C., Wei C., Mohammadzadeh A., Zhang C. Behaviorally-aware Multi-Agent RL with Dynamic Optimization for Autonomous Driving. IEEE Trans. Autom. Sci. Eng. 2025;22:10672–10683. doi: 10.1109/tase.2025.3527327.
- 42. Rhanizar A., El Akkaoui Z. A Survey About Learning-Based Variable Speed Limit Control Strategies: RL, DRL and MARL. In: Modern Artificial Intelligence and Data Science 2024: Tools, Techniques and Systems; 2024. pp. 565–580.
- 43. Lin Y., McPhee J., Azad N.L. Longitudinal Dynamic versus Kinematic Models for Car-Following Control Using Deep Reinforcement Learning. IEEE; 2019. pp. 1504–1510.
- 44. Han Y., Wang M., Leclercq L. Leveraging reinforcement learning for dynamic traffic control: A survey and challenges for field implementation. Communications in Transportation Research. 2023;3.
- 45. Liu B., Sun C., Wang B., Sun F. Adaptive Speed Planning of Connected and Automated Vehicles Using Multi-Light Trained Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2022;71:3533–3546. doi: 10.1109/TVT.2021.3134372.
- 46. Yao Z., Zhao Y., Jiang H., Jiang Y. A Review of Cooperative Intersection: From Design to Management. IEEE; 2025.
- 47. Li Z., Chen S., Xiao G., Jiang Y., Yao Z., Yang P. A heterogeneous agent reinforcement learning approach with curriculum learning for variable speed limit control. Expert Syst. Appl. 2026;299.
- 48. Elmorshedy L., Smirnov I., Abdulhai B. Freeway congestion management on multiple consecutive bottlenecks with RL-based headway control of autonomous vehicles. IET Intell. Transp. Syst. 2024;18:1137–1163.
- 49. Vrbanić F., Ivanjko E., Kušić K., Čakija D. Variable speed limit and ramp metering for mixed traffic flows: A review and open questions. Applied Sciences. 2021;11:2574.
- 50. Balador A., Bazzi A., Hernandez-Jayo U., de la Iglesia I., Ahmadvand H. A survey on vehicular communication for cooperative truck platooning application. Vehicular Communications. 2022;35.
- 51. Ha P.Y.J., Chen S., Dong J., Labi S. Leveraging vehicle connectivity and autonomy for highway bottleneck congestion mitigation using reinforcement learning. Transportmetrica: Transport Science. 2025;21.
- 52. Yang H., Rakha H. Feedback control speed harmonization algorithm: Methodology and preliminary testing. Transport. Res. C Emerg. Technol. 2017;81:209–226.
- 53. Kusari A., Li P., Yang H., Punshi N., Rasulis M., Bogard S., LeBlanc D.J. Enhancing SUMO simulator for simulation based testing and validation of autonomous vehicles. In: 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE; 2022. pp. 829–835.
- 54. Al-Msari H., Koting S., Ahmed A.N., El-Shafie A. Review of driving-behaviour simulation: VISSIM and artificial intelligence approach. Heliyon. 2024;10. doi: 10.1016/j.heliyon.2024.e25936.
- 55. Hussain Q., Alhajyaseen W.K.M., Pirdavani A., Reinolsmann N., Brijs K., Brijs T. Speed perception and actual speed in a driving simulator and real-world: A validation study. Transport. Res. F Traffic Psychol. Behav. 2019;62:637–650.
- 56. Wu Y., Li L., Jiang C., Jiang Y., Yao Z. The impact of selfish driving behavior of autonomous vehicles on mixed traffic flow. Phys. Stat. Mech. Appl. 2025;672.
- 57. Lee S.C., Stojmenova K., Sodnik J., Schroeter R., Shin J., Jeon M. Localization vs. internationalization: research and practice on autonomous vehicles across different cultures. In: Proceedings of the 11th International Conference on Automotive User Interfaces and Interactive Vehicular Applications: Adjunct Proceedings. 2019. pp. 7–12.
- 58. Ahmad M.I., Gao Y., Alnajjar F., Shahid S., Mubin O. Emotion and memory model for social robots: a reinforcement learning based behaviour selection. Behav. Inf. Technol. 2022;41:3210–3236.
- 59. Kang K., Park N., Park J., Abdel-Aty M. Deep Q-network Learning-based Active Speed Management under Autonomous Driving Environments. Computer-Aided Civil and Infrastructure Engineering. 2024.
- 60. Wang Z., Hong T. Reinforcement learning for building controls: The opportunities and challenges. Appl. Energy. 2020;269.
- 61. Shi H., Chen D., Zheng N., Wang X., Zhou Y., Ran B. A deep reinforcement learning based distributed control strategy for connected automated vehicles in mixed traffic platoon. Transport. Res. C Emerg. Technol. 2023;148. doi: 10.1016/j.trc.2023.104019.
- 62. Ma X., He X. Providing real-time en-route suggestions to CAVs for congestion mitigation: A two-way deep reinforcement learning approach. Transp. Res. Part B Methodol. 2024;189. doi: 10.1016/j.trb.2024.103014.
- 63. Zhou J., Yan L., Liang J., Yang K. Enforcing cooperative safety for reinforcement learning-based mixed-autonomy platoon control. IEEE Trans. Intell. Transp. Syst. 2026;27:1592–1605.
- 64. Vrbanić F., Gregurić M., Miletic M., Ivanjko E. Reinforcement Learning-Based Dynamic Zone Positions for Mixed Traffic Flow Variable Speed Limit Control with Congestion Detection. Machines. 2023;11. doi: 10.3390/machines11121058.
- 65. Han L., Zhang L., Guo W. Multi-Agent Deep Reinforcement Learning for Multi-Lane Freeways Differential Variable Speed Limit Control in Mixed Traffic Environment. Transp. Res. Rec. 2024;2678:749–763. doi: 10.1177/03611981241230524.
- 66. Zeng T., Semiari O., Chen M., Saad W., Bennis M. Federated learning on the road: autonomous controller design for connected and autonomous vehicles. IEEE Trans. Wirel. Commun. 2022;21:10407–10423.
- 67. Ameur M.E.A., Drias H., Brik B., Ameur M. Leveraging Transfer Learning with Federated DRL for Autonomous Vehicles Platooning. 2024.
- 68. Jiang L., Xie Y., Wen X., Chen D., Li T., Evans N.G. Dampen the stop-and-go traffic with connected and automated vehicles – a deep reinforcement learning approach. In: 2021 7th International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS). IEEE; 2021. pp. 1–6.
- 69. Kreidieh A.R., Wu C., Bayen A.M. Dissipating stop-and-go waves in closed and open networks via deep reinforcement learning. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE; 2018. pp. 1475–1480.
- 70. Wu C., Kreidieh A.R., Parvate K., Vinitsky E., Bayen A.M. Flow: A modular learning framework for mixed autonomy traffic. IEEE Trans. Robot. 2022;38:1270–1286.
- 71. Köpf F., Puccetti L., Rathgeber C., Hohmann S. Reinforcement Learning for Speed Control with Feedforward to Track Velocity Profiles in a Real Vehicle. In: International Conference on Intelligent Transportation Systems; 2020.
- 72. Puccetti L., Yasser A., Rathgeber C., Becker A., Hohmann S. Speed Tracking Control Using Model-Based Reinforcement Learning in a Real Vehicle. In: IEEE Intelligent Vehicles Symposium; 2021. pp. 1213–1219.
- 73. Zhang Y., Quiñones-Grueiro M., Barbour W., Zhang Z., Scherer J., Biswas G., Work D.B. Cooperative Multi-Agent Reinforcement Learning for Large Scale Variable Speed Limit Control. 2023.
- 74. Zhang Y., Zhang Z., Quiñones-Grueiro M., Barbour W., Weston C., Biswas G., Work D.B. Field Deployment of Multi-Agent Reinforcement Learning Based Variable Speed Limit Controllers. arXiv. 2024. Preprint. doi: 10.48550/arXiv.2407.08021.
- 75. Feng J., Shi T., Wu Y., Xie X., He H., Tan H. Multi-Lane Differential Variable Speed Limit Control via Deep Neural Networks Optimized by an Adaptive Evolutionary Strategy. Sensors. 2023;23. doi: 10.3390/s23104659.
- 76. Feng J., Lin K., Shi T., Wu Y., Wang Y., Zhang H., Tan H. Cooperative traffic optimization with multi-agent reinforcement learning and evolutionary strategy: Bridging the gap between micro and macro traffic control. Physica A: Statistical Mechanics and its Applications. 2024;647. doi: 10.1016/j.physa.2024.129734.
- 77. Hua M., Chen D., Qi X., Jiang K., Liu Z., Zhou Q., Xu H. Multi-Agent Reinforcement Learning for Connected and Automated Vehicles Control: Recent Advancements and Future Prospects. arXiv. 2023. Preprint. doi: 10.48550/arXiv.2312.11084.
- 78. Zhang K., Cui Z., Ma W. A survey on reinforcement learning-based control for signalized intersections with connected automated vehicles. Transp. Rev. 2024;44:1187–1208.
- 79. Menegatti D., Wrona A., Paola A.D., Gentile S., Giuseppi A. Deep Reinforcement Learning Platooning Control of Non-Cooperative Autonomous Vehicles in a Mixed Traffic Environment. 2024.
- 80. Muzahid A.J.M., Kamarulzaman S.F., Rahman M.A., Alenezi A.H. Deep reinforcement learning-based driving strategy for avoidance of chain collisions and its safety efficiency analysis in autonomous vehicles. IEEE Access. 2022;10:43303–43319.
- 81. Gregurić M., Kušić K., Ivanjko E. Impact of deep reinforcement learning on variable speed limit strategies in connected vehicles environments. Eng. Appl. Artif. Intell. 2022;112.
- 82. Wei H., Liu X., Mashayekhy L., Decker K. Mixed-autonomy traffic control with proximal policy optimization. In: 2019 IEEE Vehicular Networking Conference (VNC). IEEE; 2019. pp. 1–8.

