Scientific Reports. 2025 Sep 25;15:32862. doi: 10.1038/s41598-025-18449-1

Hierarchical reinforcement learning-based traffic signal control

Jiajing Shen
PMCID: PMC12464201  PMID: 40998910

Abstract

Efficient traffic light control is a critical issue in urban transportation systems. Recently, deep reinforcement learning (DRL) has gained popularity as a method for real-time traffic light control. However, state-of-the-art DRL-based systems typically construct individual RL agents for each intersection, each focusing on local objectives, potentially undermining overall global traffic efficiency. To overcome this limitation, we propose SHLight (Sample selection-based Hierarchical traffic Light control method), an innovative RL method featuring a hierarchical framework comprising a manager and multiple workers. Specifically, we partition the traffic network into multiple regions, assigning a manager agent to oversee each region, with traffic signal controllers acting as workers. The manager sets goals for the workers based on the overarching global objective, empowering the workers to achieve these goals while concurrently optimizing local traffic efficiency. Our method introduces several key innovations: it employs importance sampling to mitigate the non-stationarity problem inherent in hierarchical reinforcement learning, and it incorporates auxiliary actions to enhance the observability of the DRL model. Our extensive simulations conducted on both synthetic and real-world road networks demonstrate that SHLight outperforms the state-of-the-art models, showcasing reduced queue lengths, waiting times, and delays.

Keywords: Intelligent traffic signal control, Hierarchical deep reinforcement learning, Multi-agent reinforcement learning

Subject terms: Computer science, Information technology

Introduction

Traffic congestion in urban areas is on a continual rise, emphasizing the critical need for efficient traffic management systems1,2. The effective management of traffic congestion offers a multitude of advantages. Three of the main benefits include alleviating congestion, reducing the probability of the occurrence of traffic accidents, and enabling efficient and reasonable route planning. While modern sensing technologies have enabled the ubiquitous deployment of adaptive traffic signals, critical challenges persist in developing coordination mechanisms that can simultaneously optimize local responsiveness and global traffic fluidity. This raises the pivotal question of how to leverage traffic signals efficiently to enhance traffic signal management decisions, thereby mitigating traffic congestion and improving traffic flow management.

Intelligent traffic signal control systems (TSCS) offer promising solutions for optimizing traffic flow, reducing travel time, and minimizing traffic jams by employing effective signal control and coordination. However, existing TSCS predominantly rely on pre-defined fixed-time plans, which often result in inefficient traffic management. While some works have formulated traffic light control as optimization problems and developed various algorithms, these algorithms typically assume uniform arrival rates and unlimited lane capacity, which does not align with practical scenarios. Owing to the complex nature of traffic systems and traffic dynamics, an effective traffic signal control policy in reducing traffic congestion needs to be capable of being adaptive to real-time traffic flows and traffic sensory data. Therefore, it is necessary to develop efficient control methods to better solve this real-world challenge.

Recently, deep reinforcement learning (DRL)-based TSCS have been introduced and have gained popularity for real-time traffic light control3–5. In a typical DRL-based TSCS, such as IntelliLight6, the environment consists of signal phases and traffic conditions, with the state serving as a feature representation of the environment. An agent utilizes the state as the input to learn a decision-making model, choosing an action from a binary action space: “keep” or “change” the current phase of traffic signals. This action updates the environment, and the agent receives a reward, such as the number of vehicles passing through the intersection, to evaluate the effectiveness of the action. However, existing methods often develop agents based on local objectives, which may fall short of achieving global optimization.

Setting an accurate goal that enables the optimization of traffic flow beyond the confines of the local environment poses a significant challenge, as previously discussed. To address this issue, hierarchical reinforcement learning (HRL) has been employed7–10. HRL breaks down the entire task by building a two-level framework, in which the upper-level policy is called the manager and the lower-level policy the worker. Specifically, the manager generates sub-goals aimed at achieving global targets over an extended time horizon, while the workers execute actions to fulfill these sub-goals while addressing their specific local objectives. Drawing inspiration from this concept, we introduce a hierarchical structure tailored for cooperative signal control.

In this paper, we propose SHLight, a novel HRL approach designed for cooperative TSC. Initially, we partition the entire traffic network into multiple regions overseen by a manager agent, with each intersection within a region managed by a worker agent. Managers across different regions and workers within the region collaborate closely to enhance traffic efficiency. At each intersection, the local goal is to improve traffic efficiency, while the global goal focuses on enhancing regional traffic efficiency. Through the implementation of a hierarchical structure, SHLight effectively coordinates different intersections, concurrently optimizing both global and local traffic efficiency, thus improving the traffic efficiency of the entire road network. Each manager agent collaborates with other managers to set goals for its workers, who then strive to achieve these goals while enhancing local traffic efficiency. This hierarchical approach ensures a harmonious synchronization of the workers’ local targets with the manager’s global target, leading to improved overall traffic management.

Extensive experiments have been conducted on both synthetic and real-world networks to assess the performance of our method. The results demonstrate that the proposed approach effectively optimizes global traffic management and surpasses state-of-the-art RL methods. The main contributions of this study can be summarized as follows:

  • We develop a novel HRL-based TSCS that decomposes the complex task in a top-down manner, with the manager focusing on global objectives and providing cooperative goals, while the workers strive to achieve these goals alongside their local objectives.

  • We propose a dual actor-critic fusion mechanism to optimize the global goals provided by the manager for each region, which enhances the coordination across different intersections in various regions.

  • We design a sample selection method to ensure the consistency of the regional goal and the intersection state, so that the control of the intersection can more effectively optimize the overall regional traffic flow.

  • We evaluate SHLight on both synthetic and real-world road networks, and our extensive simulation results validate that SHLight outperforms the compared methods, leading to superior performance.

The remaining parts of this paper are organized as follows. Section Related Work briefly reviews related work on traffic signal control. Section Preliminaries and Problem Statement defines the problem setting of the traffic signal control problem, and Section Solution Detail gives the details of our framework SHLight. Section Performance Evaluation presents the experimental results where we evaluate the performance of the proposed method. Finally, in Section Conclusion, we present our conclusions and indicate some future directions.

Related work

Traditional traffic signal control

Traditional traffic signal control methods are derived from traffic control theory3. The Webster method2,11 and SCATS12 alter the phase time intervals between neighboring intersections, whose cycle lengths are assumed to be the same. However, they are insensitive to real-time traffic changes13 since their control policies are based on historical traffic movements. The max-pressure approach6 minimizes the “pressure” (over-saturation on certain lanes) of the phases at an intersection, but it assumes that downstream lanes have unlimited capacity, which is impractical. None of these methods adapt to changing traffic conditions.
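To make the max-pressure idea concrete, the following is a minimal sketch (not the paper's method): each phase serves a set of traffic movements, and the controller activates the phase whose served movements have the largest queue imbalance between upstream and downstream lanes. The phase names and queue values are hypothetical example data.

```python
# Sketch of max-pressure phase selection. A movement is a pair
# (upstream_queue, downstream_queue); the pressure of a phase is the
# summed imbalance over the movements it serves.
def pressure(movements):
    """movements: list of (upstream_queue, downstream_queue) tuples."""
    return sum(up - down for up, down in movements)

def max_pressure_phase(phases):
    """phases: dict mapping phase name -> list of served movements."""
    return max(phases, key=lambda p: pressure(phases[p]))

phases = {
    "NS-straight": [(8, 2), (6, 1)],   # pressure = 6 + 5 = 11
    "EW-straight": [(3, 0), (4, 5)],   # pressure = 3 - 1 = 2
}
print(max_pressure_phase(phases))  # NS-straight
```

Note the assumption the text criticizes: downstream queues only *reduce* pressure, implicitly treating downstream capacity as unlimited.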

Learning-based traffic signal control

In contrast to traditional approaches, deep reinforcement learning (DRL)-based traffic signal control does not require pre-defined traffic signal plans and can be learned directly from intersections in the real world3,14. DRL-based methods treat each intersection as an agent, use a quantitative representation of traffic conditions as state, and traffic signal choice as action.

For instance, IntelliLight4 optimizes signal light control for a single intersection using the feature representation of current crossroad traffic conditions as input and recommends a phase as output. Previous work5 uses a graph convolution network (GCN) to represent the lane structure of an intersection, and work15 refines the phase decision of each intersection using RL models with a continuous phase duration output. The recent work16 introduces a safe reinforcement learning-based fully adaptive multimodal traffic signal controller to resolve public transport priority request conflicts and ensure traffic safety. However, all these methods focus on optimizing a single intersection without considering communication between intersections in the road network.

Multi-agent-based TSCS Some studies have exploited multi-agent DRL (MADRL) to control traffic lights at multiple intersections within a region. For example, the previous works5,11,17,18 and19 proposed TSCS that share real-time traffic information of intersections within a particular region; traffic information here refers to traffic flow, waiting time, queue length, signal color, etc. Because the agents in these systems receive the traffic conditions of other intersections as part of the state input, they coordinate better with their counterparts at neighboring intersections. To adapt to large-scale road networks, the work20 controls traffic signals in a distributed manner and enhances collaboration among agents by introducing GCN and self-attention mechanisms. For heterogeneous intersections, the work21 introduces a value-decomposition-based spatiotemporal graph attention multi-agent DRL model that expressly accommodates intersection heterogeneity. However, these MARL methods lack sufficient coordination mechanisms among agents with flat learning, resulting in suboptimal performance.

HRL-based TSCS Recent studies have shown that HRL can effectively address complex tasks by decomposing them into simpler sub-tasks in a top-down manner. For instance, FMA2C10 proposes a novel MARL approach combined with a feudal hierarchy. However, since this work uses an on-policy approach, there is no guarantee that the optimal policy can be found. In contrast, HALight22 employs a hierarchical structure consisting of a manager and several workers to improve traffic efficiency on arterial roads. Nevertheless, its hierarchical structure introduces non-stationarity, which can hinder performance. Another important problem with the above works is that the upper-level agents make decisions based only on macroscopic regional information, so the obtained regional optimization goal may not be optimal. MAHPPO23 uses information-constrained primitives (ICP)24 to achieve hierarchical control, thereby enabling more targeted strategy learning. To facilitate policy transferability, it adopts a shared neural network module across all agents. However, training such a general model prevents the signal control from reaching the optimum.

Preliminaries and problem statement

In this section, we first briefly introduce preliminaries, then establish the key definitions and formulate the problem statement.

Preliminaries

Multi-agent reinforcement learning (MARL) In general, RL focuses on a single agent and solves the tasks corresponding to that agent. However, many real-world problems, such as traffic signal control25 and express package distribution26, involve multiple agents and usually cannot be solved independently; their solutions are frequently combinations of partial solutions to their sub-problems. We call them “multi-agent learning problems”27. The agents in MARL learn cooperatively by taking other agents’ policies and observations as part of their own input rather than explicitly communicating among themselves. MADDPG extends DDPG into a multi-agent policy gradient algorithm in which decentralized agents learn a centralized critic from the observations and actions of all agents. The critic is augmented with extra information about the policies of other agents, while each actor only has access to local information. Once the MADDPG model is trained, only the local actors are used during the execution phase, acting in a decentralized manner.

Hierarchical RL (HRL) HRL decomposes complex tasks into multiple sub-tasks in a hierarchical manner, enhancing the decision-making capacity of the overall model. Typically, HRL involves the division of decision-making into two layers, where each upper-layer manager supervises multiple lower-layer workers to establish hierarchical control28. The manager’s role encompasses setting long-term global objectives and offering goals or options to the workers. Subsequently, the workers execute actions within the environment to accomplish these goals8,29. In this study, we implement a goal-based hierarchical RL approach for traffic signal control.

Definitions

Definition 1

(Road Network) A road network is represented as a graph G = (V, E), where V denotes the set of intersection vertices and E represents the set of road edges. Each road edge e = (v_i, v_j) ∈ E connects two distinct intersection vertices v_i and v_j, where v_i ≠ v_j. Each road edge comprises at least one lane with a specific driving direction, which can be left-turn, straight, right-turn, or mixed.

Definition 2

(Phases) A phase refers to a valid combination of traffic signals (green/red) for the lanes at an intersection.

Figure 1 depicts a standard four-way intersection, consisting of four entrance roads (gray) and four departure roads (white), each with three lanes: left-turn, straight, and right-turn. The 12 lanes are numbered from 1 to 12 in a clockwise fashion. The intersection also involves four directions: north (N), south (S), west (W), and east (E). The traffic signals control straight movements (E-W, W-E, S-N, and N-S), left turns (E-N, W-S, N-W, and S-E), and right turns (E-S, W-N, N-E, and S-W). We illustrate phase examples as follows. Using lane numbers (1...12) and colors (either red or green) for the signal lights, we identify a phase as ’rrrggrrrrggr’, indicating that vehicles can move in the directions of W-N, W-E, E-W, and E-S.

Fig. 1.


A standard four-way intersection.
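The phase-string encoding described above can be decoded mechanically. The following sketch (assuming the clockwise 1–12 lane numbering of Fig. 1) extracts the lanes shown green by a phase string:

```python
# Decode a phase string such as 'rrrggrrrrggr' into the 1-based indices of
# lanes whose signal is green, per the clockwise lane numbering assumed above.
def green_lanes(phase: str) -> list[int]:
    """Return 1-based indices of lanes with a green ('g') signal."""
    return [i + 1 for i, c in enumerate(phase) if c == "g"]

print(green_lanes("rrrggrrrrggr"))  # [4, 5, 10, 11]
```

Each green lane then corresponds to one permitted movement, so a 12-character string with four 'g' characters encodes a phase permitting four movements, matching the example in the text.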

By deploying sensors such as road side units (RSUs) and cameras at road intersections, we can collect real-time traffic data for every lane, including the number of vehicles entering the lane. Our objective is to develop an MADRL-based traffic light control and coordination framework that maximizes the overall traffic flow efficiency at a set of intersections. The framework aims to process raw traffic data, extract meaningful features, and make informed decisions regarding traffic signals to optimize traffic flow.

Problem 1

Consider a road network G = (V, E) with n regular intersections {I_1, …, I_n}. The goal is to develop a traffic light control policy that optimizes traffic flow by minimizing queue lengths, waiting times, and delays, thereby improving the overall traffic condition of the intersection network efficiently and effectively.

Solution detail

In this section, we provide a detailed exposition of the SHLight framework. In order to achieve the local goals of the intersection and the global goals of the region at the same time, we introduce a hierarchical structure, where a region is controlled by a manager, and each intersection is controlled by a worker. This hierarchical design enables the manager to set coordinated goals that prioritize regional traffic efficiency, while fostering collaboration among workers to achieve both regional goals and their individual local targets. Subsequent subsections provide a detailed breakdown of the SHLight framework, including an analysis of the challenges, an overview of the hierarchical architecture, and a thorough explanation of each component’s role and functionality.

Challenges

The hierarchical control structure has been employed in previous HRL-based approaches to avoid getting stuck in local optima. However, these approaches are hindered by several challenges that limit their effectiveness.

Non-stationarity

Since the lower-level policy is continuously evolving, experiences collected for a particular high-level action in the past may not generalize to the same low-level behavior in the future. Consequently, this renders the experiences less valid for training, leading to a non-stationary problem for the higher-level policy.

Partial observability

While distributing global control to local RL agents improves scalability, it introduces partial observability due to limited communication among agents. This means that each local agent has only a partial view of the environment, which increases the difficulty of learning and decision-making.

Solution overview

To address the challenges outlined above, we propose an HRL-based framework called SHLight, as depicted in Fig. 2. The framework is tailored for a road network partitioned into regions using METIS30. In each region of n intersections, our hierarchical structure comprises a manager and n workers, each worker corresponding to an intersection.
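The paper uses METIS for the partitioning step; as a self-contained stand-in for illustration only, the toy sketch below tiles a grid network into rectangular regions (METIS would instead balance edge cuts on an arbitrary road graph — the function and its parameters are assumptions for this example):

```python
import math

# Toy region partition for a rows x cols grid of intersections: each region
# covers up to rr x rc intersections. This mimics only the *output* of the
# partition step (intersection -> region id), not METIS's edge-cut balancing.
def grid_partition(rows: int, cols: int, rr: int, rc: int) -> dict:
    regions_per_row = math.ceil(cols / rc)
    return {(r, c): (r // rr) * regions_per_row + (c // rc)
            for r in range(rows) for c in range(cols)}

# A 5x5 synthetic network split into four regions of up to 3x3 intersections.
p = grid_partition(5, 5, 3, 3)
print(sorted(set(p.values())))  # [0, 1, 2, 3]
```

Each region id would then be assigned one manager agent, with one worker per intersection inside it.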

Fig. 2.


Framework of SHLight hierarchy.

The manager assumes the responsibility of setting regional targets and operates at a higher level. At intervals of every T time steps, it obtains the goal according to a regional feature that reflects the traffic conditions in the region, and improves regional traffic efficiency by achieving this goal. The global goal is then fed to the corresponding workers and remains constant until the subsequent decision cycle. It is noteworthy that the manager abstains from direct involvement with the regional environment; instead, it receives a global reward from the environment to refine its policy.

The worker assumes the responsibility of achieving the designated global goal while enhancing its local traffic efficiency, including metrics such as queue length. After completing the previous stage, it takes an action based on its local observation and the goal. Since the manager executes every T time steps, the goal that the worker receives from the manager is updated every T time steps.

SHLight employs MADDPG at both manager and worker levels, structured under a Centralized Training with Decentralized Execution (CTDE) framework to realize the cooperation across different regions and intersections. The detailed designs of the manager and workers are described in the following sub-sections.

For challenge 1, the introduction of the sample selection module allows each manager to select samples that match the current environment as closely as possible. For challenge 2, we include the neighboring features extracted by the neighboring feature extraction module in the worker's state and propose a dual actor-critic network structure for the manager, so that each agent has more information regarding the regional traffic distribution and cooperative strategy.

Hierarchy

Manager

To enhance the observability of the MADDPG network for each manager, we propose a dual actor-critic network structure, that is, we add an additional enhanced actor-critic network to the basic actor-critic network. The basic actor-critic network is used for macro-control of the region, while the enhanced actor-critic network is used for micro-control of the region. The architecture of the manager is illustrated in Fig. 3. We begin by elaborating on the state, action, and reward settings for the basic actor-critic network.

Fig. 3.


Manager network structure.

State space The state is represented as s_M = (w_N, w_E, w_S, w_W), where each component w_d (with d ∈ {N, E, S, W}) denotes the traffic wave, a measure of the total number of vehicles on the incoming lane within a 50 m range from the intersection, in the corresponding direction (north, east, south, or west).

Action Space Each manager’s local action corresponds to a potential traffic flow pattern. Four possible configurations of north-south and east-west traffic flows (i.e., north to south, south to north, east to west, and west to east) are examined.

Reward For each manager M, the local reward is defined in terms of two quantities: the number of vehicles arriving at their destination across all lanes within the region during a specific time frame, and the liquidity of traffic flow at all intersections within the region over a designated period. To account for the varied impact of each intersection's traffic status on the surrounding area, we incorporate an attention mechanism into the regional features, allowing the model to focus on the most critical intersections within the region.

In the micro-control part, we obtain fine-grained features from the region, which are first fed into an attention layer. By incorporating the attention layer, the output h is calculated as a weighted sum of the intersection features based on the attention weights, as follows:

h = Attention(f_q(F; W_q), f_k(F; W_k), f_v(F; W_v)),  (1)

where F represents the features (queue length, waiting time, number of vehicles) of all intersections. The functions f_q, f_k, and f_v denote the embedding layers for queries, keys, and values, respectively, with W_q, W_k, W_v being the parameters of the corresponding network layers.

The output h can be further computed as:

h = softmax(QKᵀ / √d_k) V,  (2)

where Q, K, and V are the query, key, and value embeddings from Eq. (1), and d_k is a scaling factor. Subsequently, we feed the output of the attention network into the region-specific model, which is composed of FC layers, to learn the region-specific feature. To enhance policy observability, we design a general model that shares neural networks across all regions, allowing it to learn a general feature. This architecture enables our model to effectively utilize traffic flow information even when traffic density is low. Then, we concatenate the outputs of the region-specific model and the general model as the input to the enhanced actor-critic network, which produces the action a′. The action and reward settings are the same as those of the basic actor-critic network.
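The scaled dot-product attention over per-intersection features can be illustrated in pure Python; the identity projections below stand in for the learned embedding layers f_q, f_k, f_v (an assumption made so the sketch is self-contained):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Minimal scaled dot-product attention: each row of the output is a convex
# combination of the intersection feature vectors, weighted by similarity.
def attention(F):
    """F: list of intersection feature vectors (same length each)."""
    d_k = len(F[0])
    scores = [[sum(q * k for q, k in zip(F[i], F[j])) / math.sqrt(d_k)
               for j in range(len(F))] for i in range(len(F))]
    weights = [softmax(row) for row in scores]
    return [[sum(w * F[j][t] for j, w in enumerate(row)) for t in range(d_k)]
            for row in weights]

# Three intersections, features = (queue length, waiting time, vehicle count).
F = [[4.0, 10.0, 6.0], [1.0, 2.0, 1.0], [3.0, 8.0, 5.0]]
out = attention(F)
```

Because the attention weights are non-negative and sum to one, every output component lies between the minimum and maximum of the corresponding input feature, i.e., the layer re-weights rather than invents regional information.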

In order to find a balance between macro- and micro-control, we concatenate the two actor-critic states and feed them into an autoencoder, and then normalize the resulting embedding as the weights of the two actions. The loss function of the autoencoder, L_AE, is as follows:

L_AE = ||x − x̂||² − λ (r_M + Σ_{i=1}^{n} r_i),  (3)

where x is the concatenated state and x̂ its reconstruction, n is the number of workers in region M, r_M represents the reward of the manager, r_i represents the reward of worker i (which will be given later on), and λ trades off the two terms. The reward term ensures that the autoencoder prioritizes embeddings z that correlate with higher hierarchical rewards. The embedding z of the autoencoder is normalized to represent the relative importance of the macro- and micro-control actions. The weights (w₁, w₂) are derived by minimizing L_AE, which jointly optimizes the reconstruction error ||x − x̂||² and the cumulative rewards of both manager and workers (r_M + Σᵢ rᵢ). Finally, we get the goal g_M of manager M by the following function:

g_M = w₁ · a + w₂ · a′,  (4)

where a and a′ are the actions of the two actor-critics, respectively. Thus, g_M inherently balances global and local objectives by weighting actions proportionally to their contribution to overall efficiency.
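The goal-blending step of Eq. (4) can be sketched as follows; the softmax normalization of the 2-d embedding and the example values are assumptions, not the paper's learned quantities:

```python
import math

# Normalize a 2-d autoencoder embedding into blending weights (one plausible
# normalization choice; the paper only states the embedding is normalized).
def normalize(z):
    es = [math.exp(v) for v in z]
    s = sum(es)
    return [e / s for e in es]

# Eq. (4) sketch: goal = w1 * macro action + w2 * micro action.
def goal(a, a_prime, z):
    w1, w2 = normalize(z)
    return [w1 * x + w2 * y for x, y in zip(a, a_prime)]

g = goal([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
print(g)  # equal embedding -> equal weights -> [0.5, 0.5]
```

With a symmetric embedding, macro- and micro-control contribute equally; a skewed embedding shifts the goal toward whichever level the autoencoder associates with higher reward.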

Worker

For each worker, the specific DRL elements are defined as follows:

State Space The state is represented as s_i = (F_i, F_N), where F_i is an array comprising the queue length L, waiting time W, and number of vehicles V at the central node, and F_N represents the corresponding features of its 1-hop neighbors, obtained by the neighboring feature extraction module.

Action Space For each worker, the agent determines an action by selecting a discrete number p and a continuous number d, where p represents the phase index, and d is the duration of the subsequent phase.
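This hybrid discrete-continuous action can be represented as a small record; the phase count and duration bounds below are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass

@dataclass
class WorkerAction:
    phase: int        # index into the intersection's phase set
    duration: float   # seconds the chosen phase will run

# Wrap raw network outputs into a valid action: the phase index is folded
# into the phase set and the duration clipped to an assumed feasible range.
def make_action(raw_phase: int, raw_duration: float,
                n_phases: int = 4, d_min: float = 5.0,
                d_max: float = 60.0) -> WorkerAction:
    return WorkerAction(phase=raw_phase % n_phases,
                        duration=min(max(raw_duration, d_min), d_max))

a = make_action(5, 72.3)
print(a)  # WorkerAction(phase=1, duration=60.0)
```

Clipping the continuous duration is one common way to keep a continuous-control output within signal-timing constraints.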

Reward For each worker i, the local reward is defined as a weighted sum of several factors that capture traffic congestion and trip delay:

r_i = w₁D + w₂L + w₃W + w₄N + w₅T + w₆S

To achieve efficient traffic light control, we carefully designed the reward function to balance the following factors:

  • D: sum of delays for all vehicles entering the intersection, where the delay for lane i is defined as D_i = 1 − (lane speed)/(speed limit)6.

  • L: queue length of all lanes entering the intersection.

  • W: sum of the waiting times for all vehicles in the entering lanes of the intersection.

  • N: number of vehicles passing through an intersection during the phase.

  • T: average travel time of all vehicles passing through an intersection during the phase.

  • S: average speed of all vehicles in the entering lanes.

By incorporating these factors, our reward function aims to optimize traffic light control and minimize congestion. To preserve the interpretability of the reward function, instead of applying dimensionality reduction to it, we use direct linear weighting, which explicitly reflects the priority of each objective through its weight.
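The linearly weighted reward can be sketched directly; the coefficient values and the sign convention (penalizing delay, queues, waiting, and travel time; rewarding throughput and speed) are illustrative assumptions, since the paper does not publish its exact weights here:

```python
# Sketch of the worker's linearly weighted reward over the six factors
# D, L, W, N, T, S defined in the text. Weights w are made-up examples.
def worker_reward(D, L, W, N, T, S,
                  w=(0.25, 0.25, 0.25, 1.0, 0.5, 0.5)):
    wD, wL, wW, wN, wT, wS = w
    # Congestion/delay terms enter negatively; throughput and speed positively.
    return -wD * D - wL * L - wW * W + wN * N - wT * T + wS * S

r = worker_reward(D=2.0, L=4.0, W=10.0, N=12, T=30.0, S=8.0)
print(r)  # -3.0 under the example weights
```

Because the weighting is linear, the relative priority of each objective can be read off directly from its coefficient, which is the interpretability argument made in the text.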

Sample selection

Owing to the non-stationarity of HRL, the goal stored in a past sample may deviate from the real environment. To solve this problem, we select samples according to their importance, which depends on the number of times a sample has been drawn, the M-W distance, and the M-M distance. Figure 4 shows the process of sample selection. In this module, we feed each sample in the manager replay buffer into an MLP network, whose output is the sample importance. Before calculating the sample importance, we first obtain a sample prediction. Specifically, we use an autoencoder to obtain a sample embedding vector for each manager. For the workers within the region, we use FC layers to map their samples to the sample embeddings of the related manager. Then, we feed the sample embedding vectors of the workers within the region of manager M over a window of recent time steps into an LSTM to predict the next sample embedding.

Fig. 4.


Process of sample selection.

After obtaining the sample prediction, the label for sample i can be computed using a function f, demonstrated as follows:

y_i = f(N_i, d_MM, d_MW),  (5)

where N_i is the number of sampling iterations, d_MM is the normalized Euclidean distance of the sample embedding vector for manager M, computed by a normalization function, and d_MW is the normalized distance between sample i and the real environment. To calculate d_MW, we first need to obtain the target state, which represents the extremes of traffic flow change. Table 1 shows the mapping relationship between the goal and the target state. For instance, in Fig. 4, the goal is to generate a traffic flow from north to south, which means that only the light in the gray lane is green; the state of the gray lane then takes its extreme value while the other lanes take theirs. To obtain the state of each lane, we consider the features of the number of vehicles, waiting time, and queue length, and the target state is assembled lane by lane according to the clockwise lane-index rule (shown in Fig. 1).

Table 1.

Goal-target state mapping.

Goal Target state
Inline graphic [Inline graphic]
Inline graphic [Inline graphic]
Inline graphic [Inline graphic]
Inline graphic [Inline graphic]

Next, we calculate the sample bias, which represents the difference between the samples in the replay buffer and the real environment. The sample bias is calculated as

b = Σᵢ ||s′ᵢ − ŝ′ᵢ||,  (6)

where i represents the i-th intersection, s′ᵢ represents the real next state, and ŝ′ᵢ represents the next state in the replay buffer. Thus, we can obtain the normalized sample distance d_MW by normalizing the bias over the buffer, which is shown as

d_MW = b / max_j b_j,  (7)

where j ranges over the samples in the buffer. After obtaining the sample importance, the sampling probability for each sample is calculated by normalization, which is shown as

p_i = I_i / Σ_j I_j,  (8)

where i represents the i-th sample and I_i its importance, and we select the samples with the maximum probability for training.
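The bias and normalization steps of Eqs. (6)–(8) can be sketched as follows; the Euclidean reading of the per-intersection distance is an assumption consistent with the text:

```python
import math

# Eq. (6) sketch: bias of a stored sample vs. the real environment, summed
# over intersections, using Euclidean distance between next-state vectors.
def sample_bias(real_next, buffered_next):
    return sum(math.dist(r, b) for r, b in zip(real_next, buffered_next))

# Eq. (8) sketch: turn raw importance scores into sampling probabilities.
def sampling_probs(importances):
    s = sum(importances)
    return [v / s for v in importances]

b = sample_bias([[1.0, 2.0]], [[1.0, 2.0]])   # identical states -> zero bias
p = sampling_probs([2.0, 1.0, 1.0])
print(b, p)  # 0.0 [0.5, 0.25, 0.25]
```

A sample whose stored next state matches the real environment has zero bias and therefore a small d_MW, so it is more likely to be kept for training the manager.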

Neighboring feature extraction

To enhance the interconnection between upstream and downstream intersections, we devised a network structure called STNet, as depicted in Fig. 5. STNet takes a graph G as input, which represents the central node along with its 1-hop neighbors. This graph G undergoes processing via a GCN network to extract spatial topology information. Simultaneously, the node feature, formed by concatenating the features of the 5 nodes within the graph G, is passed through an FC layer to derive node feature embeddings. The resulting outputs from the GCN and FC layers are concatenated to form a comprehensive feature vector, which is then fed into a GRU network to capture temporal information, yielding the final neighboring feature representation. By integrating GCN and GRU, STNet captures both spatial topology information and vertical time series patterns, modeling the trend of traffic flow at each intersection. Moreover, the combination of FC layers and GRU enables the model to capture horizontal time series patterns, reflecting the trend of traffic flow between intersections, thereby facilitating a comprehensive understanding of traffic dynamics.

Fig. 5.


STNet structure.

Offline training

Algorithm 1.


SHLight Traffic Signal Control

Algorithm 2.


SampleSelection(SS)

In this section, we describe the multi-agent offline training method employed in our framework. Since the agents require global information from other junctions during training, but only local information is needed during execution, we opted for an offline multi-agent training approach.

Each manager produces an action a_M based on its policy μ_M, where θ_M represents the parameters of the network. The Q value is approximated as Q_mac(x, a_1, …, a_N; φ_mac) in macro-control, and as Q_mic(x, a_1, …, a_N; φ_mic) in micro-control, where φ_mac and φ_mic represent the parameters of the two critics in the manager. Therefore, the policy gradient for macro-control can be formed as:

∇_{θ_M} J(μ_M) = E_{x,a∼D} [ ∇_{θ_M} μ_M(o_M) ∇_{a_M} Q_mac(x, a_1, …, a_N) |_{a_M = μ_M(o_M)} ],  (9)

where D is the replay buffer and o_M is the manager's local observation.

The policy gradient for micro-control is similar to that of macro-control, so we do not give its formula here.
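Since Eq. (9) is rendered only as an image in this version, a common instantiation of what a two-critic deterministic policy gradient looks like can be sketched numerically. This is an assumption about the form (a TD3-style pessimistic minimum over the two critics); the linear policy, quadratic toy critics, and all dimensions below are illustrative, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(1)
ds, da = 4, 2                                  # state / action dims (assumed)
theta = rng.normal(scale=0.1, size=(ds, da))   # deterministic policy mu(s) = s @ theta
C1 = rng.normal(size=(ds, da))                 # each toy critic prefers a
C2 = rng.normal(size=(ds, da))                 # state-dependent target action

def q(C, s, a):
    """Toy critic: negative squared distance to the target action s @ C."""
    return -np.sum((a - s @ C) ** 2)

states = rng.normal(size=(256, ds))            # batch of sampled states

def avg_min_q(th):
    """Average pessimistic (minimum) critic value over the batch."""
    return float(np.mean([min(q(C1, s, s @ th), q(C2, s, s @ th)) for s in states]))

def policy_gradient(s, th):
    """Deterministic policy gradient through the currently smaller critic."""
    a = s @ th
    C = C1 if q(C1, s, a) <= q(C2, s, a) else C2   # min of the two critics
    dq_da = -2.0 * (a - s @ C)                     # gradient of Q w.r.t. the action
    return np.outer(s, dq_da)                      # chain rule through mu(s) = s @ theta

before = avg_min_q(theta)
for _ in range(200):                               # gradient-ascent updates
    theta += 0.01 * np.mean([policy_gradient(s, theta) for s in states], axis=0)
after = avg_min_q(theta)
print(before < after)   # True: the pessimistic critic value improves
```

The same update shape applies to the micro-control and worker gradients, with the policy inputs and action spaces changed accordingly.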

Each worker inputs the state Inline graphic into the actor network. Unlike the model proposed in31, which outputs a phase every 5 s, our redesigned actor network outputs both the current phase and the transition duration before entering the subsequent phase. This enables continuous signal control, allowing more efficient management of traffic flow and improved traffic distribution. The policy gradient is:

graphic file with name d33e1314.gif 10

Finally, the pseudo-code is presented in Alg. 1. The algorithm initializes the parameters Inline graphic for the manager macro-control policy Inline graphic, Inline graphic for the manager micro-control policy Inline graphic, and Inline graphic for the worker policies Inline graphic. Within the two for loops, the first stage (Lines 3-17) uses actions sampled from the actor network Inline graphic to interact with the environment, storing the resulting experiences in the replay buffer. In the second stage (Lines 18-30), the hierarchical structure is trained jointly to optimize regional traffic efficiency. The SS() function is detailed in Alg. 2.
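The two-stage structure just described, experience collection into a replay buffer followed by training on importance-selected samples, can be outlined as follows. The importance score used here (magnitude of a stored TD error) is a stand-in for the paper's LSTM/autoencoder-based SS() module, and all names and sizes are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=500):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample_selected(self, batch_size, importance_fn):
        """SS() stand-in: rank stored transitions by importance, train on the top batch."""
        ranked = sorted(self.buffer, key=importance_fn, reverse=True)
        return ranked[:batch_size]

def importance(transition):
    # Stand-in sample-importance score: |TD error| of the transition.
    # The paper's actual SS module scores samples with an LSTM and autoencoder.
    state, action, reward, next_state, td = transition
    return abs(td)

buf = ReplayBuffer()
# Stage 1: collect experience (dummy transitions here instead of SUMO rollouts).
for t in range(100):
    buf.add((f"s{t}", f"a{t}", random.random(), f"s{t+1}", random.gauss(0, 1)))

# Stage 2: joint training on the selected batch (the actual update step is omitted).
batch = buf.sample_selected(64, importance)
print(len(batch), abs(batch[0][-1]) >= abs(batch[-1][-1]))   # 64 True
```

In the full algorithm, the selected batch would feed the manager and worker actor-critic updates of Eqs. (9)-(10).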

Performance evaluation

In this section, we evaluate the performance of the proposed HRL-based TSCS using the widely used traffic simulator SUMO on a synthetic road network and two real-world road networks. We evaluated SHLight across various road network scales using three metrics: (1) queue length, (2) waiting time, and (3) delay. Additionally, we conducted a comprehensive quantitative evaluation and compared its performance with that of other approaches.

Datasets

Synthetic Data Figure 6 shows the synthetic road network, and Table 2 lists the configurations employed in our experiments. The network comprises 25 nodes, all four-way intersections with varying traffic densities. As shown in Table 2, the synthesized traffic includes both peak and smooth flows. The peak flow originates from Inline graphic and terminates at Inline graphic, traversing the road network with a traffic density that fluctuates over time, thereby mimicking realistic traffic. Configurations 1-3 correspond to peak traffic densities of 400, 900, and 1400 cars/hour, respectively. The directions in Table 2 indicate the start and end points of each flow, whereas density represents the vehicle generation rate: higher values mean more vehicles produced at the source, with arrivals generated from a Poisson distribution with the corresponding rate. Finally, the start and end times produce the diverse traffic densities.

Fig. 6. Synthetic road network.

Table 2. Synthetic flow.

Config Directions Density (cars/hour) Start time (s) End time (s)
1 OD1 - OD10 200 0 600
400 600 3000
OD19 - OD9 200 0 3000
OD17 - OD7 200 0 3000
OD5 - OD15 200 0 3000
OD3 - OD13 200 0 3000
2 OD1 - OD10 200 0 600
400 600 1200
900 1200 3000
OD19 - OD9 200 0 3000
OD17 - OD7 200 0 3000
OD5 - OD15 200 0 3000
OD3 - OD13 200 0 3000
3 OD1 - OD10 200 0 600
400 600 1200
900 1200 1800
1400 1800 3000
OD19 - OD9 200 0 3000
OD17 - OD7 200 0 3000
OD5 - OD15 200 0 3000
OD3 - OD13 200 0 3000
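The vehicle generation described above, a Poisson arrival process whose rate changes between the listed start and end times, can be reproduced with a short script. The piecewise rates below copy Config-1's OD1 - OD10 flow from Table 2; the sampling itself is a generic Poisson-process simulation, not SUMO's own trip generator.

```python
import numpy as np

rng = np.random.default_rng(42)

def poisson_arrivals(rate_per_hour, t_start, t_end):
    """Sample vehicle departure times in [t_start, t_end) at a fixed Poisson rate."""
    rate_per_sec = rate_per_hour / 3600.0
    n = rng.poisson(rate_per_sec * (t_end - t_start))
    # Conditioned on the count, Poisson arrival times are uniform on the interval.
    return np.sort(rng.uniform(t_start, t_end, size=n))

# Config-1, flow OD1 -> OD10: 200 cars/h for 0-600 s, then 400 cars/h for 600-3000 s.
segments = [(200, 0, 600), (400, 600, 3000)]
arrivals = np.concatenate([poisson_arrivals(*seg) for seg in segments])
print(len(arrivals))   # roughly 300 vehicles in expectation
```

Each arrival time would then become a vehicle departure in the SUMO route file for the corresponding OD pair.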

Real-world Data We also used real-world traffic data from two cities: Shanghai, China, and Ingolstadt, Germany. The road networks for both cities were imported from OpenStreetMap (Fig. 7) and converted into simulation-ready formats using SUMO’s NetConvert tool. The two datasets are described as follows:

  • Wujiaochang We deployed SHLight across 26 intersections in this densely populated urban area. The traffic flow data was derived from surveillance camera footage, capturing dynamic congestion patterns from 6:00 to 23:00 on October 12, 2021 (Fig. 8). This dataset reflects realistic traffic density variations, including peak-hour congestion and smooth-flow periods, enabling robust online evaluation of our model.

  • Ingolstadt We verified SHLight on 19 intersections within the InTAS scenario, a widely adopted SUMO benchmark for urban traffic studies32. This scenario features a congested downtown zone with multiple signalized intersections, providing a standardized environment to assess our method’s performance under well-documented traffic conditions.

Fig. 7. Real road networks.

Fig. 8. Real traffic flow of Wujiaochang on weekdays (downloaded from http://www.jtcx.sh.cn/trafficindex.html).

Baselines and metrics

We selected the following baseline models for our comparative experiments:

  • FT: Fixed-time (FT) control is a basic traffic signal control method wherein the signal cycle and the duration of each phase are predefined (e.g., 30 s).

  • DQN: Deep Q-network (DQN) is a value-based RL method that decides whether to change the phase every 5 s.

  • DDPG: Deep deterministic policy gradient (DDPG) is a policy-based RL method that outputs the duration of each phase in a pre-defined cycle.

  • FMA2C: Feudal multi-agent advantage actor-critic (FMA2C) is an extension of MA2C33 with a feudal hierarchy, wherein the duration of each phase is fixed.

  • MAHPPO: Multi-agent hierarchical proximal policy optimization (MAHPPO) establishes a hierarchical structure through sub-task decomposition for zero-shot transfer.

Additionally, we used the following metrics for the performance evaluation:

  • Inline graphic: average queue length over time, where the queue length at time t is denoted as L in Subsection Worker. A smaller queue length indicates fewer waiting vehicles in all lanes.

  • Inline graphic: average waiting time over time, where the waiting time at time t is denoted as W (defined in Subsection Worker).

  • Delay: average delay over time, where the delay at time t is denoted as D in Subsection Worker. A lower delay indicates higher speeds for all lanes.
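Given per-step measurements of the queue length L, waiting time W, and delay D defined in Subsection Worker, the three reported metrics are simple time averages. A minimal example with hypothetical per-step values:

```python
import numpy as np

# Hypothetical per-timestep measurements from one simulation episode.
L = np.array([10.0, 25.0, 40.0, 15.0])   # queue length (m) at each step
W = np.array([2.0, 8.0, 12.0, 4.0])      # waiting time (s) at each step
D = np.array([0.1, 0.5, 0.9, 0.3])       # delay at each step

avg_queue, avg_wait, avg_delay = L.mean(), W.mean(), D.mean()
print(avg_queue, avg_wait, avg_delay)    # 22.5 6.5 0.45
```

In the experiments these averages are additionally taken over all intersections in the network.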

Experiment settings

The environment is a simulated four-way intersection, with each road having six lanes: three entering and three leaving the intersection. Table 3 lists the main simulation parameters. The phase duration is a continuous value between 30 and 90 s. The speed limit of all lanes is 20 m/s, and the maximum vehicle speed is set to 15 m/s. Detectors are placed on the in-lanes of each node, with a detection range of 50 m and a detection frequency of 60 samples per minute. The deep network is trained with a replay buffer of size 500, a batch size of 64, a discount factor of 0.9, an exploration factor of 0.05, and a learning rate of 0.0001. We ran about 10 evaluation episodes for all methods and plot the average results in the following sections.

Table 3. Parameters for experiment.

Parameter Value
Duration range 30-90 s
Max lane speed 20 m/s
Max speed of car 15 m/s
Detector frequency 60 per minute
Detector range 50 m
Replay size 500
Batch size 64
Discount factor Inline graphic 0.9
Exploration factor Inline graphic 0.05
Learning rate Inline graphic 0.001

Results and discussion

In this sub-section, we evaluate the performance of our method, SHLight, against the baseline approaches in varied traffic environments and conduct ablation experiments to validate the effectiveness of each module. Figure 9 shows the training curves of the different methods, where the blue line represents the proposed method and the other five lines represent the baselines. FT exhibits relatively flat performance, whereas the reward values of DQN and DDPG increase with the number of training episodes, settling into a narrower fluctuation range after about 200 episodes. FMA2C achieves better results than DQN and DDPG but converges slowly. MAHPPO converges the fastest; however, SHLight achieves the best results: once training reaches about 1000 steps, our method converges and its reward exceeds that of all baselines.

Fig. 9. Training process of the reinforcement learning methods.

Effects of synthetic configurations

We evaluated the performance of SHLight and the baseline methods for the various synthetic configurations and compared their efficiencies using multiple metrics. The results are presented in Table 4, where the first column gives the configuration indices listed in Table 2. The queue length, waiting time, and delay used to measure efficiency are averaged over all intersections. In Configuration 1, where the traffic density is low, SHLight outperformed FMA2C; however, owing to its transferability, MAHPPO performed slightly better than SHLight. For Configuration 2, SHLight performed the best across all metrics, indicating that as traffic density increases, SHLight yields substantial reductions in queue length, waiting time, and delay compared with the baselines. Finally, for Configuration 3, SHLight achieved improvements of 25.7%, 32.7%, and 9.2% over FMA2C in queue length, waiting time, and delay, respectively. The enhanced performance of SHLight stems from its innovative hierarchical reinforcement learning architecture. Unlike other hierarchical methods in which the high-level agent relies solely on coarse-grained global information, resulting in one-way (top-down only) control, SHLight integrates fine-grained local information from the low-level agents into its high-level decision-making. At the same time, the introduction of the yellow light and the use of a continuous action space give SHLight more precise signal control than the other methods, making traffic flow more efficient.

Table 4. Model performances for the synthetic data.

Conf. Model Queue Len. (m) Waiting times (s) Delay
1 FT 58.60 4.52 1.35
DQN 48.03 3.01 0.04
DDPG 39.60 0.88 0.03
FMA2C 43.21 2.87 0.06
MAHPPO 35.47 0.85 0.03
SHLight 36.00 1.39 0.03
2 FT 234.60 146.05 1.74
DQN 142.04 134.43 1.39
DDPG 121.20 116.44 0.92
FMA2C 111.39 107.76 0.87
MAHPPO 102.54 96.22 0.82
SHLight 98.25 81.98 0.79
3 FT 573.93 448.59 4.59
DQN 393.22 389.48 3.44
DDPG 338.07 375.66 3.01
FMA2C 246.09 297.65 2.29
MAHPPO 223.68 230.63 2.24
SHLight 182.73 200.29 2.08

Effects of region division

We evaluated the impact of the number of regions on model performance. As shown in Fig. 10, we varied the number of regions over the set Inline graphic. The best divisions for the synthetic and Wujiaochang road networks are 5 and 6 regions, respectively. Comparing the two networks, the number of regions has only a small impact on the synthetic road network but a larger impact on the Wujiaochang road network. Moreover, the model performs better with fewer regions on the synthetic network but with more regions on the Wujiaochang network. This is because when traffic conditions are relatively simple, dividing the network into larger regions helps the model make full use of global information, whereas when traffic conditions are complex, a finer division of regions helps the model improve its performance.
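The region division itself can be produced by any balanced graph-partitioning routine (the METIS system30 is the standard tool). As a self-contained stand-in, a k-means-style clustering of intersection coordinates already yields contiguous, roughly balanced regions on the 5 × 5 grid; this sketch is illustrative and is not the paper's partitioner.

```python
import numpy as np

def divide_regions(coords, k, iters=50, seed=0):
    """Cluster intersection coordinates into k regions (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centers = coords[rng.choice(len(coords), size=k, replace=False)]
    for _ in range(iters):
        # Assign each intersection to its nearest region center.
        labels = np.argmin(((coords[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = coords[labels == j].mean(axis=0)
    return labels

# The 25 intersections of the 5x5 synthetic grid, one manager per region.
coords = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
labels = divide_regions(coords, k=5)
print(np.bincount(labels, minlength=5))   # region sizes summing to 25
```

Each resulting label set defines the workers supervised by one manager agent.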

Fig. 10. Queue lengths and waiting times with different numbers of regions for the two datasets: (a) the 5 × 5 synthetic road network and (b) the Wujiaochang road network.

Effects of sample importance, macro- and micro-control

We evaluated the impact of sample importance and of the manager's macro- and micro-control by removing each in turn, conducting experiments on the synthetic and Wujiaochang road networks. In SHLight (micro), the managers use only micro-control, while in SHLight (macro) they use only macro-control. To ensure comparability between the synthetic and Wujiaochang road networks, we selected a synthetic traffic configuration with a density similar to that of the Wujiaochang network (i.e., Config-2). As shown in Fig. 11, all of these modules improve the performance of the model to some extent. SHLight (micro) yields the worst results in most traffic scenarios except Config-1, revealing a redundancy between micro-control states and worker states. This redundancy arises because the input of micro-control is constructed by concatenating the states of all workers within the region: since each worker already operates on its local state, the micro-control layer essentially reprocesses information already available to individual workers, offering little new insight for regional coordination. The superior performance of SHLight (micro) in Config-1 can be attributed to the higher relevance of fine-grained micro-level information under low-traffic conditions. Moreover, the comparison between SHLight w/o SI and SHLight demonstrates that sample selection consistently enhances model performance across all traffic densities. Unlike macro- and micro-control, the effectiveness of sample importance (SI) is not influenced by traffic flow variations, underscoring its robustness in diverse traffic scenarios. Table 5 shows the ablation of the LSTM and autoencoder (AE) within the sample selection module on synthetic data with Config-2; both components improve model performance.
Additionally, comparing the results of Config-2 and the Wujiaochang road network, which have similar traffic densities, the improvement is more pronounced on the Wujiaochang network, showing that our approach is more advantageous under complex traffic conditions.

Fig. 11. Effect of sample importance and assistant.

Table 5. Ablation study of sample importance on synthetic data.

Model Queue Len. (m) Waiting times (s) Delay
SHLight w/o SI 99.07 95.44 0.82
SHLight (SI w/o LSTM) 99.00 92.80 0.79
SHLight (SI w/o AE) 98.79 87.46 0.81
SHLight 98.25 81.98 0.79

Performance for real-world data

Our comparative analysis of SHLight and the baseline methods on the real-world datasets demonstrates SHLight’s superior performance across key traffic metrics. As summarized in Table 6, SHLight achieved the lowest queue lengths and waiting times in both scenarios. Although the delay values in the Ingolstadt scenario differed only minimally between methods (with SHLight not achieving the absolute lowest value), SHLight maintained a consistent advantage when all evaluation metrics are considered collectively, highlighting its robustness in diverse urban traffic conditions. To examine traffic trends and compare the results, Wujiaochang data were collected every 10 min from 6:00 a.m. to 10:30 p.m. for each intersection in SUMO. Figure 12 shows the cumulative queue length results of the Wujiaochang area for all models, where the blue line represents the proposed algorithm and the other lines represent the baselines. The proposed control algorithm clearly outperformed the others. During the early morning hours, queue lengths did not differ significantly among the control methods because traffic was minimal. However, from the first traffic increase at 8:00 a.m. until the afternoon, the FT method was clearly inferior to the other intelligent strategies. At 4:00 p.m., the peak flow period, the proposed method produced smoother traffic flow changes and reduced the queue length to a normal level more rapidly than the other methods. Figure 13 shows the cumulative waiting times under the different methods for the Wujiaochang data. The overall trend is similar to that in Fig. 12, verifying the positive correlation between the two metrics. Before the arrival of peak traffic, the waiting times differed negligibly among all methods.
However, during the peak period, the waiting time under the FT method increased sharply, indicating that some vehicles stopped twice at a single intersection. Finally, the cumulative delay results are shown in Fig. 14. After the peak traffic period, the cumulative delay under the proposed method returns to the normal range more quickly than under the baselines, indicating that traffic flow resumed normal operation in a shorter time. Additionally, from 8:00 a.m. to 10:00 a.m., i.e., during the morning rush hour, only our method shows a clearly visible small peak, indicating that it makes full use of traffic flow change information. Overall, the proposed method demonstrates superior control not only during low-traffic periods but also during peak and changing flows. By leveraging the DRL component and obtaining, via the HRL structure, the global traffic information of the region in which each intersection is located, the proposed method ensures more efficient traffic flow at all intersections.

Table 6. Model performances for real-world data.

Model Wujiaochang Ingolstadt
Queue (m) Wait (s) Delay Queue (m) Wait (s) Delay
FT 370.33 344.65 3.14 182.44 119.81 1.56
DQN 231.88 219.92 2.31 139.09 94.75 0.80
DDPG 188.78 160.06 2.01 115.34 90.47 0.61
FMA2C 169.34 137.67 1.22 96.88 87.62 0.59
MAHPPO 153.26 121.95 1.23 79.36 71.55 0.56
SHLight 144.12 115.87 1.19 78.51 60.49 0.57
Fig. 12. Cumulative queue length results for the proposed and baseline methods in the Wujiaochang area.

Fig. 13. Cumulative waiting time results for the proposed and baseline methods in the Wujiaochang area.

Fig. 14. Cumulative delay results for the proposed and baseline methods in the Wujiaochang area.

Effects of traffic capacity

Typically, the performance difference between methods grows with traffic density. However, in our experiments the advantage of our method did not consistently grow as density increased. For example, the traffic densities of Config-2, the Wujiaochang road network, and Config-3 are progressively higher, yet relative to MAHPPO the waiting time of SHLight improved by 11.97%, 4.99%, and 7.90%, respectively, contrary to our initial expectation. We therefore investigated the impact of traffic capacity on model performance. As shown in Fig. 15, we set the number of lanes in the road network to 2, 3, and 4, respectively. The y-axis represents the percentage improvement of our method over MAHPPO. The results indicate that increasing the traffic capacity of the road network widens the performance gap between our method and MAHPPO, highlighting the advantage of our approach. Additionally, as the flow capacity increases, the improvement across the different metrics becomes more consistent.
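One natural reading of such relative figures is the percentage reduction with respect to MAHPPO; for instance, the Wujiaochang waiting-time entry of Table 6 works out as:

```python
# Percentage reduction of SHLight's waiting time relative to MAHPPO,
# using the Wujiaochang column of Table 6.
mahppo_wait = 121.95
shlight_wait = 115.87
improvement = (mahppo_wait - shlight_wait) / mahppo_wait * 100
print(f"{improvement:.2f}%")   # 4.99%
```

The same formula applied to the other columns of Tables 4 and 6 gives each bar of Fig. 15.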

Fig. 15. Effect of traffic capacity.

Effects of adaptability on heterogeneous intersections

To demonstrate adaptability and performance in environments with highly heterogeneous intersection geometries, we retained all heterogeneous intersections on the downloaded Wujiaochang map (Fig. 7), yielding 47 intersections. Because the road network in the Wujiaochang area is highly complex (including T-junctions and five-way intersections), it suffices to verify SHLight's adaptability to highly heterogeneous intersection geometries. Specifically, we transform heterogeneous intersections into regular intersections through three operations: padding, splitting, and merging. We then employ a phase mask vector to represent the action space of agents at intersections with different structures (detailed in34). As shown in Table 7, SHLight performs slightly worse on the heterogeneous road network than on the regular one, for two reasons: first, the road network becomes more complex in both structure and number of intersections, increasing the difficulty of traffic light collaboration; second, the intersection features obtained after transforming heterogeneous intersections cannot fully represent the real environment. Nevertheless, SHLight still outperforms the baselines evaluated on the regular road networks, demonstrating its adaptability to environments with highly heterogeneous intersection geometries.
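One common way to apply such a phase mask, shown here as a sketch rather than the exact formulation of34, is to set the scores of phases that are invalid for a given geometry to negative infinity before taking the argmax. The 4-phase action space and the T-junction mask below are illustrative assumptions.

```python
import numpy as np

def masked_phase_choice(q_values, phase_mask):
    """Pick the best phase among those valid for this intersection geometry."""
    masked = np.where(phase_mask.astype(bool), q_values, -np.inf)
    return int(np.argmax(masked))

q_values = np.array([0.7, 1.2, 0.3, 0.9])   # agent's scores for 4 candidate phases
mask_t_junction = np.array([1, 0, 1, 1])    # e.g., a T-junction lacking phase 1 (assumed)
print(masked_phase_choice(q_values, mask_t_junction))   # 3
```

With the mask all-ones (a regular four-way intersection), the same function simply returns the highest-scoring phase.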

Table 7. Adaptability study on heterogeneous intersections.

Structure Queue Len. (m) Waiting times (s) Delay
Regular 144.12 115.87 1.19
Heterogeneous 148.93 117.09 1.21

Conclusion

In this paper, we proposed SHLight, an innovative HRL approach that tackles the problem of global coordination in traffic signal control through a hierarchical framework. By dividing the traffic network into regions overseen by manager agents, with traffic lights controlled by worker agents, we constructed a framework in which the manager optimizes global coordination objectives and assigns sub-goals to the workers within its region. Through collaboration within each region, workers satisfy their individual local targets while also achieving the goals set by the manager. To address the non-stationarity problem in HRL, we designed a method that computes the importance of samples and selects samples accordingly. Additionally, building on the basic actor-critic framework, we added another actor-critic network to obtain auxiliary actions for the agents, improving the observability of the model. Experiments on both synthetic and real-world datasets demonstrate that SHLight provides adaptive control over global road networks and significantly outperforms state-of-the-art RL methods.

In future work, we will extend our approach to real-world settings involving factors such as traffic signal malfunctions, accidents, and external events, and we will focus on improving the robustness of our model.

Author contributions

Conceptualization, J.S.; methodology, J.S.; validation, J.S.; writing-original draft preparation, J.S.; writing-review & editing, J.S.; supervision, J.S.

Data availability

The authors confirm that the data supporting the findings of this study are available within the article.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Abdoos, M., Mozayani, N. & Bazzan, A. L. C. Holonic multi-agent system for traffic signals control. Eng. Appl. Artif. Intell.26, 1575–1587 (2013). [Google Scholar]
  • 2.Di, X. et al. Traffic congestion prediction by spatiotemporal propagation patterns. In 20th IEEE International Conference on Mobile Data Management, MDM 2019, Hong Kong, SAR, China, June 10-13, 2019 298–303 (2019).
  • 3.Liang, X., Du, X., Wang, G. & Han, Z. A deep reinforcement learning network for traffic light cycle control. IEEE Trans. Veh. Technol.68, 1243–1253 (2019). [Google Scholar]
  • 4.Wei, W. & Zhang, Y. FL-FN based traffic signal control. In Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE’02, Honolulu, Hawaii, USA, May 12–17, 2002 296–300 (2002).
  • 5.Wei, H. et al. Colight: Learning network-level cooperation for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3–7, 2019 1913–1922 (2019).
  • 6.Wei, H., Zheng, G., Yao, H. & Li, Z. Intellilight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018 2496–2505 (2018).
  • 7.Vezhnevets, A. S. et al. Feudal networks for hierarchical reinforcement learning. In Precup, D. & Teh, Y. W. (eds.) Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, vol. 70 of Proceedings of Machine Learning Research 3540–3549 (PMLR, 2017).
  • 8.Nachum, O., Gu, S., Lee, H. & Levine, S. Data-efficient hierarchical reinforcement learning (2018). arXiv:1805.08296.
  • 9.Xu, B., Wang, Y., Wang, Z., Jia, H. & Lu, Z. Hierarchically and cooperatively learning traffic signal control. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021 669–677 (2021).
  • 10.Ma, J. & Wu, F. Feudal multi-agent deep reinforcement learning for traffic signal control. In Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’20, Auckland, New Zealand, May 9-13, 2020 816–824 (2020).
  • 11.Casas, N. Deep deterministic policy gradient for urban traffic light control. CoRRabs/1703.09035 (2017).
  • 12.El-Tantawy, S., Abdulhai, B. & Abdelgawad, H. Design of reinforcement learning parameters for seamless application of adaptive traffic signal control. J. Intell. Transp. Syst.18, 227–245 (2014). [Google Scholar]
  • 13.Li, L., Lv, Y. & Wang, F.-Y. Traffic signal timing via deep reinforcement learning. IEEE/CAA J. Autom. Sin.3, 247–254 (2016). [Google Scholar]
  • 14.Kaelbling, L. P., Littman, M. L. & Moore, A. W. Reinforcement learning: A survey. J. Artif. Intell. Res.4, 237–285 (1996). [Google Scholar]
  • 15.Wiering, M. A. Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML’2000) 1151–1158 (2000).
  • 16.Zhou, R., Nousch, T., Wei, L. & Wang, M. Constrained traffic signal control under competing public transport priority requests via safe reinforcement learning. Expert Syst. Appl.284, 127676. 10.1016/J.ESWA.2025.127676 (2025). [Google Scholar]
  • 17.Abdulhai, B., Pringle, R. & Karakoulas, G. J. Reinforcement learning for true adaptive traffic signal control. J. Transp. Eng.129, 278–285 (2003). [Google Scholar]
  • 18.Prashanth, L. A. & Bhatnagar, S. Reinforcement learning with function approximation for traffic signal control. IEEE Trans. Intell. Transp. Syst.12, 412–421 (2011). [Google Scholar]
  • 19.Mannion, P., Duggan, J. & Howley, E. Parallel reinforcement learning for traffic signal control. In Proceedings of the 6th International Conference on Ambient Systems, Networks and Technologies (ANT 2015), the 5th International Conference on Sustainable Energy Information Technology (SEIT-2015), London, UK, June 2–5, 2015 vol. 52, 956–961 (2015).
  • 20.Zhang, Y., Zhou, Y. & Fujita, H. Distributed multi-agent reinforcement learning for cooperative low-carbon control of traffic network flow using cloud-based parallel optimization. IEEE Trans. Intell. Transp. Syst.25, 20715–20728. 10.1109/TITS.2024.3452430 (2024). [Google Scholar]
  • 21.Bie, Y., Ji, Y. & Ma, D. Multi-agent deep reinforcement learning collaborative traffic signal control method considering intersection heterogeneity. Transp. Res. Part C: Emerg. Technol.164, 104663. 10.1016/j.trc.2024.104663 (2024). [Google Scholar]
  • 22.Zeng, J. et al. Halight: Hierarchical deep reinforcement learning for cooperative arterial traffic signal control with cycle strategy. In 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC) 479–485. 10.1109/ITSC55140.2022.9921819 (2022).
  • 23.Li, C., Yan, H. & Zhao, Q. Efficient policy transfer in large-scale traffic light control via multi-agent hierarchical reinforcement learning. In 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE) 1–6. 10.1109/CASE56687.2023.10260400 (2023).
  • 24.Goyal, A. et al. Reinforcement learning with competitive ensembles of information-constrained primitives (2019). arXiv:1906.10667.
  • 25.El-Tantawy, S., Abdulhai, B. & Abdelgawad, H. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown toronto. IEEE Trans. Intell. Transp. Syst.14, 1140–1150 (2013). [Google Scholar]
  • 26.Wangapisit, O., Taniguchi, E., Teo, J. S. & Qureshi, A. G. Multi-agent systems modelling for evaluating joint delivery systems. Procedia Soc. Behav. Sci.125, 472–483 (2014). [Google Scholar]
  • 27.Yang, Y. & Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. CoRRabs/2011.00583 (2020).
  • 28.Bacon, P., Harb, J. & Precup, D. The option-critic architecture. In Singh, S. & Markovitch, S. (eds.) Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA 1726–1734. 10.1609/AAAI.V31I1.10916 (AAAI Press, 2017).
  • 29.Levy, A., Platt, R. & Saenko, K. Hierarchical reinforcement learning with hindsight (2019). arXiv:1805.08180.
  • 30.Karypis, G. & Kumar, V. METIS: Unstructured graph partitioning and sparse matrix ordering system, version 2.0. Technical report, University of Minnesota (1995).
  • 31.da Silva, A. B. C., de Oliveria, D. & Basso, E. Adaptive traffic control with reinforcement learning. In Proceedings of the 4th Workshop on Agents in Traffic and Transportation (AAMAS 2006)(May 2006) 80–86 (2006).
  • 32.Lobo, S. C., Neumeier, S., Fernández, E. M. G. & Facchi, C. Intas - the ingolstadt traffic scenario for SUMO. In López, P. Á. et al. (eds.) SUMO User Conference 2020, Virtual Event, October 26-28, 2020, vol. 1 of SUMO Conference Proceedings 73–92. 10.52825/SCP.V1I.102 (TIB Open Publishing, 2020).
  • 33.Chu, T., Wang, J., Codecà, L. & Li, Z. Multi-agent deep reinforcement learning for large-scale traffic signal control. CoRRabs/1903.04527 (2019).
  • 34.Shen, J., Hu, J., Zhao, Q. & Rao, W. Heterogeneous traffic intersections control design based on reinforcement learning. IET Intelligent Transport Systems18, 1760–1776, 10.1049/itr2.12408 (2024). https://ietresearch.onlinelibrary.wiley.com/doi/pdf/10.1049/itr2.12408.



Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
