Abstract
Traffic problems often occur due to the traffic demands by the outnumbered vehicles on road. Maximizing traffic flow and minimizing the average waiting time are the goals of intelligent traffic control. Each junction wants to get larger traffic flow. During the course, junctions form a policy of coordination as well as constraints for adjacent junctions to maximize their own interests. A good traffic signal timing policy is helpful to solve the problem. However, as there are so many factors that can affect the traffic control model, it is difficult to find the optimal solution. The disability of traffic light controllers to learn from past experiences caused them to be unable to adaptively fit dynamic changes of traffic flow. Considering dynamic characteristics of the actual traffic environment, reinforcement learning algorithm based traffic control approach can be applied to get optimal scheduling policy. The proposed Sarsa(λ)-based real-time traffic control optimization model can maintain the traffic signal timing policy more effectively. The Sarsa(λ)-based model gains traffic cost of the vehicle, which considers delay time, the number of waiting vehicles, and the integrated saturation from its experiences to learn and determine the optimal actions. The experiment results show an inspiring improvement in traffic control, indicating the proposed model is capable of facilitating real-time dynamic traffic control.
1. Introduction
In most major cities, hundreds of thousands of vehicles distribute in a large and board area. It is a tough and complex work for us to effectively deal with such a large-scale, dynamic, and distributed system with a high degree of uncertainty [1]. Apart from the increasing number of vehicles in urban area, the fact that most of present traffic control systems have not taken full advantage of intelligent control of traffic light is one of the most important one [2]. People [3] have found that reasonable traffic control and improving the utilization efficiency of roads is an effective and economical way to solve the urban traffic problem for most cities. Traffic signal lights control policy, the most important part of intelligent transportation system, turns out to be even more essential [4].
However, as there are so many factors that affect the traffic lights control, off-line control policy model is not suitable for sudden and sporadic characteristics of road. Hereby, in this paper, we propose an online traffic control model which is based on Sarsa(λ) [5]. In our model, several traffic signal control modes are treated as candidate action selections; the vehicle speed and saturation of an intersection are viewed as context of environment, and common signal control indicators, including delay time, the number of waiting vehicles, and the integrated saturation are defined as return. In the experiments, the proposed model showed its ability to facilitate real-time traffic control.
2. Related Work
At present, the traffic control systems can be classified into static traffic control systems and dynamic traffic control systems, where the former often uses statistical approached to optimize the settings while the latter can adjust traffic controller duration dynamically according to real-time traffic conditions.
Many achievements in collaborative traffic flow guidance and control strategy have been made. The F-B method [6] has been widely used by many researchers and engineers of the transportation industry. By using the approach, the traffic jam problem was partly solved. Thereafter, there came many improved approaches [7] based on the F-B method. Driving compensation coefficient, along with delay time, was used to evaluate the efficiency of time allocation scheme [8]. The model minimized delay of waiting time, making the approach appear to be acute and reasonable. However, as the model could hardly deal with heavy traffic, we still need to find a more suitable approach.
The ability of intelligent traffic control as a good solution to traffic congestion problem has gradually received more and more attention [9]. However, congestion problems between adjacent intersections still need more efforts. The regional coordination control proved to be a good solution to this problem [10]. Although many area coordinated control methods were proposed, few of them yielded good results due to the lack of a clear regional control mathematical model, especially in complex environment with heavy traffic. However, due to the complexity and changeability, it is of little possibility to build an accurate mathematical model for traffic system in advance [11].
It has become a trend to solve traffic problems by taking advantage of computing technology and machine intelligence [12]. Among many machine learning approaches, reinforcement learning is suitable for the optimal control of the transportation system strategy as it does not require mathematical models of the external environment [13]. The study using the Q-learning algorithm [14] achieved online traffic control. The approach was able to choose the optimal coordination model under different traffic conditions. Some applications [15] that utilize Q-learning algorithm have received much significant effect. A paper implemented an online traffic control through Q-learning algorithm, yielding good effort in the normal state of traffic congestion [16].
3. Traffic Evaluation Indicators
Signal lights control plays a very important role in traffic management. A reasonable and good semaphores time allocation scheme guarantees that under normal circumstances the traffic moves smoothly. Frequently used traffic efficiency evaluation indicators [17] include delay time, the number of waiting vehicles, and intersection saturation.
3.1. Delay Time
The indicator delay time refers to the delay between the actual time and theoretically computational time for a vehicle to pass an intersection. In practice, we can get total delay time during a certain period of time and average delay time of a cross to evaluate the time difference. The more delay time represents the slower average speed of a vehicle to pass an intersection.
3.2. Number of Waiting Vehicles
The number of waiting vehicles shows how many vehicles are waiting behind stop line to pass the road intersection. The indicator [17] is used to measure the smooth degree of road as well as the road traffic flow. It is defined as
| (1) |
where waitG is the number of waiting vehicles before the green light and waitR is the number of waiting vehicles before the red light.
3.3. Intersection Saturation
The indicator intersection saturation denotes the ratio of the actual traffic flow to the maximum available traffic flow. Intersection saturation is calculated as
| (2) |
where dr is the ratio of red light duration to green light duration and sf is saturation flow of the intersection.
3.4. Traffic Flow Capacity
Traffic flow capacity represents the maximum possible number of vehicles passing through the intersection. The indicator reflects effect of signal control strategy. We can see that traffic flow capacity is related to traffic signal duration. A longer passing duration generally yields a stronger passing capacity.
4. Temporal Difference Learning
Reinforcement learning is a framework to learn directly from the interaction and thereby achieve goals [13, 18]. Reinforcement learning framework is abstract and flexible and can be applied in many different applications.
In artificial intelligence field, agent is defined as an entity that has cognitive skills, the ability to solve the problem, and the ability to communicate with the outside environment. By agent, we can establish some system for controlling model. In fact, the model based on agent is an anthropomorphic model; as a result, we can control the behavior of people in the system and unify other control units, providing a unified description of the method. Agents are connected through network; agents act as intelligent nodes on the network, therefore constructing a distributed multiagent system.
The agent model of intersection is as Figure 1, including environment perception module, learning module, decision module, execution module, knowledge base, communication module, and coordination module.
Figure 1.

The agent model of intersection, including environment perception module, learning module, decision module, execution module, knowledge base, communication module, and coordination module.
In reinforcement learning framework, agent is a learner and decision-maker, interacting with environment which is everything outside of agent. Agent chooses an action; the environment responds to the action, generates new scenes to the agent, and then returns a reward. The framework [13, 18] of reinforcement learning is shown in Figure 2.
Figure 2.

Framework of reinforcement learning. Agent selects an action; the environment responds to the action, generates new scenes to the agent, and then returns a reward.
Agent interacts with the environment at each step during a discrete-time sequence (t = 0,1, …). At each time step t, agent gets the representation of environment denoted by state s t ∈ S, where S is the set of all possible states; agent chooses an action a t ∈ A(s t), where a t ∈ A(s t) is all available actions. By taking the action, agent receives a reward r t+1 ∈ R and gets to a new status s t+1. The ultimate goal of agent is to maximize the sum of the rewards in long term. The mapping from state to action selection is policy of the agent, denoted by π t. Reinforcement learning solves how agent changes policy through experience.
The temporal difference (TD) learning is capable of learning directly from raw experience without determining dynamic model of environment in advance. Moreover, the model learned by temporal difference is updated by estimation which is based on part of learning rather than final results of the learning. These two characteristics of temporal difference make it particularly suitable for solving the prediction problems and control problems in real-time control applications. Given some experience with policy π, temporal difference learning updates estimated V of V π [19], as
| (3) |
where R t is actual return after time step t and α is a step size parameter. Temporal difference learning updates V in step t + 1 using the observed reward r t+1 and estimated V(S t+1).
Let Q π(s, a) be the value of taking action a, in S under a policy. Q π(s, a) [20] can be defined as
| (4) |
Sarsa (state-action-return-state-action) is an on-line TD control method. Sarsa(λ) is an eligibility trace [21] version of Sarsa. The update of Q π(s, a) [22] depends on
| (5) |
where δ t = r t+1 + γQ t(s t+1, a t+1) − Q t(s t, a t) and
| (6) |
5. Sarsa(λ)-Based Traffic Control Model
In the transport network, maximizing traffic flow and minimizing the average waiting time is the goal of scheduling and control. In traffic scheduling, junctions compete with other junctions fighting for larger traffic flow. During the course, junctions form a policy of coordination as well as constraints for adjacent junctions to maximize their own interests. Considering dynamic characteristics of the actual traffic environment, reinforcement learning algorithm based traffic control approach can be applied to get optimal scheduling policy.
In practical environment, traffic flows of four-intersections with twelve flow directions are very complex. As shown in Figure 3, there are altogether four intersections: Ia, Ib, Ic, and Id, where
-
X in is the intersection saturation of vehicle to intersection Ia,
-
X ab is the intersection saturation from intersection Ia to intersection Ib,
-
X ac is the intersection saturation from intersection Ia to intersection Ic,
-
X ad is the intersection saturation from intersection Ia to intersection Id,
-
X ba is the intersection saturation from intersection Ib to intersection Ia,
-
X bc is the intersection saturation from intersection Ib to intersection Ic,
-
X bd is the intersection saturation from intersection Ib to intersection Id,
-
X ca is the intersection saturation from intersection Ic to intersection Ia,
-
X cb is the intersection saturation from intersection Ic to intersection Ib,
-
X cd is the intersection saturation from intersection Ic to intersection Id,
-
X da is the intersection saturation from intersection Id to intersection Ia,
-
X db is the intersection saturation from intersection Id to intersection Ib,
-
X dc is the intersection saturation from intersection Id to intersection Ic.
Figure 3.

Traffic flow and control of four intersections with twelve flow directions.
The control coordination between the intersections can be viewed as a Markov process, denoted by 〈S, A, R〉, where S represents the state of the intersection, A stands for the action for traffic control, and R indicates the return attained by the control agent.
5.1. Definition of State
Agent gets real-time traffic state and then returns traffic control decision by current state of the road. Some most important data such as intersection saturation and vehicle speed are used to reflect the state of road traffic.
Nevertheless, the traffic state is continuous, although reinforcement learning being capable of handling continuous state [21, 23] tends to make the model more complex. To simplify the algorithm, we hereby discretise saturation state and vehicle speed. The discrete saturation and speed values are shown in Table 1.
Table 1.
Discrete saturation and speed values.
| Saturation | Discrete value | Speed range (m/h) | Discrete value |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0.1 | 1 | (0,10] | 1 |
| 0.2 | 2 | (10,20] | 2 |
| 0.3 | 3 | (20,30] | 3 |
| 0.4 | 4 | (30,40] | 4 |
| 0.5 | 5 | (40,50] | 5 |
| 0.6 | 6 | (50,60] | 6 |
| 0.7 | 7 | >60 | 7 |
Hereby, we can obtain altogether 49 possible states by combining 7 saturations and 7 speed ranges (7∗7 = 49).
5.2. Definition of Action
In reinforcement learning framework, policy defines the learning agent behaviour at a given time. It in fact is a mapping from perceived states to available actions. Reinforcement learning model obtains rewards by mapping the scene to the action which affects not only the direct rewards but also the next scene, so that all subsequent rewards will be influenced. Specific states and actions are very different in various applications.
In general, traffic lights control contains five major adjustment modes: increasing green signal light duration, reducing green signal light duration, extending the signal cycle; shortening the signal light cycle, and setting all lights to red. In our study, traffic lights control actions can be categorized to 6 types: keeping the signal lights unchanged in stopping signal lights timing, extending the signal lights duration, shortening signal lights duration, setting signal lights to yellow, and setting signal lights to red. Each of them is for one of the following actual traffic scenarios.
The policy keeping the signal lights unchanged is used in the case of the normal traffic flow when the lights control strategies do not change.
The policy stopping signal lights timing is used when the traffic one direction is blocked while traffic on the other direction is normal. The policy is the last resort to release one direction traffic jam.
The policy extending the signal duration is mainly used in the case that in one direction traffic flow is blocked and the other direction is normal. Extending the signal duration increases the traffic flow while signal lights are still timing.
The policy shortening signal duration is mainly used in the case that in one direction of traffic flow is small while that of the other direction is large. Reducing signal light duration shortens the waiting time of the other direction and lets vehicles of that direction pass the intersection sooner, while signal lights keep timing.
The policy setting all lights to yellow is used for warning vehicles to slow down and keep watch.
The policy setting all lights to red is to let all the vehicles pass and clear the intersection. This policy is usually used only in emergency or the whole area is badly blocked.
In short, the action and the corresponding value are shown in Table 2.
Table 2.
Action and corresponding value.
| Value | Action |
|---|---|
| 1 | Keeping the signal lights unchanged |
| 2 | Stopping signal lights timing |
| 3 | Extending the signal lights duration |
| 4 | Shortening signal lights duration |
| 5 | Setting signal lights to yellow |
| 6 | Setting signal lights to red |
5.3. Definitions of Reward and Return
Reward function in reinforcement learning defines the goal of the problem. The perceived state of the environment is mapped to a value, reward, representing internal needs of the state. The ultimate goal of reinforcement learning agent is to maximize the total reward in long term.
In our work, agent makes signal control decisions under different traffic conditions and returns an action sequence, so that by the actions the road traffic blocking indicator is the minimum. To be further, the model gives out an optimal traffic coordination mode in a certain traffic state. Here, we use traffic cost indicator to evaluate the traffic flows as
| (7) |
where ω is a weight value, D denotes the average delay time, and W represents the number of waiting vehicles.
5.4. Sarsa(λ)-Based Traffic Control Optimization
Sarsa(λ) learns from the original experience without environment dynamic model; it can obtain experience by interacting with environment with minimum amount of calculation cost experience; and it is a general learning model for a long-term prediction of the dynamic system. Hence we can conclude that Sarsa(λ) is very suitable for the real-time traffic signal control model. Hereby, we propose an on-line Sarsa(λ)-based traffic signal light optimization model, which overcomes the drawbacks of off-line model such as being unable to dealing with complexity and changeability of traffic control system, and requiring some accurate mathematical models. See Algorithm 1.
Algorithm 1.

Sarsa(λ)-based traffic control optimization.
6. Simulation Experiment and Results
To comprehensively evaluate behaviour of the model, we carried on simulation experiments with two different scenarios: one took advantage of an on-line traffic control optimization model and the other utilized an off-line traffic control optimization model. We also did simulation experiments in two different kinds of intersections: the city centre with heavy traffic flow and new distinct of the city with light traffic flow, as shown in Figure 4.
Figure 4.

Snapshots from traffic monitoring system. (a) is snapshot of traffic in center of the city which has a heavy traffic flow and (b) is that of new distinct which has a less traffic flow.
We utilize Sarsa(λ) in our study to learn a controller with learning rate = 0.5, discount rate = 0.9, and λ = 0.6. During learning process, cost was updated 1000 with 6000 episodes. Simulation experiments results in different intersections with an on-line control model and with an off-line control model are showed in Table 3.
Table 3.
Simulation experiments results in different intersection with an on-line control model and with an off-line control model.
| Intersection | Scenario | Average delay time (s) | Average number of waiting vehicles |
|---|---|---|---|
| City centre | with on-line control | 51.2 | 7.42 |
| with off-line control | 59.1 | 7.63 | |
|
| |||
| New distinct | with on-line control | 21.6 | 4.21 |
| with off-line control | 39.5 | 5.89 | |
We can see from Table 3 that the results optimized by the model in new distinction of the city overwhelmingly won those with an off-line optimization model; while in centre of the city, although improved, the margin of the two approaches is narrow. It is mainly because the roads of new distinction have high traffic capacity and traffic flow there is relatively smaller while the traffic flow in the centre of the city is too heavy for any intelligent model to improve.
7. Conclusions
Because traffic control system is so complex and changeable that an off-line traffic control model with predefined strategy can hardly cope with the traffic congestion and sudden traffic accidents which actually may occur at any time, the demand for combining timely and intelligent traffic control policy with real-time road traffic is getting more and more urgent.
Reinforcement learning accumulates experiment and knowledge by keeping interaction with environment. Although it usually needs a long duration to complete learning, it has pretty good learning ability to complex system, enabling it to handle unknown complex states well. The application of reinforcement learning in traffic management area is gradually receiving more and more concerns.
In this work, we, under the framework of reinforcement learning, propose a Sarsa(λ)-based learning algorithm for traffic control optimization. The actual continuous traffic states are discretized for the purpose of simplification. We design actions for traffic control and define reward and return by mean of traffic cost which combines with multiple traffic capacity indicators.
In the simulation testing experiment, we evaluated the behavior of traffic control with optimization in new distinct of the city as well as in the centre of the city. The results of traffic control optimized by our proposed on-line model were better than those optimized by off-line model.
Acknowledgments
This work was funded by the National Natural Science Foundation of China (61303108, 61373094, and 61272005), High School Natural Foundation of Jiangsu (13KJB520020), Natural Science Foundation of Jiangsu (BK2012616).
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Authors' Contribution
Fei Zhu conceived and designed the experiments. Xiaoke Zhou Performed the experiments. Xiaoke Zhou analyzed the data. Xiaoke Zhou contributed reagents/materials/analysis tools. Xiaoke Zhou and Fei Zhu Wrote the paper.
References
- 1.Zhu F, Ning J, Ren Y, Peng J. Optimization of image processing in video-based traffic monitoring. Elektronika ir Elektrotechnika. 2012;18(8):91–96. [Google Scholar]
- 2.de Schutter B. Optimal traffic light control for a single intersection. Proceedings of the American Control Conference (ACC ’99); June 1999; pp. 2195–2199. [Google Scholar]
- 3.Findler N, Stapp J. A distributed approach to optimized control of street traffic signals. Journal of Transportation Engineering. 1992;118(1):99–110. [Google Scholar]
- 4.Baskar LD, de Schutter B, Hellendoorn H. Traffic management for automated highway systems using model-based predictive control. IEEE Transactions on Intelligent Transportation Systems. 2012;3(2):838–847. [Google Scholar]
- 5.Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, Mass, USA: MIT Press; 1998. [Google Scholar]
- 6.Mase K, Yamamoto H. Advanced traffic control methods for network management. IEEE Communications Magazine. 1990;28(10):82–88. [Google Scholar]
- 7.Baskar LD, de Schutter B, Hellendoorn J, Papp Z. Traffic control and intelligent vehicle highway systems: a survey. IET Intelligent Transport Systems. 2011;5(1):38–52. [Google Scholar]
- 8.Broucke M, Varaiya P. A theory of traffic flow in automated highway systems. Transportation Research C. 1996;4(4):181–210. [Google Scholar]
- 9.Choi W, Yoon H, Kim K, Chung I, Lee S. A traffic light controlling FLC considering the traffic congestion. Proceedings of the International Conference on Fuzzy Systems (AFSS ’02); 2002; pp. 69–75. [Google Scholar]
- 10.Wiering MA. Multi-agent reinforcement learning for traffic light control. Proceedings of the 17th International Conference on Machine Learning (ICML ’00); 2000; pp. 1151–1158. [Google Scholar]
- 11.Helbing D, Hennecke A, Shvetsov V, Treiber M. Micro- and macro-simulation of freeway traffic. Mathematical and Computer Modelling. 2002;35(5-6):517–547. [Google Scholar]
- 12.Zegeye S, de Schutter B, Hellendoorn J, Breunesse EA, Hegyi A. A predictive traffic controller for sustainable mobility using parameterized control policies. IEEE Transactions on Intelligent Transportation Systems. 2012;13(3):1420–1429. [Google Scholar]
- 13.Bonarini A, Lazaric A, Montrone F, Restelli M. Reinforcement distribution in fuzzy Q-learning. Fuzzy Sets and Systems. 2009;160(10):1420–1443. [Google Scholar]
- 14.Chin YK, Wei YK, Wei LK, Min KT, Teo KTK. Q-learning traffic signal optimization within multiple intersections traffic network. Proceedings of the 6th UKSim/AMSS European Symposium on Computer Modeling and Simulation (EMS ’12); November 2012; pp. 343–348. [Google Scholar]
- 15.Prashanth LA, Bhatnagar S. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems. 2011;12(2):412–421. [Google Scholar]
- 16.Chin YK, Lee LK, Bolong N, Yang SS, Teo KTK. Exploring Q-learning optimization in traffic signal timing plan management. Proceedings of the 3rd International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN ’11); July 2011; pp. 269–274. [Google Scholar]
- 17.Ben-Akiva M, Cuneo D, Hasan M, Jha M, Yang Q. Evaluation of freeway control using a microscopic simulation laboratory. Transportation Research C. 2003;11(1):29–50. [Google Scholar]
- 18.Sutton RS. Learning to predict by the methods of temporal differences. Machine Learning. 1988;3(1):9–44. [Google Scholar]
- 19.Watkins CJCH, Dayan P. Q-learning. Machine Learning. 1992;8(3-4):279–292. [Google Scholar]
- 20.Wiewiora E. Potential-based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research. 2003;19:205–208. [Google Scholar]
- 21.Martin M. On-line support vector machine regression. Proceedings of the European Conference on Machine Learning (ECML’02); 2002; pp. 173–198. [Google Scholar]
- 22.Mastronarde N, van der Schaar M. Online reinforcement learning for dynamic multimedia systems. IEEE Transactions on Image Processing. 2010;19(2):290–305. doi: 10.1109/TIP.2009.2035228. [DOI] [PubMed] [Google Scholar]
- 23.Wang Q, Zhan Z. Reinforcement learning model, algorithms and its application. Proceedings of the International Conference on Mechatronic Science, Electric Engineering and Computer (MEC ’11); August 2011; pp. 1143–1146. [Google Scholar]
