Sensors. 2021 Mar 25;21(7):2302. doi: 10.3390/s21072302
Algorithm 1 Traffic Signal Control Using Parameterized Deep RL.
1:  Initialize: learning rates {lr_Q, lr_x}, exploration parameter ϵ, minibatch size B, a probability distribution ζ, flow configurations, and network weights ω_0 and θ_0.
2:  for episode e = 1, E do
3:      Start the simulation, observe the initial state s_0, and take the initial joint action a_0.
4:      for t = 1, T do
5:          Compute the action parameters dP_t ← x_dP(s_t; θ_t).
6:          Select the action a_t = (P_t, dP_t) according to the ϵ-greedy policy:
                a_t = { a sample from ζ,                                   with probability ϵ;
                        (P_t, dP_t) with P_t = argmax_P Q(s_t, P, dP_t; ω_t), with probability 1 − ϵ.
7:          Perform a_t, observe the next state s_{t+1}, and receive the reward R_t.
8:          Store <s_t, a_t, R_t, s_{t+1}> in the replay memory M.
9:          Sample a random minibatch of B experiences from M.
10:         Compute the target
                y_t = { R_t,                                                        if t = T;
                        R_t + γ max_P Q(s_{t+1}, P, x_dP(s_{t+1}; θ_t); ω_t),       otherwise.
11:         Compute the gradients ∇_{ω_t} Q(ω_t) and ∇_{θ_t} Q(θ_t) using {y_t, s_t, a_t}.
12:         Update the weights: ω_{t+1} ← ω_t − lr_Q ∇_{ω_t} Q(ω_t) and θ_{t+1} ← θ_t − lr_x ∇_{θ_t} Q(θ_t).
13:     end for
14: end for
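The action-selection and target-computation steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: the two "networks" are stand-in linear maps, ζ is assumed uniform over phases, and the names (`N_PHASES`, `STATE_DIM`, `select_action`, `td_target`) and dimensions are assumptions chosen only to make the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PHASES = 4   # |P|: number of discrete signal phases (assumed)
STATE_DIM = 8  # state feature size (assumed)
GAMMA = 0.99   # discount factor γ

# θ parameterizes x_dP: maps a state to one duration parameter per phase.
theta = rng.normal(scale=0.1, size=(STATE_DIM, N_PHASES))
# ω parameterizes Q: scores every discrete phase from [state, dP].
omega = rng.normal(scale=0.1, size=(STATE_DIM + N_PHASES, N_PHASES))

def x_dP(s, theta):
    """Continuous action parameters dP_t = x_dP(s_t; θ_t) (step 5)."""
    return s @ theta

def Q(s, dP, omega):
    """Q(s, P, dP; ω) evaluated for all phases P at once."""
    return np.concatenate([s, dP]) @ omega

def select_action(s, theta, omega, eps):
    """ϵ-greedy selection over the parameterized action space (step 6)."""
    dP = x_dP(s, theta)
    if rng.random() < eps:
        P = int(rng.integers(N_PHASES))  # sample from ζ (uniform here)
    else:
        P = int(np.argmax(Q(s, dP, omega)))
    return P, dP

def td_target(R, s_next, theta, omega, terminal):
    """One-step target y_t (step 10)."""
    if terminal:
        return R
    dP_next = x_dP(s_next, theta)
    return R + GAMMA * np.max(Q(s_next, dP_next, omega))
```

With ϵ = 0 the selection is purely greedy: the discrete phase maximizes Q given the continuous parameters produced by x_dP, which is the defining coupling of the parameterized (hybrid discrete–continuous) action space; the gradient updates of steps 11–12 would then be applied to ω and θ with separate learning rates lr_Q and lr_x.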