|
Algorithm 1 Learning Algorithm |
-
1:
Initialize policy and value function parameters
-
2:
Set the maximum number of episodes N and the maximum number of steps T
-
3:
repeat
-
4:
for each episode i (up to N) do
-
5:
Randomly initialize the states of vehicle
-
6:
for each step t (up to T) do
-
7:
Make a decision according to the current policy
-
8:
Evaluate according to (18)
-
9:
Collect
-
10:
Save trajectory to memory buffer
-
11:
Randomly sample M trajectories from the memory buffer
-
12:
for each sampled trajectory do
-
13:
set
-
14:
for t from the last step down to 0 do
-
15:
-
16:
-
17:
-
18:
-
19:
-
20:
-
21:
Update using Adam optimizer by
-
22:
-
23:
Update using Adam optimizer by
-
24:
until training success
|
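The buffer sampling in step 11 and the backward loop in step 14 can be illustrated with a minimal Python sketch. The listing does not show which quantity the backward loop accumulates, so the discounted return used here is an assumption, as are the helper names `sample_trajectories` and `discounted_returns`, the discount factor `gamma`, and the `bootstrap` argument. None of these come from the original algorithm.

```python
import random


def sample_trajectories(buffer, m, rng=random):
    """Draw up to m distinct trajectories uniformly at random from the buffer.

    `buffer` is a list of stored trajectories (hypothetical layout: any object).
    """
    return rng.sample(buffer, min(m, len(buffer)))


def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    """Backward recursion R_t = r_t + gamma * R_{t+1}, from the last step to 0.

    ASSUMPTION: the source does not state what the backward loop computes;
    a discounted return is a common choice and is used here for illustration.
    `bootstrap` is the value carried in from beyond the final step.
    """
    running = bootstrap
    returns = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):  # t = T-1, ..., 0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Toy memory buffer: each trajectory is just a list of per-step rewards.
    memory = [[1.0, 1.0, 1.0], [0.0, 2.0], [0.5], [1.0, 0.0, 0.0, 1.0]]
    for rewards in sample_trajectories(memory, m=2, rng=random.Random(0)):
        print(rewards, "->", discounted_returns(rewards, gamma=0.5))
```

Running the loop backward lets each return reuse the one computed for the following step, so a trajectory of length T is processed in a single pass. The resulting per-step targets would then feed the two Adam updates at the end of the listing, whose loss functions are not recoverable from this excerpt.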