Algorithm 2 The single-layer DDPG-based MRA training algorithm.
1: (Input) Batch size, actor learning rate, critic learning rate, decay rate d, discount factor, and soft update parameter;
2: (Output) The learned actor/critic networks that decide the action for (7);
3: Initialize the actor, the critic, the action, and the replay buffer D, and set the initial decay rate;
4: for episode = 1 to the maximum number of episodes do
5:   Initialize the state and the exploration noise;
6:   for each time step of the episode do
7:     Normalize the state with (32);
8:     Execute the action in (30), obtain the reward with (23), and observe the new state;
9:     if replay buffer D is not full then
10:      Store the transition in D;
11:    else
12:      Replace the oldest transition in D with the new one;
13:      Update the index of the oldest transition;
14:      Randomly sample a mini-batch of stored transitions from D;
15:      Update the critic online network by minimizing the loss function in (36);
16:      Update the actor online network with the gradient obtained by (37);
17:      Soft-update the target networks with their parameters updated by (29);
18:      Decay the exploration noise by the decay rate d;
19:    end if
20:  end for
21: end for
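The training loop above can be sketched in code. The sketch below is an illustration only, not the paper's implementation: the MRA environment, the linear actor mu(s) = theta*s and critic Q(s, a) = w . [s, a, 1], and all hyperparameter values are placeholder assumptions standing in for the networks and for equations (23), (29), (30), (36), and (37). It does, however, follow the listing's control flow: a FIFO replay buffer that overwrites the oldest transition once full (steps 9-13), mini-batch sampling and critic/actor updates only after the buffer fills (steps 14-16), soft target updates (step 17), and noise decay by the rate d (step 18).

```python
import random
import numpy as np

class ReplayBuffer:
    """FIFO buffer: once full, the oldest transition is overwritten (steps 9-13)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.oldest = 0  # index of the oldest stored transition

    def store(self, transition):
        if len(self.data) < self.capacity:       # step 10: buffer not yet full
            self.data.append(transition)
        else:                                    # step 12: replace the oldest one
            self.data[self.oldest] = transition
            self.oldest = (self.oldest + 1) % self.capacity  # step 13

    def full(self):
        return len(self.data) == self.capacity

    def sample(self, batch_size):                # step 14: random mini-batch
        return random.sample(self.data, batch_size)


def soft_update(target, online, tau):
    """Step 17: target <- tau * online + (1 - tau) * target."""
    return tau * online + (1.0 - tau) * target


def step_env(s, a):
    """Toy 1-D stand-in for the MRA environment (assumption):
    next state s' = clip(s + a), reward = -(s + a)^2, so a = -s is optimal."""
    reward = -(s + a) ** 2
    return reward, float(np.clip(s + a, -1.0, 1.0))


# Linear actor mu(s) = theta * s and critic Q(s, a) = w . [s, a, 1] (assumptions)
theta, theta_targ = 0.5, 0.5
w, w_targ = np.zeros(3), np.zeros(3)

alpha_a, alpha_c = 1e-4, 1e-3    # actor / critic learning rates (placeholders)
gamma, tau, d = 0.9, 0.05, 0.99  # discount, soft-update, and decay rates
buffer, batch = ReplayBuffer(64), 16

for episode in range(30):
    s, eps = float(np.random.uniform(-1, 1)), 1.0  # step 5: state and noise
    for t in range(50):
        # Step 8: execute a noisy action, observe reward and next state
        a = float(np.clip(theta * s + eps * np.random.randn(), -1.0, 1.0))
        r, s_next = step_env(s, a)
        buffer.store((s, a, r, s_next))          # steps 9-13 inside the buffer
        if buffer.full():
            for (si, ai, ri, sni) in buffer.sample(batch):
                # Critic target y = r + gamma * Q_targ(s', mu_targ(s'))
                an = theta_targ * sni
                y = ri + gamma * (w_targ @ np.array([sni, an, 1.0]))
                phi = np.array([si, ai, 1.0])
                td = y - w @ phi
                w = w + alpha_c * td * phi       # step 15: descend the TD loss
                # Step 16: deterministic policy gradient dQ/da * dmu/dtheta
                theta = theta + alpha_a * w[1] * si
            theta_targ = soft_update(theta_targ, theta, tau)  # step 17
            w_targ = soft_update(w_targ, w, tau)
            eps *= d                             # step 18: decay the noise
        s = s_next
```

One design point worth noting: because the sampling and update steps sit in the else branch of the buffer-full test, no learning happens until the replay buffer has filled once, exactly as in the listing.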