Algorithm 2 DDPG with HER and Adaptive Exploration

1: Initialize the online policy network $\mu(s \mid \theta^{\mu})$ with weights $\theta^{\mu}$
2: Initialize the target policy network $\mu'$ with weights $\theta^{\mu'} \leftarrow \theta^{\mu}$
3: Initialize the online Q network $Q(s, a \mid \theta^{Q})$ with weights $\theta^{Q}$
4: Initialize the target Q network $Q'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$
5: Initialize the experience replay pool $R$ to capacity $N$
6: for episode = 1, M do
7:     Receive the initial observation state $s_1$
8:     for t = 1, T do
9:         Obtain the exploration noise scale from the adaptive exploration adjustment unit
10:        Calculate the exploration noise $\mathcal{N}_t$
11:        Select the action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$
12:        Execute the action $a_t$ in the environment
13:        Observe the reward $r_t$ and the new state $s_{t+1}$
14:        Store the transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
15:        Store modified transitions with alternative (hindsight) goals in $R$
16:        Sample a random mini-batch of $K$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
17:        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
18:        Update $\theta^{Q}$ by minimizing the loss: $L = \frac{1}{K}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2$
19:        Update $\theta^{\mu}$ using the sampled policy gradient: $\nabla_{\theta^{\mu}} J \approx \frac{1}{K}\sum_{i}\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\,\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$
20:        Update the target networks: $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$
21:    end for
22: end for
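The sketch below illustrates how the main steps of Algorithm 2 could be realized in PyTorch: adaptive noise scaling (step 9), noisy action selection (steps 10-11), HER goal relabelling (step 15), and the DDPG critic, actor, and soft target updates (steps 17-20). It is a minimal illustration under assumed conventions, not the authors' implementation; `adaptive_sigma`, `reward_fn`, the state/goal/action dimensions, the action clamp to [-1, 1], and the assumption that the achieved goal equals the first `GOAL_DIM` components of the state are all hypothetical.

```python
import copy
import random
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM, ACTION_DIM = 6, 3, 2      # hypothetical dimensions
GAMMA, TAU, K = 0.98, 0.005, 64                # discount, soft-update rate, batch size

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(), nn.Linear(256, n_out))

actor  = mlp(STATE_DIM + GOAL_DIM, ACTION_DIM)                   # online policy mu (step 1)
critic = mlp(STATE_DIM + GOAL_DIM + ACTION_DIM, 1)               # online Q network (step 3)
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)  # target networks (steps 2, 4)
pi_opt = torch.optim.Adam(actor.parameters(),  lr=1e-3)
q_opt  = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay = []                                                      # experience replay pool (step 5)

def adaptive_sigma(success_rate, sigma_max=0.3, sigma_min=0.05):
    # Hypothetical adjustment unit (step 9): shrink the noise scale as the success rate rises.
    return sigma_max - (sigma_max - sigma_min) * success_rate

def select_action(s, g, sigma):
    # Steps 10-11: Gaussian exploration noise added to the deterministic policy output.
    with torch.no_grad():
        a = actor(torch.cat([s, g]))
    return (a + sigma * torch.randn_like(a)).clamp(-1.0, 1.0)

def her_relabel(episode, reward_fn):
    # Step 15: re-store the episode with the goal replaced by the state actually reached.
    # Assumes the achieved goal is the first GOAL_DIM components of the final next state.
    new_goal = episode[-1][-1][:GOAL_DIM]
    for s, g, a, r, s2 in episode:
        replay.append((s, new_goal, a, reward_fn(s2, new_goal), s2))

def update():
    # Steps 16-20: one DDPG update from a random mini-batch of K transitions.
    s, g, a, r, s2 = zip(*random.sample(replay, K))
    s, g, a, s2 = (torch.stack(x) for x in (s, g, a, s2))
    r = torch.tensor(r, dtype=torch.float32).unsqueeze(1)
    with torch.no_grad():                                             # step 17: targets y_i
        a2 = actor_t(torch.cat([s2, g], dim=1))
        y = r + GAMMA * critic_t(torch.cat([s2, g, a2], dim=1))
    q_loss = ((y - critic(torch.cat([s, g, a], dim=1))) ** 2).mean()  # step 18: critic loss
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    pi_loss = -critic(torch.cat([s, g, actor(torch.cat([s, g], dim=1))], dim=1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()             # step 19: policy gradient
    for net, tgt in ((critic, critic_t), (actor, actor_t)):           # step 20: soft target update
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)
```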