Algorithm 1: DDPG-based Learning Algorithm

1:  Initialize weights $\theta^{Q}$ and $\theta^{\mu}$ of critic network $Q(s, a \mid \theta^{Q})$ and actor network $\mu(s \mid \theta^{\mu})$
2:  Initialize weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$ of target networks $Q'$ and $\mu'$
3:  Initialize experience replay buffer $\mathcal{B}$
4:  for episode $= 1, 2, \ldots, M$ do
5:      Receive initial observation state $s_{1}$
6:      for $t = 1, 2, \ldots, T$ do
7:          Choose action $a_{t}$ and do the simulation using pandapower
8:          Observe reward $r_{t}$ and the next state $s_{t+1}$
9:          Store transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in $\mathcal{B}$
10:         Sample a random minibatch of $N$ transitions $(s_{i}, a_{i}, r_{i}, s_{i+1})$ from $\mathcal{B}$
11:         Set $y_{i}$ according to Equation (26)
12:         Update the critic network parameters $\theta^{Q}$ by minimizing the loss, see Equation (27)
13:         Update the actor policy using the sampled policy gradient, see Equation (28)
14:         Softly update the target networks using the updated critic and actor network parameters:
            $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\,\theta^{Q'}$, $\quad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\,\theta^{\mu'}$
15:     end for
16: end for
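Since Equations (26)–(28) and the environment details are not reproduced inside the algorithm box, the following is a minimal PyTorch sketch of the loop structure only. Everything environment-specific is an assumption made for illustration: `PandapowerEnv`, the `case9` test grid, the ±10% generator set-point action, and the flat-voltage reward are hypothetical stand-ins, not the paper's definitions, and the standard DDPG target, critic loss, and policy gradient stand in for Equations (26)–(28).

```python
import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn


class Actor(nn.Module):
    """mu(s | theta_mu): maps a state to a bounded action (line 1)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)


class Critic(nn.Module):
    """Q(s, a | theta_Q): scores a state-action pair (line 1)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class PandapowerEnv:
    """Hypothetical wrapper: the paper's state, action, and reward
    definitions are not given in the box, so this shows only the
    plumbing (apply an action, re-run the power flow, read the result)."""
    def __init__(self):
        import pandapower as pp
        import pandapower.networks as pn
        self.pp, self.net = pp, pn.case9()   # stand-in 9-bus test grid
        self.base_p = self.net.gen.p_mw.to_numpy().copy()

    def reset(self):
        self.net.gen.p_mw = self.base_p
        self.pp.runpp(self.net)
        return self.net.res_bus.vm_pu.to_numpy(dtype=np.float32)

    def step(self, a):
        # Hypothetical action: shift generator set-points by up to +/-10%.
        self.net.gen.p_mw = self.base_p * (1.0 + 0.1 * a[:len(self.base_p)])
        self.pp.runpp(self.net)              # the "simulation" of line 7
        v = self.net.res_bus.vm_pu.to_numpy(dtype=np.float32)
        return v, -float(np.abs(v - 1.0).sum())   # toy reward: flat voltage


def soft_update(target, source, tau):
    # Line 14: theta' <- tau * theta + (1 - tau) * theta'
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


def train(env, s_dim, a_dim, episodes=50, steps=100,
          gamma=0.99, tau=0.005, batch=64):
    actor, critic = Actor(s_dim, a_dim), Critic(s_dim, a_dim)        # line 1
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)  # line 2
    buf = deque(maxlen=100_000)                                      # line 3
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

    for _ in range(episodes):                                        # line 4
        s = env.reset()                                              # line 5
        for _ in range(steps):                                       # line 6
            with torch.no_grad():                                    # line 7
                a = actor(torch.as_tensor(s, dtype=torch.float32))
                a = (a + 0.1 * torch.randn(a_dim)).clamp(-1, 1).numpy()
            s2, r = env.step(a)                                      # line 8
            buf.append((s, a, r, s2))                                # line 9
            s = s2
            if len(buf) < batch:
                continue
            # Line 10: sample a random minibatch of N transitions.
            S, A, Rb, S2 = map(np.array, zip(*random.sample(buf, batch)))
            S, A, S2 = (torch.as_tensor(x, dtype=torch.float32)
                        for x in (S, A, S2))
            Rb = torch.as_tensor(Rb, dtype=torch.float32).unsqueeze(-1)
            with torch.no_grad():    # line 11: y_i, standing in for Eq. (26)
                y = Rb + gamma * critic_t(S2, actor_t(S2))
            loss_c = nn.functional.mse_loss(critic(S, A), y)    # cf. Eq. (27)
            opt_c.zero_grad(); loss_c.backward(); opt_c.step()       # line 12
            loss_a = -critic(S, actor(S)).mean()                # cf. Eq. (28)
            opt_a.zero_grad(); loss_a.backward(); opt_a.step()       # line 13
            soft_update(critic_t, critic, tau)                       # line 14
            soft_update(actor_t, actor, tau)


if __name__ == "__main__":
    train(PandapowerEnv(), s_dim=9, a_dim=2, episodes=5, steps=20)
```

In this sketch the target networks of line 2 exist only to compute the bootstrapped target $y_i$ in line 11; the soft update of line 14 moves them toward the learned networks at rate $\tau \ll 1$, which is what keeps the critic's regression target slowly varying and the training stable.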