Sensors. 2023 Feb 2;23(3):1618. doi: 10.3390/s23031618
Algorithm 1 DDPG-based Learning Algorithm
1 Initialize weights $\theta^Q$ and $\theta^\mu$ of critic network $Q(s, a \,|\, \theta^Q)$ and actor network $\mu(s \,|\, \theta^\mu)$
2 Initialize weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$ of target networks $Q'$ and $\mu'$
3 Initialize experience replay buffer $R$
4 for episode $= 1, 2, \ldots, M$ do
5 Receive initial observation state $s_1$
6 for $t = 1, 2, \ldots, T$ do
7 Choose $a_t = \mu(s_t \,|\, \theta^\mu)$ and do simulation using pandapower
8 Observe reward $r_t$ and the next state $s_{t+1}$
9 Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
10 Sample a random minibatch of $B$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
11 Set $y_i = r_i + \gamma Q'\big(s_{i+1}, \mu'(s_{i+1} \,|\, \theta^{\mu'}) \,|\, \theta^{Q'}\big)$ according to Equation (26)
12 Update critic network parameters by minimizing the loss, see Equation (27):
     $\mathrm{Loss} = \frac{1}{B} \sum_i \big(y_i - Q(s_i, a_i \,|\, \theta^Q)\big)^2$
13 Update the actor policy using the sampled policy gradient, see Equation (28):
     $\nabla_{\theta^\mu} J \approx \frac{1}{B} \sum_i \nabla_a Q(s, a \,|\, \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \,|\, \theta^\mu)\big|_{s_i}$
14 Softly update the target networks using the updated critic and actor network parameters:
     $\theta^{Q'} \leftarrow \tau \theta^Q + (1-\tau)\theta^{Q'}$
     $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1-\tau)\theta^{\mu'}$
15 end for
16 end for
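
For concreteness, a minimal PyTorch sketch of the training loop in Algorithm 1 is given below. The network sizes, the exploration noise, the hyperparameters (M, T, B, gamma, tau, learning rates), and the env interface (assumed to wrap the pandapower power-flow simulation with reset()/step() methods) are illustrative assumptions, not the paper's implementation.

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

# Illustrative sketch of Algorithm 1 (DDPG). Environment wrapper, network
# sizes, and hyperparameters are assumptions, not the paper's values.

class Actor(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def soft_update(target, source, tau):
    # Step 14: theta' <- tau * theta + (1 - tau) * theta'
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

def train(env, s_dim, a_dim, M=200, T=96, B=64, gamma=0.99, tau=0.005):
    # env is a hypothetical wrapper around the pandapower simulation:
    # env.reset() -> state (np.ndarray), env.step(a) -> (next_state, reward, done)
    actor, critic = Actor(s_dim, a_dim), Critic(s_dim, a_dim)
    actor_t, critic_t = Actor(s_dim, a_dim), Critic(s_dim, a_dim)
    actor_t.load_state_dict(actor.state_dict())       # Step 2: copy weights to targets
    critic_t.load_state_dict(critic.state_dict())
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    R = deque(maxlen=100_000)                          # Step 3: replay buffer

    for episode in range(M):                           # Step 4
        s = env.reset()                                # Step 5
        for t in range(T):                             # Step 6
            with torch.no_grad():                      # Step 7: a_t = mu(s_t) + exploration noise
                a = actor(torch.as_tensor(s, dtype=torch.float32))
                a = (a + 0.1 * torch.randn_like(a)).clamp(-1.0, 1.0).numpy()
            s_next, r, done = env.step(a)              # Step 8: runs a pandapower power flow
            R.append((s, a, r, s_next))                # Step 9
            s = s_next
            if len(R) < B:
                continue

            batch = random.sample(R, B)                # Step 10
            si, ai, ri, si1 = (torch.as_tensor(np.array(x), dtype=torch.float32)
                               for x in zip(*batch))

            with torch.no_grad():                      # Step 11: target y_i, Equation (26)
                y = ri.unsqueeze(-1) + gamma * critic_t(si1, actor_t(si1))

            critic_loss = ((y - critic(si, ai)) ** 2).mean()   # Step 12: Equation (27)
            critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

            actor_loss = -critic(si, actor(si)).mean()          # Step 13: Equation (28)
            actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

            soft_update(critic_t, critic, tau)         # Step 14: Polyak averaging
            soft_update(actor_t, actor, tau)
            if done:
                break
    return actor

The soft_update helper implements the Polyak averaging of step 14; with tau much smaller than 1, the target networks change slowly, which keeps the bootstrapped target y_i stable during critic updates.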