2020 Sep 22;20(18):5443. doi: 10.3390/s20185443
Algorithm 1 DDPG algorithm
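Randomly initialize critic network $Q(s, a \mid \theta^{Q})$ and actor $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$
Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$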
Initialize replay buffer R
for episode = 1, M do
   Initialize a random process $\mathcal{N}$ for action exploration
   Receive initial observation state $s_1$
   for t = 1, T do
     Select action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ according to the current policy and exploration noise
     Execute action $a_t$, then observe reward $r_t$ and new state $s_{t+1}$
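     Store transition $\langle s_t, a_t, r_t, s_{t+1} \rangle$ in $R$
     Sample a random minibatch of $N$ transitions $\langle s_i, a_i, r_i, s_{i+1} \rangle$ from $R$
     Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
     Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i \mid \theta^{Q}))^2$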
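     Update the actor policy using the sampled gradient:
         $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^{Q})|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})|_{s=s_i}$
     Update the target networks:
         $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$
         $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$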
   end for
end for
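
The bookkeeping steps of Algorithm 1 (replay buffer, critic target, critic loss, and soft target update) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `ReplayBuffer`, `td_targets`, `critic_loss`, and `soft_update` are hypothetical, networks are stood in for by ordinary callables, weights are flat lists of floats, and the gradient steps for the actor and critic are left out because they depend on the network library in use.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer R of transitions (s, a, r, s_next).

    Assumption: the oldest transitions are discarded once capacity is reached.
    """

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, n):
        # Uniform minibatch of n distinct transitions from R.
        return random.sample(list(self.data), n)


def td_targets(batch, target_actor, target_critic, gamma):
    """Critic targets: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return [
        r + gamma * target_critic(s_next, target_actor(s_next))
        for (_, _, r, s_next) in batch
    ]


def critic_loss(batch, targets, critic):
    """Mean squared error: L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    errors = [
        (y - critic(s, a)) ** 2
        for (s, a, _, _), y in zip(batch, targets)
    ]
    return sum(errors) / len(errors)


def soft_update(target_params, params, tau):
    """Target update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [
        tau * p + (1.0 - tau) * tp
        for p, tp in zip(params, target_params)
    ]
```

With a small `tau`, `soft_update` moves the target weights only slightly toward the learned weights at each step, so the targets $y_i$ change slowly while the critic is being fitted to them.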