Sensors. 2020 Mar 29;20(7):1890. doi: 10.3390/s20071890
Algorithm 1: MEP–DDPG Algorithm
Randomly initialize critic network $Q(s,a|\theta^Q)$ and actor $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize experience pools $R_1, R_2, \ldots, R_X$
for episode = 1, M do
 Initialize a random process $\mathcal{N}$ for action exploration
Receive initial observation state $s_1$
for t = 1, T do
 With probability $\eta$, select action $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise; otherwise, select action $a_t$ by MPC-SA
 Execute action $a_t$, observe reward $r_t$ and the new state $s_{t+1}$
 Store the experience $(s_t, a_t, r_t, s_{t+1})$ in the specific pool $R$ determined by the source of $a_t$
 Following the sampling policy, sample a random minibatch of $N$ experiences $(s_i, a_i, r_i, s_{i+1})$ from $R_1, R_2, \ldots, R_X$
 Set
$y_i = r_i + \gamma Q'\big(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'}) \,\big|\, \theta^{Q'}\big)$
 Update the critic by minimizing the loss:
$L = \frac{1}{N}\sum_i \left( y_i - Q(s_i, a_i|\theta^Q) \right)^2$
 Update the actor policy using the sampled policy gradient:
$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s_i}$
 Update the target networks:
$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$
$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$
end for
end for
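To make the structure of the training loop concrete, the sketch below implements the core updates of Algorithm 1 in PyTorch: the critic/actor networks with their targets, the multiple experience pools $R_1, \ldots, R_X$, the target value $y_i$, the critic loss, the sampled policy gradient, and the soft target updates. The network sizes, dimensions, hyperparameters (`STATE_DIM`, `ACTION_DIM`, `BATCH`, `GAMMA`, `TAU`), and the pool-sampling rule are illustrative assumptions, not the authors' settings; the MPC-SA action generator is not reproduced here.

```python
# Minimal sketch of the MEP-DDPG updates in Algorithm 1 (assumed dimensions and hyperparameters).
import random
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_POOLS, BATCH, GAMMA, TAU = 8, 2, 2, 64, 0.99, 0.005

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# Critic Q(s, a | theta^Q) and actor mu(s | theta^mu)
critic = mlp(STATE_DIM + ACTION_DIM, 1)
actor = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())
# Target networks initialized as copies: theta^{Q'} <- theta^Q, theta^{mu'} <- theta^mu
critic_t = mlp(STATE_DIM + ACTION_DIM, 1); critic_t.load_state_dict(critic.state_dict())
actor_t = mlp(STATE_DIM, ACTION_DIM, nn.Tanh()); actor_t.load_state_dict(actor.state_dict())
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

# Experience pools R_1 ... R_X, one per action source (noisy actor policy vs. MPC-SA)
pools = [[] for _ in range(N_POOLS)]

def store(pool_idx, s, a, r, s_next):
    """Store a transition in the pool matching the source of the action."""
    pools[pool_idx].append((s, a, r, s_next))

def sample_minibatch():
    # Placeholder sampling policy: uniform over all stored transitions, i.e.
    # each pool contributes in proportion to its size (the paper's specific
    # sampling policy may differ).
    flat = [tr for p in pools for tr in p]
    batch = random.sample(flat, min(BATCH, len(flat)))
    s, a, r, s2 = map(lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
    return s, a, r.unsqueeze(1), s2

def update():
    s, a, r, s2 = sample_minibatch()
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + GAMMA * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
    # Critic loss L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor update via the sampled policy gradient: ascend Q(s, mu(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft target updates: theta' <- tau*theta + (1 - tau)*theta'
    for net, net_t in ((critic, critic_t), (actor, actor_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```

In an interaction loop, transitions produced by the noisy actor policy would be stored in one pool and those produced by MPC-SA in another, mirroring the source-dependent storage step of Algorithm 1, after which `update()` performs the critic, actor, and target-network updates for one minibatch.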