Sensors. 2023 Jul 17;23(14):6448. doi: 10.3390/s23146448
Algorithm 1: Multi-Pass DQN (MP-DQN) Algorithm.
Input: Probability distribution $\xi$, mini-batch size $B$, exploration parameter $\varepsilon$, learning rates $\{\alpha_a, \alpha_{a,p}\}$.
Initialization: actor weights $(\omega, \omega')$ and actor-parameter weights $(\theta, \theta')$
For $t = 1, 2, 3, \ldots, T$ do
   Estimate the action parameters $z_j(s(t); \theta(t))$ with the actor network
   Choose the action $a(t) = (j, z_j)$ according to the $\varepsilon$-greedy policy:
      $$a(t) = \begin{cases} \text{a random sample drawn from the distribution } \xi, & \text{with probability } \varepsilon \\ (j, z_j), \ j = \arg\max_{j \in A_d} Q(s(t), j, z_j; \omega), & \text{with probability } 1 - \varepsilon \end{cases}$$
   Execute action $a(t)$, and receive the immediate reward $r(s(t), a(t))$ and the next state $s(t+1)$
   Save the experience $(s(t), a(t), r(t), s(t+1))$ into the replay memory
   Sample a mini-batch of size $B$ at random from the replay memory
   Define the target $y(t)$ by
      $$y(t) = r(t) + \gamma \max_{j \in A_d} Q\big(s(t+1), j, z_j(s(t+1); \theta); \omega\big)$$
   Select the diagonal elements of the multi-pass Q-value matrix
      $$\begin{pmatrix} Q_{11} & \cdots & Q_{1c} \\ \vdots & \ddots & \vdots \\ Q_{c1} & \cdots & Q_{cc} \end{pmatrix}$$
   Choose the best action $j$ by taking the argmax over the diagonal elements $Q_{jj}$
   Use $(y(t), s(t), a(t))$ to estimate the gradients $\nabla_\omega L_x(\omega)$ and $\nabla_\theta L_x(\theta)$
Update the weight parameters $\omega, \omega', \theta, \theta'$
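The acting-side steps of the listing (estimating all action parameters with the actor network, selecting the discrete action $\varepsilon$-greedily, and keeping only the diagonal $Q_{jj}$ of the multi-pass Q-value matrix) can be sketched as follows. This is a minimal illustration, not the authors' implementation: PyTorch is assumed, the network shapes and the `param_slices` layout (which slice of the joint parameter vector belongs to each discrete action) are hypothetical, and the exploration distribution $\xi$ is taken to be uniform for simplicity.

```python
# Minimal sketch of the MP-DQN forward pass and epsilon-greedy action selection.
# Assumptions (not from the paper): PyTorch, a uniform exploration distribution xi,
# and hypothetical shapes; param_slices[j] is the slice of the joint parameter
# vector z that belongs to discrete action j.
import torch
import torch.nn as nn

class ActorParamNet(nn.Module):
    """Actor network (weights theta): maps a state to all action parameters z_1..z_c."""
    def __init__(self, state_dim, total_param_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, total_param_dim), nn.Tanh(),  # parameters assumed scaled to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class QNet(nn.Module):
    """Q-network (weights omega): Q-values for all discrete actions, given the state
    concatenated with the joint action-parameter vector."""
    def __init__(self, state_dim, total_param_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + total_param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

def multipass_q(q_net, s, z, param_slices):
    """Multi-pass trick: run c forward passes, the j-th with only z_j un-masked,
    arrange the outputs as a (c x c) matrix per sample and keep the diagonal Q_jj."""
    B, c = s.shape[0], len(param_slices)
    s_rep = s.repeat_interleave(c, dim=0)                        # (B*c, state_dim)
    z_masked = torch.zeros(B * c, z.shape[-1], device=z.device)
    for j, sl in enumerate(param_slices):                        # pass j keeps only z_j
        z_masked[j::c, sl] = z[:, sl]
    q_all = q_net(s_rep, z_masked).view(B, c, c)                 # (B, pass, action)
    return q_all.diagonal(dim1=1, dim2=2)                        # (B, c): Q_jj per action

def epsilon_greedy_action(q_net, actor, s, param_slices, eps):
    """Choose a(t) = (j, z_j) for a single state s of shape (1, state_dim):
    a random discrete action with probability eps, the greedy one otherwise."""
    with torch.no_grad():
        z = actor(s)                                             # z_j(s; theta) for all j
        if torch.rand(1).item() < eps:
            j = torch.randint(len(param_slices), (1,)).item()    # xi assumed uniform here
        else:
            j = multipass_q(q_net, s, z, param_slices).argmax(dim=-1).item()
        return j, z[0, param_slices[j]]
```

Masking the other actions' parameters to zero in each pass is what makes the diagonal entries $Q_{jj}$ depend only on their own parameters $z_j$, which is the point of the multi-pass variant over the single-pass P-DQN input.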
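A matching sketch of the learning step (the target $y(t)$, the TD gradient for $\omega$, and the actor-parameter gradient for $\theta$), continuing from the helpers above. The separate target copies, the replay layout, and the exact loss forms are assumptions made for illustration; the listing itself only states that the gradients $\nabla_\omega L_x(\omega)$ and $\nabla_\theta L_x(\theta)$ are estimated from $(y(t), s(t), a(t))$.

```python
# Sketch of one MP-DQN learning update on a sampled mini-batch, reusing multipass_q
# from the previous sketch. The target copies (q_target, actor_target) and the loss
# forms below are assumptions, not taken from the paper's listing.
import torch
import torch.nn.functional as F

def mp_dqn_update(q_net, actor, q_target, actor_target, batch, param_slices,
                  q_opt, actor_opt, gamma=0.99):
    # batch: (s, j, z, r, s_next, done), where j is a long tensor of chosen discrete
    # actions and z is the full joint parameter vector stored at acting time
    # (a simplifying assumption for this sketch); done is a float 0/1 flag.
    s, j, z, r, s_next, done = batch

    # Target: y(t) = r(t) + gamma * max_j Q(s(t+1), j, z_j(s(t+1)); target weights)
    with torch.no_grad():
        z_next = actor_target(s_next)
        q_next = multipass_q(q_target, s_next, z_next, param_slices).max(dim=-1).values
        y = r + gamma * (1.0 - done) * q_next

    # Gradient w.r.t. omega: squared TD error on the executed discrete action j
    q_taken = multipass_q(q_net, s, z, param_slices).gather(1, j.unsqueeze(1)).squeeze(1)
    q_loss = F.mse_loss(q_taken, y)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Gradient w.r.t. theta: move the action parameters toward higher Q-values
    # (the usual P-DQN-style actor-parameter objective, assumed here)
    actor_loss = -multipass_q(q_net, s, actor(s), param_slices).sum(dim=-1).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

If target copies are used as assumed here, the final step of the listing (updating $\omega, \omega', \theta, \theta'$) would also include a periodic or soft update of the target weights toward the online ones.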