Sensors. 2019 Mar 30;19(7):1547. doi: 10.3390/s19071547
Algorithm 1 Actor-dueling-critic algorithm
1: Initialize:
 Initialize actor $\mu(s|\theta^{\mu})$ and dueling-critic $Q(s,a|\theta^{Q},\alpha,\beta)$
 Initialize target actor $\mu'$ with $\theta^{\mu'} \leftarrow \theta^{\mu}$ and target dueling-critic $Q'$ with $\theta^{Q'} \leftarrow \theta^{Q}$, $\alpha' \leftarrow \alpha$, $\beta' \leftarrow \beta$
 Initialize replay memory $R = \varnothing$ and random process $\mathcal{N}$
 Uniformly separate the action space into $n$ intervals ($Z = \{z_1, z_2, \ldots, z_n\}$)
2: for episode $= 1$ to $M$ do
3:   Receive initial state $s_1$
4:   for $t = 1$ to $N$ do
5:    With probability $\epsilon$ select action $a_t = \mu(s_t|\theta^{\mu}) + \mathcal{N}_t$; otherwise select $a_t = \mu(s_t|\theta^{\mu})$
6:    Execute $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
7:    Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
8:    Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
9:    Implement the target actor: $a_{i+1} = \mu'(s_{i+1}|\theta^{\mu'})$
10:    Implement the target dueling-critic: $Q_{i+1} = Q'(s_{i+1}, a_{i+1}|\theta^{Q'}, \alpha', \beta')$ (Equation (14)) with $a_{i+1} \in z_j$
11:    Set $y_i = r_i + \gamma Q_{i+1}$ (set $y_i = r_i$ if $s_{i+1}$ is terminal)
12:    Update the dueling-critic by minimizing the loss:
         $L = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i|\theta^{Q}, \alpha, \beta)\big)^2$
13:    Update the actor using the sampled policy gradient:
         $\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|\theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s_i}$
14:    Soft update the target networks of the dueling-critic and actor ($\tau \ll 1$):
         $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$, $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$
         $\alpha' \leftarrow \tau\alpha + (1-\tau)\alpha'$, $\beta' \leftarrow \tau\beta + (1-\tau)\beta'$
15:   end for
16: end for
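To make the pieces of Algorithm 1 concrete, the following PyTorch sketch shows one way the dueling-critic $Q(s,a|\theta^{Q},\alpha,\beta)$ could be realised for a one-dimensional action space. The class name, layer sizes, and the standard dueling aggregation $Q = V + (A - \bar{A})$ are assumptions of this sketch rather than the paper's exact design, whose combination rule is given by its Equation (14) (not reproduced in this excerpt); the interpolated query `q_interp` is likewise an added assumption so that the $\nabla_a Q$ term of step 13 has a usable gradient through the interval grid.

```python
import torch
import torch.nn as nn


class DuelingCritic(nn.Module):
    """Dueling-critic Q(s, a | theta^Q, alpha, beta) over n uniform action intervals."""

    def __init__(self, state_dim, n_intervals, act_low, act_high, hidden=256):
        super().__init__()
        # Shared state encoder (theta^Q in Algorithm 1).
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)                # value stream (beta)
        self.advantage = nn.Linear(hidden, n_intervals)  # advantage stream (alpha)
        # Representatives of the uniform interval grid Z = {z_1, ..., z_n}.
        self.register_buffer("centers", torch.linspace(act_low, act_high, n_intervals))

    def q_all(self, state):
        # Q(s, z_j) for every interval; the standard dueling aggregation
        # Q = V + (A - mean A) is assumed here as a stand-in for Equation (14).
        h = self.encoder(state)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)

    def q_of(self, state, action):
        # Hard lookup: Q(s, a) is the value of the interval z_j containing a
        # (used for the TD target in step 10 and the critic loss in step 12).
        idx = torch.argmin((action - self.centers).abs(), dim=1, keepdim=True)
        return self.q_all(state).gather(1, idx)

    def q_interp(self, state, action):
        # Linear interpolation between neighbouring interval values, so that
        # grad_a Q in step 13 is non-zero; an assumption of this sketch, not
        # necessarily how the paper evaluates the actor gradient.
        q = self.q_all(state)
        a = action.squeeze(-1)
        lo = torch.bucketize(a, self.centers).clamp(1, self.centers.numel() - 1) - 1
        q_lo = q.gather(1, lo.unsqueeze(-1)).squeeze(-1)
        q_hi = q.gather(1, (lo + 1).unsqueeze(-1)).squeeze(-1)
        w = (a - self.centers[lo]) / (self.centers[lo + 1] - self.centers[lo])
        return (q_lo + w * (q_hi - q_lo)).unsqueeze(-1)
```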
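Steps 9 to 14 of the inner loop can then be sketched as one training iteration. The `Actor` network, the hyper-parameter values for $\gamma$ and $\tau$, and the helper names (`train_step`, `soft_update`) are illustrative assumptions rather than the paper's implementation; the actor is updated by gradient ascent on $Q(s,\mu(s))$, which is the usual way of realising the sampled policy gradient of step 13, and the soft update of step 14 covers $\theta^{Q}$, $\alpha$, and $\beta$ together because all three parameter sets live inside the critic module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy mu(s | theta^mu) giving a 1-D action in [act_low, act_high]."""

    def __init__(self, state_dim, act_low, act_high, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),
        )
        self.mid = (act_high + act_low) / 2.0
        self.scale = (act_high - act_low) / 2.0

    def forward(self, state):
        return self.mid + self.scale * self.net(state)


def soft_update(target, source, tau):
    # theta' <- tau * theta + (1 - tau) * theta'; applied to the critic this
    # updates theta^Q, alpha and beta in one pass (step 14).
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


def train_step(actor, critic, actor_t, critic_t, actor_opt, critic_opt,
               batch, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch  # minibatch (s_i, a_i, r_i, s_{i+1}, terminal flag)

    # Steps 9-11: TD target from the target actor and target dueling-critic.
    with torch.no_grad():
        a2 = actor_t(s2)                      # a_{i+1} = mu'(s_{i+1} | theta^mu')
        q2 = critic_t.q_of(s2, a2)            # Q_{i+1}, with a_{i+1} in z_j
        y = r + gamma * (1.0 - done) * q2     # y_i = r_i at terminal states

    # Step 12: minimise L = (1/N) * sum_i (y_i - Q(s_i, a_i | theta^Q, alpha, beta))^2.
    critic_loss = F.mse_loss(critic.q_of(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 13: ascend Q(s, mu(s)), i.e. the sampled deterministic policy gradient.
    actor_loss = -critic.q_interp(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 14: soft-update both target networks with tau << 1.
    soft_update(critic_t, critic, tau)
    soft_update(actor_t, actor, tau)
    return critic_loss.item(), actor_loss.item()
```

Here `critic` and `critic_t` are instances of the `DuelingCritic` sketch above (the targets `actor_t` and `critic_t` starting as deep copies of the online networks), and `batch` is assumed to be a tuple of float tensors with shapes `(N, state_dim)`, `(N, 1)`, `(N, 1)`, `(N, state_dim)`, `(N, 1)` sampled from the replay memory $R$.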