Sensors. 2019 Mar 30;19(7):1547. doi: 10.3390/s19071547
Algorithm 1 Actor-dueling-critic algorithm
1: Initialize:
 Initialize actor $\mu(s|\theta^{\mu})$ and dueling-critic $Q(s,a|\theta^{Q},\alpha,\beta)$
 Initialize target actor $\mu'$ with $\theta^{\mu'} \leftarrow \theta^{\mu}$ and target dueling-critic $Q'$ with $\theta^{Q'} \leftarrow \theta^{Q}$, $\alpha' \leftarrow \alpha$, $\beta' \leftarrow \beta$
 Initialize replay memory $R = \varnothing$ and random process $\mathcal{N}$
 Uniformly separate the action space into $n$ intervals ($Z = \{z_1, z_2, \ldots, z_n\}$)
2: for episode $= 1$ to $M$ do
3:   Receive initial state $s_1$
4:   for $t = 1$ to $N$ do
5:    With probability $\epsilon$ select action $a_t = \mu(s_t|\theta^{\mu}) + \mathcal{N}_t$; otherwise select $a_t = \mu(s_t|\theta^{\mu})$
6:    Execute $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
7:    Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
8:    Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
9:    Implement the target actor: $a_{i+1} = \mu'(s_{i+1}|\theta^{\mu'})$
10:    Implement the target dueling-critic: $Q_{i+1} = Q'(s_{i+1}, a_{i+1}|\theta^{Q'}, \alpha', \beta')$ (Equation (14)) with $a_{i+1} \in z_j$
11:    Set $y_i = r_i + \gamma Q_{i+1}$ (set $y_i = r_i$ if $s_{i+1}$ is terminal)
12:    Update the dueling-critic by minimizing the loss:
         $L = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i|\theta^{Q}, \alpha, \beta)\big)^2$
13:    Update the actor using the sampled policy gradient:
         $\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|\theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s_i}$
14:    Soft update the target networks of the dueling-critic and actor ($\tau \ll 1$):
         $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$, $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$
         $\alpha' \leftarrow \tau\alpha + (1-\tau)\alpha'$, $\beta' \leftarrow \tau\beta + (1-\tau)\beta'$
15:   end for
16: end for
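To make the pieces of Algorithm 1 concrete, the following PyTorch sketch shows one way the dueling-critic $Q(s,a|\theta^{Q},\alpha,\beta)$ could be realised for a one-dimensional action space. The class name, layer sizes, and the standard dueling aggregation $Q = V + (A - \bar{A})$ are assumptions of this sketch rather than the paper's exact design, whose combination rule is given by its Equation (14) (not reproduced in this excerpt); the interpolated query `q_interp` is likewise an added assumption so that the $\nabla_a Q$ term of step 13 has a usable gradient through the interval grid.

```python
import torch
import torch.nn as nn


class DuelingCritic(nn.Module):
    """Dueling-critic Q(s, a | theta^Q, alpha, beta) over n uniform action intervals."""

    def __init__(self, state_dim, n_intervals, act_low, act_high, hidden=256):
        super().__init__()
        # Shared state encoder (theta^Q in Algorithm 1).
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)                # value stream (beta)
        self.advantage = nn.Linear(hidden, n_intervals)  # advantage stream (alpha)
        # Representatives of the uniform interval grid Z = {z_1, ..., z_n}.
        self.register_buffer("centers", torch.linspace(act_low, act_high, n_intervals))

    def q_all(self, state):
        # Q(s, z_j) for every interval; the standard dueling aggregation
        # Q = V + (A - mean A) is assumed here as a stand-in for Equation (14).
        h = self.encoder(state)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)

    def q_of(self, state, action):
        # Hard lookup: Q(s, a) is the value of the interval z_j containing a
        # (used for the TD target in step 10 and the critic loss in step 12).
        idx = torch.argmin((action - self.centers).abs(), dim=1, keepdim=True)
        return self.q_all(state).gather(1, idx)

    def q_interp(self, state, action):
        # Linear interpolation between neighbouring interval values, so that
        # grad_a Q in step 13 is non-zero; an assumption of this sketch, not
        # necessarily how the paper evaluates the actor gradient.
        q = self.q_all(state)
        a = action.squeeze(-1)
        lo = torch.bucketize(a, self.centers).clamp(1, self.centers.numel() - 1) - 1
        q_lo = q.gather(1, lo.unsqueeze(-1)).squeeze(-1)
        q_hi = q.gather(1, (lo + 1).unsqueeze(-1)).squeeze(-1)
        w = (a - self.centers[lo]) / (self.centers[lo + 1] - self.centers[lo])
        return (q_lo + w * (q_hi - q_lo)).unsqueeze(-1)
```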
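Steps 9 to 14 of the inner loop can then be sketched as one training iteration. The `Actor` network, the hyper-parameter values for $\gamma$ and $\tau$, and the helper names (`train_step`, `soft_update`) are illustrative assumptions rather than the paper's implementation; the actor is updated by gradient ascent on $Q(s,\mu(s))$, which is the usual way of realising the sampled policy gradient of step 13, and the soft update of step 14 covers $\theta^{Q}$, $\alpha$, and $\beta$ together because all three parameter sets live inside the critic module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy mu(s | theta^mu) giving a 1-D action in [act_low, act_high]."""

    def __init__(self, state_dim, act_low, act_high, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),
        )
        self.mid = (act_high + act_low) / 2.0
        self.scale = (act_high - act_low) / 2.0

    def forward(self, state):
        return self.mid + self.scale * self.net(state)


def soft_update(target, source, tau):
    # theta' <- tau * theta + (1 - tau) * theta'; applied to the critic this
    # updates theta^Q, alpha and beta in one pass (step 14).
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


def train_step(actor, critic, actor_t, critic_t, actor_opt, critic_opt,
               batch, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch  # minibatch (s_i, a_i, r_i, s_{i+1}, terminal flag)

    # Steps 9-11: TD target from the target actor and target dueling-critic.
    with torch.no_grad():
        a2 = actor_t(s2)                      # a_{i+1} = mu'(s_{i+1} | theta^mu')
        q2 = critic_t.q_of(s2, a2)            # Q_{i+1}, with a_{i+1} in z_j
        y = r + gamma * (1.0 - done) * q2     # y_i = r_i at terminal states

    # Step 12: minimise L = (1/N) * sum_i (y_i - Q(s_i, a_i | theta^Q, alpha, beta))^2.
    critic_loss = F.mse_loss(critic.q_of(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 13: ascend Q(s, mu(s)), i.e. the sampled deterministic policy gradient.
    actor_loss = -critic.q_interp(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 14: soft-update both target networks with tau << 1.
    soft_update(critic_t, critic, tau)
    soft_update(actor_t, actor, tau)
    return critic_loss.item(), actor_loss.item()
```

Here `critic` and `critic_t` are instances of the `DuelingCritic` sketch above (the targets `actor_t` and `critic_t` starting as deep copies of the online networks), and `batch` is assumed to be a tuple of float tensors with shapes `(N, state_dim)`, `(N, 1)`, `(N, 1)`, `(N, state_dim)`, `(N, 1)` sampled from the replay memory $R$.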