Path Planning for Multi-Arm Manipulators Using Deep Reinforcement Learning: Soft Actor–Critic with Hindsight Experience Replay


2020 Oct 19;20(20):5911. doi: 10.3390/s20205911

Algorithm 1: Proposed SAC-based path planning algorithm for multi-arm manipulator.

1:
Define the MAMMDP, the augmented state $q_{t}$, the initial state $q_{init}$, and the goal state $q_{goal}$
2:
Initialize network parameters $ψ, θ_{1, 2}, ϕ$
3:
Initialize the parameter values of the target network $\bar{ψ} \leftarrow ψ$
4:
Initialize global replay memory $D$
5:
6:
for $e = 1$ to $M$ do
7:
Initialize local buffer $L$ ▹ Memory for an episode
8:
for $t = 0$ to $T - 1$ do
9:
Randomly choose the goal and initial positions $q_{goal}, q_{init} \in Q_{free}^{a}$
10:
$a_{t} = f_{ϕ} (ϵ_{t}, q_{t} | | q_{goal}), ϵ_{t} \sim N (0, σ_{t})$
11:
${\hat{q}}_{t + 1} = q_{t} + α \cdot a_{t} + ϵ_{e}, ϵ_{e} \sim N (0, σ_{e})$
12:
13:
if ${\hat{q}}_{t + 1} \in Q_{free}^{a}$ then ▹ Get next state and reward
14:
$q_{t + 1} \leftarrow {\hat{q}}_{t + 1}$
15:
$r_{t + 1} = - 1$
16:
else if ${\hat{q}}_{t + 1} \in Q_{collide}^{a}$ then
17:
$q_{t + 1} \leftarrow q_{t}$
18:
$r_{t + 1} = - 1$
19:
else if $| q_{t + 1} - q_{goal} | \leq η \cdot α$ then
20:
$r_{t + 1} = 0$
21:
Terminate due to goal arrival
22:
end if
23:
24:
Store the transition $(q_{t} | | q_{goal}, a_{t}, r_{t + 1}, q_{t + 1} | | q_{goal})$ in $D, L$
25:
▹ Parameters update
26:
Sample mini-batch of m transitions $(q_{l} | | q_{goal}, a_{l}, r_{l + 1}, q_{l + 1} | | q_{goal})$ from $D$
27:
$J_{V} (ψ) = E_{q_{l}} \left[ \frac{1}{2} \left( V_{ψ} (q_{l} | | q_{goal}) - E_{a_{l}} \left[ \min_{k = 1, 2} Q_{θ_{k}} (q_{l} | | q_{goal}, a_{l}) - β \log π_{ϕ} (a_{l} | q_{l} | | q_{goal}) \right] \right)^{2} \right]$
28:
$J_{Q} (θ_{k}) = E_{q_{l}, a_{l}} \left[ \frac{1}{2} \left( Q_{θ_{k}} (q_{l} | | q_{goal}, a_{l}) - \left( r_{l + 1} + V_{\bar{ψ}} (q_{l + 1} | | q_{goal}) \right) \right)^{2} \right], \quad k = 1, 2$
29:
$J_{π} (ϕ) = E_{q_{l}, a_{l}} \left[ β \log π_{ϕ} (a_{l} | q_{l} | | q_{goal}) - \min_{k = 1, 2} Q_{θ_{k}} (q_{l} | | q_{goal}, a_{l}) \right]$
30:
31:
Update the network parameters $ψ, θ_{1, 2}, ϕ$ by gradient descent
32:
using $\nabla_{ψ} J_{V} (ψ), \nabla_{θ_{1}} J_{Q} (θ_{1}), \nabla_{θ_{2}} J_{Q} (θ_{2}), \nabla_{ϕ} J_{π} (ϕ)$
33:
34:
Update state value target $\bar{ψ} \leftarrow τ ψ + (1 - τ) \bar{ψ}$
35:
end for
36:
37:
if $q_{T} \neq q_{goal}$ then ▹ HER
38:
Set additional goal $q_{goal}^{'} \in \{ q_{1}, q_{2}, \dots, q_{T} \}$
39:
for $t = 0$ to $T - 1$ do
40:
Sample a transition $(q_{t} | | q_{goal}, a_{t}, r_{t + 1}, q_{t + 1} | | q_{goal})$ from $L$
41:
if $| q_{t + 1} - q_{goal}^{'} | \leq η \cdot α$ then
42:
$r_{t + 1}^{'} = 0$
43:
else $r_{t + 1}^{'} = - 1$
44:
end if
45:
Store the transition $(q_{t} | | q_{goal}^{'}, a_{t}, r_{t + 1}^{'}, q_{t + 1} | | q_{goal}^{'})$ in $D$
46:
end for
47:
end if
48:
end for
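
The transition rule (steps 11 to 21) and the HER relabeling (steps 37 to 47) of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The constants `ALPHA`, `ETA`, and `SIGMA_E` carry illustrative values only, `is_free` is a hypothetical collision checker standing in for the test $q \in Q_{free}^{a}$, and the helper names `distance`, `step`, and `her_relabel` are made up for this example. Two interpretive assumptions are made: the goal test is applied after the collision check has fixed $q_{t + 1}$, and the additional goal defaults to the final state $q_{T}$, which is one admissible choice from $\{ q_{1}, \dots, q_{T} \}$.

```python
import math
import random

ALPHA = 0.05    # step size alpha (illustrative value, not taken from the paper)
ETA = 1.0       # goal tolerance factor eta (illustrative value)
SIGMA_E = 0.0   # std of the exploration noise epsilon_e (illustrative value)


def distance(a, b):
    """Euclidean distance between two joint configurations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def step(q, action, q_goal, is_free, rng=random):
    """One transition in the spirit of steps 11-21 of Algorithm 1.

    is_free is a hypothetical collision checker: it returns True when a
    configuration lies in the collision-free space.
    Returns (q_next, reward, done).
    """
    # Step 11: candidate next configuration with additive Gaussian noise.
    q_hat = [qi + ALPHA * ai + rng.gauss(0.0, SIGMA_E)
             for qi, ai in zip(q, action)]
    # Steps 13-18: accept the candidate if it is collision-free,
    # otherwise stay at the current configuration.
    q_next = q_hat if is_free(q_hat) else list(q)
    # Steps 19-21 (assumed to run after the collision check):
    # reward 0 and termination when the goal is reached, else reward -1.
    if distance(q_next, q_goal) <= ETA * ALPHA:
        return q_next, 0.0, True
    return q_next, -1.0, False


def her_relabel(episode, pick=lambda visited: visited[-1]):
    """Relabel an episode with an additional goal (steps 37-47).

    episode: list of (q, q_goal, action, reward, q_next) tuples taken
    from the local buffer L. pick selects the additional goal among the
    visited states q_1..q_T; the default (final state) is an assumption.
    Returns the relabeled transitions to be stored in D.
    """
    visited = [transition[4] for transition in episode]
    new_goal = pick(visited)
    relabeled = []
    for q, _old_goal, action, _old_reward, q_next in episode:
        reached = distance(q_next, new_goal) <= ETA * ALPHA
        relabeled.append((q, new_goal, action, 0.0 if reached else -1.0, q_next))
    return relabeled
```

With the final state as the additional goal, the last relabeled transition always receives reward 0, so every failed episode contributes at least one successful example to the global replay memory $D$. Passing `pick=random.choice` instead would draw the additional goal uniformly from the visited states.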