Sensors. 2023 Jun 27;23(13):5974. doi: 10.3390/s23135974
Algorithm 1 Proximal policy optimization (PPO) for trajectory planning
Require: Start angle $\theta_{\text{start}}$, end angle $\theta_{\text{end}}$, hyperparameters of the actor and critic networks, learning rate $\alpha$, discount factor $\gamma$, generalized advantage estimation (GAE) parameter $\lambda$, clipping parameter $\epsilon$, entropy coefficient $c_{\text{ent}}$, number of iterations $N$, number of epochs $E$, batch size $B$, episode length $T$, accuracy setting $d_e$.

1: for $n = 1, 2, \ldots, N$ do
2:   Collect $T$ time steps of transitions $\{s_t, a_t, r_t, s_{t+1}\}$ using the running policy $\pi_\theta$
3:   if $f_a(\theta) < d_e$ then
4:     $T \leftarrow T_{\text{current}}$
5:   end if
6:   Compute advantage estimates $\hat{A}_t$
7:   for $e = 1, 2, \ldots, E$ do
8:     Randomly sample a batch of $B$ time steps
9:     Update the critic network parameters by minimizing the loss:
10:      $L(\phi) = \frac{1}{B} \sum_{t=1}^{B} \left( V_\phi(s_t) - \hat{R}_t \right)^2$
11:     Update the actor network parameters using the PPO-clip objective and the entropy bonus:
12:      $\theta_{k+1} = \arg\max_\theta \frac{1}{B} \sum_{t=1}^{B} \left[ L^{\text{CLIP}}(\theta) + c_{\text{ent}}\, H\big(\pi_\theta(\cdot \mid s_t)\big) \right]$
13:      $L^{\text{CLIP}}(\theta) = \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} \hat{A}_t,\; \text{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right)$
14:   end for
15: end for
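
For concreteness, the following is a minimal PyTorch sketch of steps 6 and 8-13 of Algorithm 1. The helper names (gae_advantages, ppo_epoch_update), the assumption that the actor returns a torch.distributions object with a per-sample scalar log-probability, the requirement that rollout tensors are detached, and all default hyperparameter values are illustrative choices, not the authors' implementation.

# Minimal sketch of steps 6 and 8-13 of Algorithm 1 in PyTorch.
# Function names, shapes, and hyperparameter defaults are illustrative assumptions.
import torch


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Step 6: generalized advantage estimates A_hat_t and critic targets R_hat_t.
    # rewards, values: 1-D tensors of length T (assumed detached from the graph).
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values          # R_hat_t, used as the critic regression target
    return advantages, returns


def ppo_epoch_update(actor, critic, actor_opt, critic_opt,
                     states, actions, old_log_probs, advantages, returns,
                     batch_size=64, clip_eps=0.2, c_ent=0.01):
    # Steps 8-13: one epoch of minibatch critic regression and clipped actor updates.
    for idx in torch.randperm(states.shape[0]).split(batch_size):   # step 8: random minibatches
        s, a = states[idx], actions[idx]
        adv, ret, old_lp = advantages[idx], returns[idx], old_log_probs[idx]

        # Steps 9-10: L(phi) = (1/B) * sum_t (V_phi(s_t) - R_hat_t)^2
        value_loss = ((critic(s).squeeze(-1) - ret) ** 2).mean()
        critic_opt.zero_grad()
        value_loss.backward()
        critic_opt.step()

        # Steps 11-13: PPO-clip surrogate plus entropy bonus (maximized, hence the minus sign).
        dist = actor(s)   # assumed: returns a torch.distributions object with scalar log_prob per sample
        ratio = torch.exp(dist.log_prob(a) - old_lp)     # pi_theta(a|s) / pi_theta_k(a|s)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        actor_loss = -(torch.min(ratio * adv, clipped * adv)
                       + c_ent * dist.entropy()).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

In the outer loop of Algorithm 1, gae_advantages would be called once per iteration on the freshly collected rollout (step 6), and ppo_epoch_update would then be called E times on the same data (steps 7-14).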