Path Following Control for Underactuated Airships with Magnitude and Rate Saturation

2020 Dec 15;20(24):7176. doi: 10.3390/s20247176

Algorithm 1 PPO.

1: Input: initial actor-network parameters $θ_{0}$ and critic-network parameters $ϕ_{0}$.
2: for k = 0, 1, 2, ... do
3:     Run policy $π_{θ_{k}}$ in the environment for T timesteps.
4:     Compute rewards-to-go ${\hat{R}}_{t}$.
5:     Compute advantage estimates ${\hat{A}}_{t}$ based on the current critic network $V_{ϕ_{k}}$.
6:     Update the actor network by maximizing the clipped objective function, Equation (22), via gradient ascent with Adam [47]:
7:         $θ_{k + 1} = \underset{θ}{\arg\max} \; L_{t}^{CLIP + VF + S}(θ)$
8:     Update the critic network by minimizing the squared error between its value estimate and the rewards-to-go via gradient descent:
9:         $ϕ_{k + 1} = \underset{ϕ}{\arg\min} \; {(V_{ϕ} - {\hat{R}}_{t})}^{2}$
10: end for
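The per-batch quantities in Algorithm 1 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the discount factor `gamma`, the clipping range `eps`, and the simple advantage estimate (return minus critic value) are all assumptions made for illustration. Only the clipped term of $L_{t}^{CLIP + VF + S}$ is shown; the value-function and entropy terms are omitted.

```python
import math


def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards at each timestep of one trajectory.

    gamma is an assumed discount factor; it is not given in the algorithm box.
    """
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each entry reuses the sum of later rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def advantages(returns, values):
    """Assumed simplest estimate: rewards-to-go minus the critic's value."""
    return [r - v for r, v in zip(returns, values)]


def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Mean clipped surrogate; the actor update maximizes this quantity.

    eps is an assumed clipping range.
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Taking the minimum gives a pessimistic bound on the improvement.
        total += min(ratio * a, clipped * a)
    return total / len(adv)


def critic_loss(values, returns):
    """Mean squared error between critic values and rewards-to-go."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(returns)
```

In a full training loop, the actor parameters would be moved by gradient ascent on `clipped_surrogate` and the critic parameters by gradient descent on `critic_loss`, with the gradients supplied by an automatic-differentiation library.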