Sensors. 2021 Jul 2;21(13):4560. doi: 10.3390/s21134560
Algorithm 1 Learning Algorithm
1:  Initialize the policy and value function parameters θ_π, θ_V
2:  Set the maximum episode N and maximum step T
3:  repeat
4:    for i in [0, N−1] do
5:      Randomly initialize the state s_0 of the vehicle
6:      for t in [0, T−1] do
7:        Make a decision a_t according to π(a|s_t)
8:        Evaluate s_{t+1} according to (18)
9:        Collect D_t = {s_t, a_t, π(a_t|s_t), r_t, s_{t+1}}
10:     Save the trajectory τ_i = {D_0, D_1, …, D_{T−1}} to the memory buffer B
11:     Randomly sample M trajectories from B
12:     for each sampled trajectory τ do
13:       Set tmp = 0
14:       for j = T−1 to 0 do
15:         Q_j = r_j + γ·V̂(s_{j+1})
16:         A_j = Q_j − V̂(s_j)
17:         A_j^trace = A_j + γ·tmp
18:         tmp = min(1, π(a_j|s_j; θ) / π(a_j|s_j; θ_old)) · A_j^trace
19:         V_j^trace = V̂(s_j) + tmp
20:       ĝ_policy = (1/T) Σ_{j=0}^{T−1} ∇L_policy,j
21:       Update θ_π with the Adam optimizer using ĝ_policy
22:       ĝ_value = (1/T) Σ_{j=0}^{T−1} ∇L_value,j
23:       Update θ_V with the Adam optimizer using ĝ_value
24: until training success
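As a concrete illustration of steps 11–23, the following is a minimal PyTorch sketch of the backward trace recursion and the two Adam updates. It assumes a discrete action space and simple MLP networks; PolicyNet, ValueNet, process_trajectory, update, and the PPO-style clipped surrogate used for L_policy (and squared error for L_value) are illustrative assumptions rather than the authors' implementation, and the state transition of Equation (18) is left to the environment that fills the buffer.

```python
import random

import torch
import torch.nn as nn

# Hypothetical network definitions; the paper's actual architectures are not
# reproduced here. Any MLP producing action logits / a scalar value would do.
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, s):
        # pi(a|s): categorical distribution over discrete actions
        return torch.distributions.Categorical(logits=self.body(s))


class ValueNet(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, s):
        return self.body(s).squeeze(-1)  # V_hat(s)


def process_trajectory(traj, policy_net, value_net, gamma=0.99):
    """Steps 13-19: backward recursion producing A^trace and V^trace.

    `traj` is a list of dicts {s, a, logp, r, s_next} with tensor states,
    matching the tuple D_t collected in step 9."""
    states = torch.stack([d["s"] for d in traj])
    next_states = torch.stack([d["s_next"] for d in traj])
    actions = torch.tensor([d["a"] for d in traj])
    rewards = torch.tensor([d["r"] for d in traj], dtype=torch.float32)
    old_logp = torch.tensor([d["logp"] for d in traj], dtype=torch.float32)

    with torch.no_grad():
        v = value_net(states)                  # V_hat(s_j)
        v_next = value_net(next_states)        # V_hat(s_{j+1})
        logp = policy_net(states).log_prob(actions)
        ratio = torch.exp(logp - old_logp)     # pi(a|s;theta) / pi(a|s;theta_old)
        clipped = torch.clamp(ratio, max=1.0)  # min(1, ratio): truncated IS weight

        T = len(traj)
        a_trace = torch.zeros(T)
        v_trace = torch.zeros(T)
        tmp = 0.0
        for j in reversed(range(T)):                  # step 14: j = T-1, ..., 0
            q_j = rewards[j] + gamma * v_next[j]      # step 15
            a_j = q_j - v[j]                          # step 16
            a_trace[j] = a_j + gamma * tmp            # step 17
            tmp = clipped[j] * a_trace[j]             # step 18
            v_trace[j] = v[j] + tmp                   # step 19
    return states, actions, old_logp, a_trace, v_trace


def update(buffer, policy_net, value_net, pi_opt, v_opt, M,
           gamma=0.99, clip_eps=0.2):
    """Steps 11-23: sample M trajectories and take one Adam step per network."""
    for traj in random.sample(buffer, M):             # step 11
        s, a, old_logp, a_tr, v_tr = process_trajectory(
            traj, policy_net, value_net, gamma)

        # Steps 20-21: policy surrogate loss and Adam update of theta_pi.
        logp = policy_net(s).log_prob(a)
        ratio = torch.exp(logp - old_logp)
        l_policy = -torch.min(
            ratio * a_tr,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * a_tr).mean()
        pi_opt.zero_grad()
        l_policy.backward()
        pi_opt.step()

        # Steps 22-23: regress V_hat toward the traced target V^trace (theta_V).
        l_value = ((value_net(s) - v_tr) ** 2).mean()
        v_opt.zero_grad()
        l_value.backward()
        v_opt.step()
```

In use, the optimizers would be created once, e.g. pi_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4) and v_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3), and the outer rollout loops of steps 3–10 would fill `buffer` with trajectories before each call to update().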