2021 Jan 25;7:566037. doi: 10.3389/frobt.2020.566037

TABLE 5.

Parameter table for training and simulation setup.

Symbol	Description	Value
PPO
$\alpha$	Learning rate	2.5e-4
$\gamma$	Discount factor	0.99
$\lambda$	GAE parameter	0.95
$\tau$	Entropy bonus coefficient	0.001
$T$	Number of steps per policy update	1,024
$K$	Number of optimization epochs per update	4
$M$	Batch size	64
$N$	Number of parallel actors	4
Environment
$\Delta$	Look-ahead distance (m)	3
$n_w$	Number of training path waypoints	7
$\gamma_a$	Sonar span apex angle (deg)	140
$s_r$	Sonar range (m)	25
–	Sensor suite dimensions	(15, 15)
–	Sensor min-pooling output dimensions	(8, 8)
–	Sensor update frequency	1
$[V_{\min}, V_{\max}]$	Ocean current intensity limits (m/s)	[0.5, 1]
$d_a$	End-goal acceptance radius (m)	1
$T_f$	Control fin time constant (s)	0.2
Reward function
$c_{\chi}$	Course error penalty coefficient	-1
$c_{\upsilon}$	Elevation error penalty coefficient	-1
$\gamma_c$	Obstacle closeness penalty scaling	-12.5
$\epsilon_c$	Minimum obstacle penalty closeness	-5e-3
$\epsilon_{oa}$	Minimum vessel-relative scaling	-0.05
$c_{\phi}$	Roll penalty coefficient	-1
$c_r$	Roll rate penalty coefficient	-1
$c_{\delta_r}$	Rudder action penalty coefficient	-0.1
$c_{\delta_s}$	Elevator action penalty coefficient	-0.1
$\lambda_r$	Path-following/collision-avoidance (COLAV) trade-off	[0.9, 0.5, 0.1]
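The PPO rows of Table 5 can be read as a training configuration. The sketch below, assuming a standard PPO setup with Generalized Advantage Estimation (GAE), collects them in a hypothetical `PPOHyperparameters` dataclass and shows where $\gamma$ and $\lambda$ enter the advantage computation. The field names loosely follow the keyword arguments of Stable-Baselines3's `PPO` class, but the source does not say which implementation was used. Whether $T$ counts steps per actor or across all $N$ actors is also not stated, so the sketch does not derive a total rollout size.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PPOHyperparameters:
    """PPO values from Table 5; field names are illustrative, not from the source."""
    learning_rate: float = 2.5e-4  # alpha
    gamma: float = 0.99            # discount factor
    gae_lambda: float = 0.95       # GAE parameter lambda
    ent_coef: float = 0.001        # entropy bonus coefficient tau
    n_steps: int = 1024            # T, steps per policy update
    n_epochs: int = 4              # K, optimization epochs per update
    batch_size: int = 64           # M
    n_envs: int = 4                # N, parallel actors


def compute_gae(rewards: List[float], values: List[float], dones: List[bool],
                last_value: float, gamma: float, lam: float) -> List[float]:
    """Generalized Advantage Estimation over one rollout of a single actor.

    values[t] is V(s_t); last_value is V(s_T), used to bootstrap the final step.
    dones[t] is True if the episode ended at step t, so no value is
    bootstrapped across that boundary.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1} (reset at episode end)
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```

For example, with `hp = PPOHyperparameters()`, `compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [False, False, True], 0.0, hp.gamma, hp.gae_lambda)` returns roughly `[2.825, 1.9405, 1.0]`: each later TD error is discounted by $\gamma\lambda = 0.9405$ per step, and nothing is bootstrapped past the terminal step.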
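The sensor rows in Table 5 list a (15, 15) sonar grid reduced to an (8, 8) min-pooling output. One way to obtain exactly that shape, sketched here as an assumption rather than the authors' actual layer, is 2×2 windows with stride 2, padding the odd last row and column. Taking the minimum keeps the closest return in each window, so a nearby obstacle is not diluted by free-space readings. The function name `min_pool_2x2` is hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def min_pool_2x2(grid: Grid, pad_value: float = math.inf) -> Grid:
    """2x2 min-pooling with stride 2; cells past an odd edge count as pad_value.

    A 15x15 grid becomes ceil(15/2) x ceil(15/2) = 8x8.
    """
    rows, cols = len(grid), len(grid[0])
    out_rows, out_cols = math.ceil(rows / 2), math.ceil(cols / 2)

    def at(r: int, c: int) -> float:
        # Padding cells default to +inf, so they never win the minimum.
        return grid[r][c] if r < rows and c < cols else pad_value

    return [
        [min(at(2 * i, 2 * j), at(2 * i, 2 * j + 1),
             at(2 * i + 1, 2 * j), at(2 * i + 1, 2 * j + 1))
         for j in range(out_cols)]
        for i in range(out_rows)
    ]


if __name__ == "__main__":
    # Every ray reads the 25 m maximum range except one close return.
    sonar = [[25.0] * 15 for _ in range(15)]
    sonar[14][14] = 3.0
    pooled = min_pool_2x2(sonar)
    print(len(pooled), len(pooled[0]), pooled[7][7])
```

With that input, the pooled 8×8 output holds 3.0 in its bottom-right entry, because the last window covers row and column 14 plus padding, and 25.0 everywhere else.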
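Table 5 names $\lambda_r$ as the path-following/COLAV trade-off but does not give the reward equation. The sketch below assumes the common convex-combination form $r = \lambda_r r_{pf} + (1 - \lambda_r) r_{oa}$, under which the three listed values shift weight progressively from path following toward obstacle avoidance. The helper `combined_reward` is hypothetical, and the per-step rewards in the demo are made-up inputs, not results from the source.

```python
from typing import Tuple

# lambda_r values listed in Table 5.
LAMBDA_R_VALUES: Tuple[float, ...] = (0.9, 0.5, 0.1)


def combined_reward(r_pf: float, r_oa: float, lambda_r: float) -> float:
    """Blend a path-following reward r_pf with a collision-avoidance reward r_oa.

    ASSUMPTION: convex combination lambda_r * r_pf + (1 - lambda_r) * r_oa.
    Table 5 names lambda_r as the trade-off but does not state the equation.
    """
    if not 0.0 <= lambda_r <= 1.0:
        raise ValueError("lambda_r must lie in [0, 1]")
    return lambda_r * r_pf + (1.0 - lambda_r) * r_oa


if __name__ == "__main__":
    # Same illustrative step rewards under each listed trade-off:
    # a smaller lambda_r lets the obstacle term dominate.
    for lam in LAMBDA_R_VALUES:
        print(lam, combined_reward(r_pf=-0.5, r_oa=-2.0, lambda_r=lam))
```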