Algorithm 1 RLCA algorithm

Initialize the training environment
Initialize the replay memory buffer to capacity D
Initialize the evaluate network with random weights θ
Initialize the target network with random weights θ⁻ = θ
For episode = 1, M do
    Initialize the initial positions of the own USV and the obstacle USVs
    While true do
        Update the training environment
        With probability ε, select a random USV action a_t
        Otherwise, select the USV action a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the training environment, and obtain the next state s_{t+1}
        Obtain the reward r_m via maneuvering and the COLREGs
        Obtain the category of the encounter, and add one to its count by the method of category-based exploration
        Obtain a reward r_c based on category-based exploration
        Obtain the total reward r_t = r_m + r_c
        Store the transition (s_t, a_t, r_t, s_{t+1}) in replay memory buffer D
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Obtain the target y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻)
        Update the evaluate network parameters θ with gradient descent on (y_j − Q(s_j, a_j; θ))²
        If the number of steps reaches the update step N then
            Update the target network with weights θ⁻ = θ
        End if
        Increment the number of steps by one
    End while
End for
Return the weights θ
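The sketch below illustrates how the loop in Algorithm 1 could be organized in code. It is a minimal, assumption-laden Python/PyTorch illustration, not the paper's implementation: the Gym-style environment interface, the state/action dimensions, the "encounter_category" field in the step info, and the 1/count exploration bonus are all placeholders standing in for the USV simulator, COLREGs reward, and category-based exploration reward defined elsewhere in the paper. Only the DQN bookkeeping (replay buffer, evaluate/target networks, ε-greedy selection, periodic target update) follows the listing directly.

```python
import random
from collections import deque, defaultdict

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 5                    # assumed sizes of the USV state/action spaces
GAMMA, EPSILON, LR = 0.99, 0.1, 1e-3           # assumed hyperparameters
MEMORY_CAPACITY, BATCH_SIZE, TARGET_UPDATE_N = 10_000, 64, 200

def build_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_eval, q_target = build_net(), build_net()
q_target.load_state_dict(q_eval.state_dict())  # θ⁻ ← θ
optimizer = torch.optim.Adam(q_eval.parameters(), lr=LR)
memory = deque(maxlen=MEMORY_CAPACITY)         # replay memory buffer D
category_counts = defaultdict(int)             # counts used by category-based exploration
step = 0

def select_action(state):
    # ε-greedy selection over the evaluate network
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_eval(torch.tensor(state, dtype=torch.float32)).argmax())

def train_episode(env, max_steps=500):
    global step
    state = env.reset()                        # own USV and obstacle USVs re-initialized
    for _ in range(max_steps):
        action = select_action(state)
        # reward from maneuvering and the COLREGs comes from the environment (assumed interface)
        next_state, r_maneuver, done, info = env.step(action)
        category = info["encounter_category"]  # assumed: env reports the encounter category
        category_counts[category] += 1
        r_explore = 1.0 / category_counts[category]   # placeholder exploration bonus, not the paper's formula
        total_reward = r_maneuver + r_explore
        memory.append((state, action, total_reward, next_state, done))

        if len(memory) >= BATCH_SIZE:
            batch = random.sample(memory, BATCH_SIZE)
            s, a, r, s2, d = map(lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
            a = a.long()
            with torch.no_grad():
                y = r + GAMMA * q_target(s2).max(dim=1).values * (1 - d)   # TD target from target network
            q = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        step += 1
        if step % TARGET_UPDATE_N == 0:
            q_target.load_state_dict(q_eval.state_dict())  # periodic target-network update
        state = next_state
        if done:
            break
```

In this sketch the target network is refreshed every TARGET_UPDATE_N environment steps, mirroring the "update step N" condition in the listing; all other numerical choices are illustrative defaults.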