Sensors. 2022 Sep 14;22(18):6942. doi: 10.3390/s22186942
Algorithm 1. Training algorithm using the MADDPG framework.
Parameters: batch size β, number of training episodes M, training steps per episode T, action noise 𝒩, number of USVs N, and actor network weights θ_i^μ for each agent i.
Initialize: actor networks μ_i and critic networks Q_i with weights θ_i^μ and θ_i^Q.
Initialize: target actor and critic networks μ'_i and Q'_i with weights θ_i^{μ'} ← θ_i^μ and θ_i^{Q'} ← θ_i^Q.
Initialize: replay buffer D with capacity C; exploration counter Counter = 0.
1: for episode = 1 to M do
2:   Initialize observations o_init and set o_new ← o_init;
3:   for step t = 1 to T do
4:     if Counter < C then
5:       each USV i randomly chooses action a_i;
6:     else
7:       a_i = μ_{θ_i}(o_{new,i}) + 𝒩;
8:     end if
9:     Execute actions a = (a_1, a_2, …, a_N), and observe reward r and new observations o_new;
10:    Store transition (o, a, r, o_new) into replay buffer D;
11:    Sample a mini-batch of β transitions (o^m, a^m, r^m, o_new^m) from replay buffer D;
12:    Set y^m = r^m + γ Q_i^{μ'}(o_new^m, a'_1, …, a'_N)|_{a'_j = μ'_j(o_j^m)};
13:    Update the critic by minimizing the loss
         L(θ_i) = (1/β) Σ_{m=1}^{β} ( y^m − Q_i^μ(o^m, a_1^m, …, a_N^m) )²;
14:    Update the actor using the sampled policy gradient:
         ∇_{θ_i} J ≈ (1/β) Σ_{m=1}^{β} ∇_{θ_i} μ_{θ_i}(o_i^m) ∇_{a_i} Q_i^μ(o^m, a_1^m, …, a_N^m)|_{a_i = μ_{θ_i}(o_i^m)};
15:    Update the target actor and critic networks with soft-update rate τ:
         θ_i^{μ'} ← τ θ_i^μ + (1 − τ) θ_i^{μ'}
         θ_i^{Q'} ← τ θ_i^Q + (1 − τ) θ_i^{Q'}
16:  end for
17: end for
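The loop above can be sketched in Python with toy linear actors and critics and a stand-in environment. All dimensions, learning rates, the reward in `env_step`, and the buffer/batch sizes below are illustrative assumptions, not values from the paper; the sketch only mirrors the structure of Algorithm 1 (exploration until the buffer fills, TD target from target networks, critic and actor updates, soft target updates).

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, OBS_DIM, ACT_DIM = 2, 3, 1            # toy sizes (assumed)
GAMMA, TAU, LR, BETA, CAP = 0.95, 0.01, 1e-3, 32, 200

# Linear actor per agent: a_i = W_i @ o_i.
# Linear critic per agent: Q_i(o, a) = w_i @ concat(o_1..o_N, a_1..a_N).
actors  = [rng.normal(0, 0.1, (ACT_DIM, OBS_DIM)) for _ in range(N_AGENTS)]
critics = [rng.normal(0, 0.1, N_AGENTS * (OBS_DIM + ACT_DIM)) for _ in range(N_AGENTS)]
t_actors  = [W.copy() for W in actors]          # target networks start as copies
t_critics = [w.copy() for w in critics]

buffer = []                                     # replay buffer D with capacity CAP

def soft_update(target, source, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta'  (step 15)."""
    return tau * source + (1.0 - tau) * target

def q_value(w, obs, acts):
    return w @ np.concatenate([*obs, *acts])

def env_step(obs, acts):
    """Stand-in environment: random next observations, toy reward -|a|."""
    nxt = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
    r = -float(sum(abs(a).sum() for a in acts))
    return r, nxt

obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
for step in range(500):
    if len(buffer) < CAP:                       # exploration phase: Counter < C
        acts = [rng.uniform(-1, 1, ACT_DIM) for _ in range(N_AGENTS)]
    else:                                       # a_i = mu(o_i) + noise
        acts = [actors[i] @ obs[i] + 0.1 * rng.normal(size=ACT_DIM)
                for i in range(N_AGENTS)]
    r, nxt = env_step(obs, acts)
    buffer.append((obs, acts, r, nxt))
    buffer = buffer[-CAP:]                      # bounded capacity

    if len(buffer) >= BETA:
        batch = [buffer[j] for j in rng.choice(len(buffer), BETA, replace=False)]
        for i in range(N_AGENTS):
            # Critic update (steps 12-13): TD target from target networks.
            grad_w = np.zeros_like(critics[i])
            for o, a, rew, on in batch:
                a_next = [t_actors[k] @ on[k] for k in range(N_AGENTS)]
                y = rew + GAMMA * q_value(t_critics[i], on, a_next)
                x = np.concatenate([*o, *a])
                grad_w += (q_value(critics[i], o, a) - y) * x
            critics[i] -= LR * (2.0 / BETA) * grad_w

            # Actor update (step 14): for a linear critic, grad_a Q is the
            # slice of critic weights that multiplies agent i's action.
            start = N_AGENTS * OBS_DIM + i * ACT_DIM
            dq_da = critics[i][start:start + ACT_DIM]
            grad_W = np.zeros_like(actors[i])
            for o, a, rew, on in batch:
                grad_W += np.outer(dq_da, o[i])  # chain rule with a_i = W_i o_i
            actors[i] += LR * grad_W / BETA      # gradient ascent on J

            # Soft target updates (step 15).
            t_actors[i]  = soft_update(t_actors[i], actors[i])
            t_critics[i] = soft_update(t_critics[i], critics[i])
    obs = nxt
```

The same skeleton carries over to neural-network actors and critics by replacing the hand-derived linear gradients with an autodiff framework; the replay buffer, TD target from target networks, and the τ-weighted soft update are the parts specific to the MADDPG training scheme.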