| Algorithm 1. Training algorithm using the MADDPG framework. |
| Parameters: batch size $S$, number of training episodes $M$, training steps per episode $T$, action-exploration noise $\mathcal{N}_t$, number of USVs $N$, and actor-network weights $\theta_i$ for each agent $i$. |
| Initialize: actor networks $\mu_i$ and critic networks $Q_i$ with weights $\theta_i$ and $w_i$, $i = 1, \dots, N$. |
| Initialize: target actor and critic networks $\mu_i'$ and $Q_i'$ with weights $\theta_i' \leftarrow \theta_i$ and $w_i' \leftarrow w_i$. |
| Initialize: replay buffer $\mathcal{D}$ with capacity $C$ and exploration counter $\epsilon$. |
| 1: for episode $= 1$ to $M$ do |
| 2: Initialize observations $o_1, \dots, o_N$; |
| 3: for step $t = 1$ to $T$ do |
| 4: if the current step lies within the exploration phase tracked by $\epsilon$ then |
| 5: each USV $i$ randomly chooses an action $a_i$ from its action space; |
| 6: else |
| 7: each USV $i$ selects $a_i = \mu_i(o_i; \theta_i) + \mathcal{N}_t$; |
| 8: end if |
| 9: Execute actions $a = (a_1, \dots, a_N)$, and observe rewards $r = (r_1, \dots, r_N)$ and new observations $o' = (o_1', \dots, o_N')$; |
| 10: Store transition $(o, a, r, o')$ into replay buffer $\mathcal{D}$; |
| 11: Sample a mini-batch of $S$ transitions $(o^j, a^j, r^j, o'^j)$ from $\mathcal{D}$; |
| 12: Set $y^j = r_i^j + \gamma Q_i'\big(o'^j, a_1', \dots, a_N'\big)\big|_{a_k' = \mu_k'(o_k'^j)}$; |
| 13: Update each critic by minimizing the loss $L(w_i) = \frac{1}{S} \sum_j \big( y^j - Q_i(o^j, a_1^j, \dots, a_N^j) \big)^2$; |
| 14: Update each actor using the sampled policy gradient $\nabla_{\theta_i} J \approx \frac{1}{S} \sum_j \nabla_{\theta_i} \mu_i(o_i^j) \, \nabla_{a_i} Q_i\big(o^j, a_1^j, \dots, a_i, \dots, a_N^j\big)\big|_{a_i = \mu_i(o_i^j)}$; |
| 15: Update target networks with update rate $\tau$: $\theta_i' \leftarrow \tau \theta_i + (1-\tau)\,\theta_i'$ and $w_i' \leftarrow \tau w_i + (1-\tau)\,w_i'$; |
| 16: end for |
| 17: end for |
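The update steps above can be sketched in code. The following is a minimal NumPy illustration, not the paper's implementation: linear deterministic actors and centralized linear critics stand in for the neural networks, the environment, reward, dimensions, learning rate, and the names `act`, `critic_input`, `soft_update`, and `update_agent` are all assumptions introduced here. What it shows is the structure of steps 12–15: the TD target computed with the target networks, the squared-error critic update, the deterministic policy gradient for the actor, and the soft target-network update.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, OBS_DIM, ACT_DIM = 2, 4, 2      # hypothetical sizes
TAU, GAMMA, BATCH, LR = 0.01, 0.95, 32, 0.05

# Linear stand-ins for the networks: mu_i(o) = W_i o, Q_i(x, a) = c_i . phi(x, a)
actors  = [0.1 * rng.normal(size=(ACT_DIM, OBS_DIM)) for _ in range(N_AGENTS)]
critics = [0.1 * rng.normal(size=N_AGENTS * (OBS_DIM + ACT_DIM)) for _ in range(N_AGENTS)]
t_actors  = [a.copy() for a in actors]     # target networks start as copies
t_critics = [c.copy() for c in critics]
buffer = []                                # replay buffer D

def act(i, obs, noise=0.1):
    """a_i = mu_i(o_i) + N_t with Gaussian exploration noise."""
    return actors[i] @ obs + noise * rng.normal(size=ACT_DIM)

def critic_input(obs_all, act_all):
    """Centralized critic input: all observations and all actions."""
    return np.concatenate([*obs_all, *act_all])

def soft_update(targets, sources, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta' (step 15)."""
    for tgt, src in zip(targets, sources):
        tgt *= (1.0 - tau)
        tgt += tau * src

# Collect transitions from a toy stationary "environment" (steps 9-10).
for _ in range(200):
    obs  = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
    acts = [act(i, obs[i]) for i in range(N_AGENTS)]
    rews = [-float(np.sum(a ** 2)) for a in acts]          # toy reward
    nobs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
    buffer.append((obs, acts, rews, nobs))

def update_agent(i):
    """One critic/actor/target update for agent i (steps 11-15)."""
    idx = rng.choice(len(buffer), BATCH, replace=False)    # step 11
    grad_c = np.zeros_like(critics[i])
    grad_a = np.zeros_like(actors[i])
    for obs, acts, rews, nobs in (buffer[k] for k in idx):
        # Step 12: TD target from the *target* actor and critic networks.
        nacts = [t_actors[k] @ nobs[k] for k in range(N_AGENTS)]
        y = rews[i] + GAMMA * (t_critics[i] @ critic_input(nobs, nacts))
        # Step 13: gradient of the squared TD error for the linear critic.
        phi = critic_input(obs, acts)
        q = critics[i] @ phi
        grad_c += 2.0 * (q - y) * phi / BATCH
        # Step 14: deterministic policy gradient. For a linear critic,
        # dQ/da_i is simply the critic-weight slice multiplying agent i's action.
        off = N_AGENTS * OBS_DIM + i * ACT_DIM
        dq_da = critics[i][off:off + ACT_DIM]
        grad_a += np.outer(dq_da, obs[i]) / BATCH          # chain rule through mu_i
    critics[i] -= LR * grad_c                              # descend the critic loss
    actors[i]  += LR * grad_a                              # ascend the actor objective
    soft_update(t_critics, critics)                        # step 15
    soft_update(t_actors, actors)

init_target = t_actors[0].copy()
for i in range(N_AGENTS):
    update_agent(i)
```

After the updates, the target weights have moved a small step (rate $\tau$) toward the updated main networks, which is what keeps the TD targets in step 12 slowly varying and the training stable.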