Sensors. 2022 Sep 14;22(18):6942. doi: 10.3390/s22186942
Algorithm 1. Training algorithm using the MADDPG framework.
Parameters: batch size β, number of training episodes M, training steps per episode T, action noise 𝒩, number of USVs N, and actor network weights θ_i^μ for each agent i.
Initialize: actor networks μ_i and critic networks Q_i with weights θ_i^μ and θ_i^Q.
Initialize: target actor and critic networks μ'_i and Q'_i with weights θ_i^{μ'} ← θ_i^μ and θ_i^{Q'} ← θ_i^Q.
Initialize: replay buffer D with capacity C; exploration counter Counter = 0.
1: for episode = 1 to M do
2:   Initialize observations o_init and set o_new ← o_init;
3:   for step t = 1 to T do
4:     if Counter < C then
5:       each USV i randomly chooses action a_i;
6:     else
7:       a_i = μ_{θ_i}(o_{new,i}) + 𝒩;
8:     end if
9:     Execute actions a = (a_1, a_2, …, a_N), and observe reward r and new observations o_new;
10:    Store transition (o, a, r, o_new) into replay buffer D;
11:    Sample a mini-batch of β transitions (o^m, a^m, r^m, o_new^m) from replay buffer D;
12:    Set y^m = r^m + γ Q_i^{μ'}(o_new^m, a'_1, …, a'_N)|_{a'_j = μ'_j(o_j^m)};
13:    Update the critic by minimizing the loss
         L(θ_i) = (1/β) Σ_{m=1}^{β} ( y^m − Q_i^μ(o^m, a_1^m, …, a_N^m) )²;
14:    Update the actor using the sampled policy gradient:
         ∇_{θ_i} J ≈ (1/β) Σ_{m=1}^{β} ∇_{θ_i} μ_{θ_i}(o_i^m) ∇_{a_i} Q_i^μ(o^m, a_1^m, …, a_N^m)|_{a_i = μ_{θ_i}(o_i^m)};
15:    Update the target actor and critic networks with soft-update rate τ:
         θ_i^{μ'} ← τ θ_i^μ + (1 − τ) θ_i^{μ'}
         θ_i^{Q'} ← τ θ_i^Q + (1 − τ) θ_i^{Q'}
16:  end for
17: end for
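The loop above can be sketched in Python with toy linear actors and critics and a stand-in environment. All dimensions, learning rates, the reward in `env_step`, and the buffer/batch sizes below are illustrative assumptions, not values from the paper; the sketch only mirrors the structure of Algorithm 1 (exploration until the buffer fills, TD target from target networks, critic and actor updates, soft target updates).

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, OBS_DIM, ACT_DIM = 2, 3, 1            # toy sizes (assumed)
GAMMA, TAU, LR, BETA, CAP = 0.95, 0.01, 1e-3, 32, 200

# Linear actor per agent: a_i = W_i @ o_i.
# Linear critic per agent: Q_i(o, a) = w_i @ concat(o_1..o_N, a_1..a_N).
actors  = [rng.normal(0, 0.1, (ACT_DIM, OBS_DIM)) for _ in range(N_AGENTS)]
critics = [rng.normal(0, 0.1, N_AGENTS * (OBS_DIM + ACT_DIM)) for _ in range(N_AGENTS)]
t_actors  = [W.copy() for W in actors]          # target networks start as copies
t_critics = [w.copy() for w in critics]

buffer = []                                     # replay buffer D with capacity CAP

def soft_update(target, source, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta'  (step 15)."""
    return tau * source + (1.0 - tau) * target

def q_value(w, obs, acts):
    return w @ np.concatenate([*obs, *acts])

def env_step(obs, acts):
    """Stand-in environment: random next observations, toy reward -|a|."""
    nxt = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
    r = -float(sum(abs(a).sum() for a in acts))
    return r, nxt

obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
for step in range(500):
    if len(buffer) < CAP:                       # exploration phase: Counter < C
        acts = [rng.uniform(-1, 1, ACT_DIM) for _ in range(N_AGENTS)]
    else:                                       # a_i = mu(o_i) + noise
        acts = [actors[i] @ obs[i] + 0.1 * rng.normal(size=ACT_DIM)
                for i in range(N_AGENTS)]
    r, nxt = env_step(obs, acts)
    buffer.append((obs, acts, r, nxt))
    buffer = buffer[-CAP:]                      # bounded capacity

    if len(buffer) >= BETA:
        batch = [buffer[j] for j in rng.choice(len(buffer), BETA, replace=False)]
        for i in range(N_AGENTS):
            # Critic update (steps 12-13): TD target from target networks.
            grad_w = np.zeros_like(critics[i])
            for o, a, rew, on in batch:
                a_next = [t_actors[k] @ on[k] for k in range(N_AGENTS)]
                y = rew + GAMMA * q_value(t_critics[i], on, a_next)
                x = np.concatenate([*o, *a])
                grad_w += (q_value(critics[i], o, a) - y) * x
            critics[i] -= LR * (2.0 / BETA) * grad_w

            # Actor update (step 14): for a linear critic, grad_a Q is the
            # slice of critic weights that multiplies agent i's action.
            start = N_AGENTS * OBS_DIM + i * ACT_DIM
            dq_da = critics[i][start:start + ACT_DIM]
            grad_W = np.zeros_like(actors[i])
            for o, a, rew, on in batch:
                grad_W += np.outer(dq_da, o[i])  # chain rule with a_i = W_i o_i
            actors[i] += LR * grad_W / BETA      # gradient ascent on J

            # Soft target updates (step 15).
            t_actors[i]  = soft_update(t_actors[i], actors[i])
            t_critics[i] = soft_update(t_critics[i], critics[i])
    obs = nxt
```

The same skeleton carries over to neural-network actors and critics by replacing the hand-derived linear gradients with an autodiff framework; the replay buffer, TD target from target networks, and the τ-weighted soft update are the parts specific to the MADDPG training scheme.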