Sensors. 2024 Apr 9;24(8):2386. doi: 10.3390/s24082386
Algorithm 2 MODDPG
Input: state space s_rl
Output: the action of the UAV, a_rl
 1: Initialize Actor network parameters θ^μ and Target Actor network parameters θ^μ′, θ^μ′ ← θ^μ.
 2: Initialize Critic network parameters θ^Q and Target Critic network parameters θ^Q′, θ^Q′ ← θ^Q.
 3: Initialize the experience replay buffer P.
 4: for episode = 0 to M do
 5:     Initialize s_rl.
 6:     for time step = t_1, t_2, …, t_n do
 7:         repeat
 8:             With probability ϵ, choose an action a_rl = clip(μ(s_rl | θ^μ) + ϵ, a_low, a_high).
 9:             Perform action a_rl and observe the reward r_rl and the next state s′_rl.
10:             Store the transition (s_rl, a_rl, r_rl, s′_rl) in P.
11:             if |P| ≥ batch size then
12:                 Randomly sample a mini-batch of transitions (s_rl, a_rl, r_rl, s′_rl) from P.
13:                 Compute the target value y_rl by Equation (19).
14:                 Update the Critic network by minimizing the critic loss, Equation (20).
15:                 Update the Actor network by maximizing the actor objective, Equation (21).
16:                 Soft-update the target network parameters:
17:                 θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′
                    θ^μ′ ← τ θ^μ + (1 − τ) θ^μ′
18:             end if
19:         until ∑_{n=0}^{N} t_n ≥ T
20:     end for
21: end for
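The mechanical parts of the listing above — the clipped exploratory action (step 8), storing transitions in the replay buffer P and training only once a full batch is available (steps 10–12), and the soft target update (step 17) — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear "Actor", the state/action dimensions, the placeholder environment dynamics and reward, and all constants are assumptions, and the critic/actor updates of Equations (19)–(21) are left as a comment.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the paper does not specify these.
STATE_DIM, ACTION_DIM = 4, 2
A_LOW, A_HIGH = -1.0, 1.0          # action bounds a_low, a_high (step 8)
TAU = 0.005                         # soft-update rate τ (step 17)
BATCH_SIZE = 8

# Stand-in linear "Actor": a = W s. The real Actor/Critic are neural networks.
theta_mu = rng.normal(size=(ACTION_DIM, STATE_DIM))
theta_mu_target = theta_mu.copy()   # step 1: θ^μ′ ← θ^μ

replay_buffer = deque(maxlen=10_000)  # step 3: experience replay buffer P

def select_action(state, noise_scale=0.1):
    """Step 8: exploratory action clipped to [a_low, a_high]."""
    a = theta_mu @ state + rng.normal(scale=noise_scale, size=ACTION_DIM)
    return np.clip(a, A_LOW, A_HIGH)

def soft_update(target, online, tau=TAU):
    """Step 17: θ′ ← τ θ + (1 − τ) θ′."""
    return tau * online + (1.0 - tau) * target

# A few interaction steps against placeholder dynamics.
state = rng.normal(size=STATE_DIM)
for _ in range(BATCH_SIZE):
    action = select_action(state)
    next_state = rng.normal(size=STATE_DIM)      # placeholder transition
    reward = -float(np.sum(action ** 2))         # placeholder reward
    replay_buffer.append((state, action, reward, next_state))  # step 10
    state = next_state

# Steps 11-12: train only once |P| reaches the batch size.
if len(replay_buffer) >= BATCH_SIZE:
    idx = rng.choice(len(replay_buffer), size=BATCH_SIZE, replace=False)
    batch = [replay_buffer[i] for i in idx]
    # ... compute y_rl (Eq. 19) and the critic/actor updates (Eqs. 20-21) ...
    theta_mu_target = soft_update(theta_mu_target, theta_mu)
```

The gating in the last block mirrors steps 11–12: no gradient step is taken until the buffer holds at least one mini-batch, after which the target parameters track the online parameters at rate τ.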