Algorithm 1 The UAV Maneuver Decision-Making Algorithm for Airdrop Task.
  • Input:
    • The hyperparameters of the training networks: minibatch size k, network learning rate η;
    • The hyperparameters of the policy update: policy learning rate σ, learning period K, memory capacity N, soft-update coefficient τ;
    • The hyperparameters of sampling: the priority exponent of PER α, the importance sampling (IS) exponent β;
    • The control parameters of the simulation: maximum number of periods M, maximum number of steps per period T.
  • Output:
    • Critic network Q(s, a; θ^Q) and its target network Q′(s, a; θ^{Q′});
    • Actor network μ(s; θ^μ) and its target network μ′(s; θ^{μ′}).
  • 1: Initialize Q(s, a; θ^Q), μ(s; θ^μ) and their target networks Q′(s, a; θ^{Q′}), μ′(s; θ^{μ′}).
  • 2: for m = 1 to M do
  • 3:   Reset the environment and read the initial state s_0.
  • 4:   Output a_0 according to Equation (18).
  • 5:   for t = 1 to T do
  • 6:     Observe the current state s_t and the reward r_t of the environment and calculate the current action a_t according to Equation (18).
  • 7:     Save the current transition (s_t, a_t, r_t, s_{t+1}) into the experience memory D.
  • 8:     if t mod K = 0 then
  • 9:       Reset the accumulated IS-weighted gradient Δ = 0 of Q(s_j, a_j; θ^Q).
  • 10:      for j = 1 to k do
  • 11:        Sample training data j ~ P(j) according to Equation (27).
  • 12:        Calculate the IS weight ω_j according to Equation (31).
  • 13:        Calculate the TD-error δ_j of the training data according to Equation (22) and update its priority according to Equation (28).
  • 14:        Accumulate Δ according to Equation (30).
  • 15:      end for
  • 16:      Update the parameters of Q(s_j, a_j; θ^Q) with Δ and learning rate η.
  • 17:      Update the parameters of μ(s; θ^μ) according to Equation (26).
  • 18:      Update the parameters of the target networks Q′(s, a; θ^{Q′}) and μ′(s; θ^{μ′}) according to Equation (32).
  • 19:     end if
  • 20:   end for
  • 21: end for
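
As a concrete reading of steps 10–14, the Python sketch below implements a minimal proportional prioritized-replay buffer. It is an illustration under the standard PER conventions (Schaul et al.), not the implementation behind Algorithm 1: the sampling probability P(j) ∝ p_j^α stands in for Equation (27), the normalized IS weight ω_j = (N·P(j))^(−β) for Equation (31) (with N taken here as the current number of stored transitions), and the priority refresh p_j = |δ_j| + ε for Equation (28); the class name, the constant ε, and the default exponents are illustrative choices.

    import numpy as np

    class PrioritizedReplayBuffer:
        """Minimal proportional PER buffer (illustrative sketch, not the paper's code)."""

        def __init__(self, capacity, alpha=0.6, eps=1e-5):
            self.capacity = capacity          # memory capacity N
            self.alpha = alpha                # priority exponent of PER
            self.eps = eps                    # keeps priorities strictly positive
            self.data = [None] * capacity
            self.priorities = np.zeros(capacity, dtype=np.float64)
            self.pos = 0
            self.size = 0

        def store(self, transition):
            """Save (s_t, a_t, r_t, s_{t+1}) with maximal priority so it is replayed at least once."""
            max_p = self.priorities[: self.size].max() if self.size > 0 else 1.0
            self.data[self.pos] = transition
            self.priorities[self.pos] = max_p
            self.pos = (self.pos + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)

        def sample(self, k, beta=0.4):
            """Draw a minibatch of size k with P(j) ∝ p_j^α and return the IS weights ω_j."""
            p = self.priorities[: self.size] ** self.alpha
            P = p / p.sum()                                   # Eq. (27): sampling probability
            idx = np.random.choice(self.size, size=k, p=P)
            weights = (self.size * P[idx]) ** (-beta)         # Eq. (31): IS correction
            weights /= weights.max()                          # normalize for stability
            batch = [self.data[j] for j in idx]
            return batch, idx, weights

        def update_priorities(self, idx, td_errors):
            """Eq. (28): refresh priorities from the TD-errors δ_j of the sampled data."""
            self.priorities[idx] = np.abs(td_errors) + self.eps

A minibatch drawn with sample(k, beta) returns both the transitions and the weights ω_j needed for the accumulation in step 14; once the TD-errors δ_j are computed, update_priorities closes the loop of step 13.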
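
Steps 16–18 combine the IS-weighted critic update with the "soft" target update of Equation (32), θ′ ← τθ + (1 − τ)θ′. The short PyTorch fragment below sketches both operations under that reading; the function names and the mean reduction of the weighted squared TD-error are assumptions for illustration, not the authors' code.

    import torch

    def soft_update(target_net, online_net, tau):
        # Eq. (32): θ' ← τ·θ + (1 − τ)·θ', applied parameter-wise to the target network.
        with torch.no_grad():
            for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
                p_target.mul_(1.0 - tau).add_(tau * p_online)

    def weighted_critic_loss(td_errors, is_weights):
        # IS-weighted squared TD-error whose gradient plays the role of the accumulated Δ
        # (Eqs. (22) and (30)); minimizing it with learning rate η corresponds to step 16.
        return (is_weights * td_errors.pow(2)).mean()

In a training loop, soft_update would be called once per learning period K for both the critic and the actor target networks, matching step 18 of Algorithm 1.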