Biomimetics. 2022 Nov 11;7(4):197. doi: 10.3390/biomimetics7040197
Algorithm 1. SAC-Based Training Algorithm
  1. Initialize the learning rate of each network λ_V, λ_Q, λ_π;
  2. Initialize the target network soft update rate τ and the entropy regularization weight α;
  3. Initialize the parameters of each network θ, ϕ, ψ, ψ̄;
  4. Initialize the replay buffer;
  5. For each episode:
    (1) Initialize the UAV starting position;
    (2) Reset the parameters of the interactive environment;
    (3) Receive the initial observation of the image state s_0;
    (4) For each time step t = 1, 2, 3, …:
      1. Feed the current state into the actor network and generate the action a_t;
      2. Normalize the action and convert it into a speed control command;
      3. Control the UAV with the command, then observe the reward r_{t+1} and the new image state s_{t+1};
      4. Store the experience tuple (s_t, a_t, s_{t+1}, r_{t+1}) in the replay buffer;
      5. Sample a batch of replayed experience (s_b, a_b, s_{b+1}, r_{b+1}) from the replay buffer;
      6. Update the behavior value network: ψ ← ψ − λ_V ∇̂_ψ J_V(ψ);
      7. Update the Q network: ϕ ← ϕ − λ_Q ∇̂_ϕ J_Q(ϕ);
      8. Update the policy network: θ ← θ − λ_π ∇̂_θ J_π(θ);
      9. Update the target value network: ψ̄ ← τψ + (1 − τ)ψ̄;
      10. If the terminal condition is satisfied, start a new episode; otherwise, continue to the next time step.

    End of the time-step loop;
  End of the episode loop.
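The replay buffer (step 4) and the mapping from normalized actor outputs to speed commands (inner-loop steps 2, 4, and 5) can be sketched as follows. This is a minimal Python sketch, not the paper's implementation; the buffer capacity, the speed limit `v_max`, and the class/function names are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size buffer holding (s_t, a_t, s_{t+1}, r_{t+1}) tuples (inner-loop step 4)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Inner-loop step 5: uniformly sample a replayed batch (s_b, a_b, s_{b+1}, r_{b+1}).
        batch = random.sample(self.buffer, batch_size)
        s, a, s_next, r = map(np.asarray, zip(*batch))
        return s, a, s_next, r


def to_speed_command(action, v_max=2.0):
    """Inner-loop step 2: map a normalized actor output in [-1, 1] to a speed
    control command in [-v_max, v_max] per axis (v_max is an assumed limit)."""
    return np.clip(action, -1.0, 1.0) * v_max
```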
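The gradient updates in steps 6–9 can be sketched in PyTorch as below. This is a minimal sketch under several assumptions not stated in the listing: fully connected networks acting on a pre-extracted state feature (the paper uses image states), a single Q network, a squashed Gaussian policy, and illustrative values for the discount γ, the entropy weight α, the soft-update rate τ, and the learning rates λ_V, λ_Q, λ_π.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Small fully connected network used for all function approximators in this sketch."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)


state_dim, action_dim = 64, 3             # assumed feature/action sizes
gamma, alpha, tau = 0.99, 0.2, 0.005      # assumed discount, entropy weight α, soft-update rate τ

value_net = MLP(state_dim, 1)                       # V_ψ  (behavior value network)
target_net = MLP(state_dim, 1)                      # V_ψ̄  (target value network)
target_net.load_state_dict(value_net.state_dict())
q_net = MLP(state_dim + action_dim, 1)              # Q_ϕ
policy_net = MLP(state_dim, 2 * action_dim)         # π_θ: outputs mean and log-std

opt_v = torch.optim.Adam(value_net.parameters(), lr=3e-4)    # λ_V
opt_q = torch.optim.Adam(q_net.parameters(), lr=3e-4)        # λ_Q
opt_pi = torch.optim.Adam(policy_net.parameters(), lr=3e-4)  # λ_π


def sample_action(s):
    """Squashed-Gaussian action a and its log-probability log π(a|s)."""
    mean, log_std = policy_net(s).chunk(2, dim=-1)
    std = log_std.clamp(-20, 2).exp()
    dist = torch.distributions.Normal(mean, std)
    u = dist.rsample()                               # reparameterized sample
    a = torch.tanh(u)
    log_pi = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1, keepdim=True)
    return a, log_pi


def sac_update(s, a, s_next, r):
    """One SAC update on a sampled batch (steps 6-9); r is expected with shape [B, 1]."""
    # Step 6: behavior value network, ψ ← ψ − λ_V ∇̂_ψ J_V(ψ)
    a_pi, log_pi = sample_action(s)
    q_pi = q_net(torch.cat([s, a_pi], dim=-1))
    v_target = (q_pi - alpha * log_pi).detach()
    v_loss = F.mse_loss(value_net(s), v_target)
    opt_v.zero_grad(); v_loss.backward(); opt_v.step()

    # Step 7: Q network, ϕ ← ϕ − λ_Q ∇̂_ϕ J_Q(ϕ)
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next)
    q_loss = F.mse_loss(q_net(torch.cat([s, a], dim=-1)), q_target)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # Step 8: policy network, θ ← θ − λ_π ∇̂_θ J_π(θ)
    a_pi, log_pi = sample_action(s)
    pi_loss = (alpha * log_pi - q_net(torch.cat([s, a_pi], dim=-1))).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()

    # Step 9: soft update of the target value network, ψ̄ ← τψ + (1 − τ)ψ̄
    with torch.no_grad():
        for p, p_bar in zip(value_net.parameters(), target_net.parameters()):
            p_bar.mul_(1 - tau).add_(tau * p)
```

With a batch of tensors (s, a, s_next, r) drawn from the replay buffer, one call to `sac_update` performs steps 6–9. For brevity, the Q target in step 7 omits a terminal-state mask; in practice the bootstrap term γV_ψ̄(s′) is zeroed at terminal transitions.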