Biomimetics. 2022 Nov 11;7(4):197. doi: 10.3390/biomimetics7040197
Algorithm 1. SAC-Based Training Algorithm
  1. Initialize the learning rate of each network λ_V, λ_Q, λ_π;
  2. Initialize the target network soft update rate τ and the entropy regularization weight α;
  3. Initialize the parameters of each network θ, ϕ, ψ, ψ̄;
  4. Initialize the replay buffer;
  5. For each episode:
    (1) Initialize the UAV starting position;
    (2) Reset the parameters of the interactive environment;
    (3) Receive the initial observation of the image state s_0;
    (4) For each time step t = 1, 2, 3, …:
      1. Feed the current state into the actor network and generate the action a_t;
      2. Normalize the action and convert it into a speed control command;
      3. Control the UAV with the command, then observe the reward r_{t+1} and the new image state s_{t+1};
      4. Store the experience tuple (s_t, a_t, s_{t+1}, r_{t+1}) in the replay buffer;
      5. Sample a batch of replayed experience (s_b, a_b, s_{b+1}, r_{b+1}) from the replay buffer;
      6. Update the behavior value network: ψ ← ψ − λ_V ∇̂_ψ J_V(ψ);
      7. Update the Q network: ϕ ← ϕ − λ_Q ∇̂_ϕ J_Q(ϕ);
      8. Update the policy network: θ ← θ − λ_π ∇̂_θ J_π(θ);
      9. Update the target value network: ψ̄ ← τψ + (1 − τ)ψ̄;
      10. If the terminal condition is satisfied, start a new episode; otherwise, continue to the next time step.

    End of the time-step loop;
  End of the episode loop.
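The replay buffer (step 4) and the mapping from normalized actor outputs to speed commands (inner-loop steps 2, 4, and 5) can be sketched as follows. This is a minimal Python sketch, not the paper's implementation; the buffer capacity, the speed limit `v_max`, and the class/function names are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size buffer holding (s_t, a_t, s_{t+1}, r_{t+1}) tuples (inner-loop step 4)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Inner-loop step 5: uniformly sample a replayed batch (s_b, a_b, s_{b+1}, r_{b+1}).
        batch = random.sample(self.buffer, batch_size)
        s, a, s_next, r = map(np.asarray, zip(*batch))
        return s, a, s_next, r


def to_speed_command(action, v_max=2.0):
    """Inner-loop step 2: map a normalized actor output in [-1, 1] to a speed
    control command in [-v_max, v_max] per axis (v_max is an assumed limit)."""
    return np.clip(action, -1.0, 1.0) * v_max
```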
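The gradient updates in steps 6–9 can be sketched in PyTorch as below. This is a minimal sketch under several assumptions not stated in the listing: fully connected networks acting on a pre-extracted state feature (the paper uses image states), a single Q network, a squashed Gaussian policy, and illustrative values for the discount γ, the entropy weight α, the soft-update rate τ, and the learning rates λ_V, λ_Q, λ_π.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Small fully connected network used for all function approximators in this sketch."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)


state_dim, action_dim = 64, 3             # assumed feature/action sizes
gamma, alpha, tau = 0.99, 0.2, 0.005      # assumed discount, entropy weight α, soft-update rate τ

value_net = MLP(state_dim, 1)                       # V_ψ  (behavior value network)
target_net = MLP(state_dim, 1)                      # V_ψ̄  (target value network)
target_net.load_state_dict(value_net.state_dict())
q_net = MLP(state_dim + action_dim, 1)              # Q_ϕ
policy_net = MLP(state_dim, 2 * action_dim)         # π_θ: outputs mean and log-std

opt_v = torch.optim.Adam(value_net.parameters(), lr=3e-4)    # λ_V
opt_q = torch.optim.Adam(q_net.parameters(), lr=3e-4)        # λ_Q
opt_pi = torch.optim.Adam(policy_net.parameters(), lr=3e-4)  # λ_π


def sample_action(s):
    """Squashed-Gaussian action a and its log-probability log π(a|s)."""
    mean, log_std = policy_net(s).chunk(2, dim=-1)
    std = log_std.clamp(-20, 2).exp()
    dist = torch.distributions.Normal(mean, std)
    u = dist.rsample()                               # reparameterized sample
    a = torch.tanh(u)
    log_pi = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1, keepdim=True)
    return a, log_pi


def sac_update(s, a, s_next, r):
    """One SAC update on a sampled batch (steps 6-9); r is expected with shape [B, 1]."""
    # Step 6: behavior value network, ψ ← ψ − λ_V ∇̂_ψ J_V(ψ)
    a_pi, log_pi = sample_action(s)
    q_pi = q_net(torch.cat([s, a_pi], dim=-1))
    v_target = (q_pi - alpha * log_pi).detach()
    v_loss = F.mse_loss(value_net(s), v_target)
    opt_v.zero_grad(); v_loss.backward(); opt_v.step()

    # Step 7: Q network, ϕ ← ϕ − λ_Q ∇̂_ϕ J_Q(ϕ)
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next)
    q_loss = F.mse_loss(q_net(torch.cat([s, a], dim=-1)), q_target)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # Step 8: policy network, θ ← θ − λ_π ∇̂_θ J_π(θ)
    a_pi, log_pi = sample_action(s)
    pi_loss = (alpha * log_pi - q_net(torch.cat([s, a_pi], dim=-1))).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()

    # Step 9: soft update of the target value network, ψ̄ ← τψ + (1 − τ)ψ̄
    with torch.no_grad():
        for p, p_bar in zip(value_net.parameters(), target_net.parameters()):
            p_bar.mul_(1 - tau).add_(tau * p)
```

With a batch of tensors (s, a, s_next, r) drawn from the replay buffer, one call to `sac_update` performs steps 6–9. For brevity, the Q target in step 7 omits a terminal-state mask; in practice the bootstrap term γV_ψ̄(s′) is zeroed at terminal transitions.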