Sensors. 2023 Oct 27;23(21):8766. doi: 10.3390/s23218766
Algorithm 3 PPO
  • for each iteration do

  •     for T time steps do

  •         Run previous policy π_{θ_old} and collect (state, action, reward) tuples

  •     end for

  •     Compute advantage estimates Â_1, …, Â_T, one for each action taken

  •     Optimize the surrogate objective L with respect to θ for K epochs, with minibatch size < T

  •     Update policy parameters: θ_old ← θ

  • end for
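
The two core computations in the loop above, advantage estimation and the clipped surrogate objective, can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: it assumes truncated generalized advantage estimation (GAE) with discount γ and smoothing λ for the Â_t, and the standard PPO clipped objective with clipping parameter ε; the function names and default values are illustrative.

```python
import math

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Truncated GAE over a T-step rollout.

    `values` must hold T+1 state-value estimates: one per visited
    state plus a bootstrap value for the state after the final step.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective L^CLIP, averaged over the rollout.

    The optimizer maximizes this with respect to the new policy's
    parameters theta; theta_old stays fixed during the K epochs.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_theta(a|s) / pi_theta_old(a|s)
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

The `min` of the clipped and unclipped terms is what keeps the update conservative: once the ratio leaves the interval [1 − ε, 1 + ε] in the direction favored by the advantage, the gradient through that sample vanishes.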