Sensors. 2023 Jun 27;23(13):5974. doi: 10.3390/s23135974
Algorithm 1 Proximal policy optimization (PPO) for trajectory planning
Require: Start angle $\theta_{\text{start}}$, end angle $\theta_{\text{end}}$, hyperparameters of the actor and critic networks, learning rate $\alpha$, discount factor $\gamma$, generalized advantage estimation (GAE) parameter $\lambda$, clipping parameter $\epsilon$, entropy coefficient $c_{\text{ent}}$, number of iterations $N$, number of epochs $E$, batch size $B$, episode length $T$, accuracy setting $d_e$.

1: for $n = 1, 2, \ldots, N$ do
2:   Collect $T$ time steps of transitions $\{s_t, a_t, r_t, s_{t+1}\}$ using the running policy $\pi_\theta$
3:   if $f_a(\theta) < d_e$ then
4:     $T \leftarrow T_{\text{current}}$
5:   end if
6:   Compute advantage estimates $\hat{A}_t$
7:   for $e = 1, 2, \ldots, E$ do
8:     Randomly sample a batch of $B$ time steps
9:     Update the critic network parameters by minimizing the loss:
10:      $L(\phi) = \frac{1}{B} \sum_{t=1}^{B} \left( V_\phi(s_t) - \hat{R}_t \right)^2$
11:     Update the actor network parameters using the PPO-clip objective and the entropy bonus:
12:      $\theta_{k+1} = \arg\max_\theta \frac{1}{B} \sum_{t=1}^{B} \left[ L^{\text{CLIP}}(\theta) + c_{\text{ent}}\, H\big(\pi_\theta(\cdot \mid s_t)\big) \right]$
13:      $L^{\text{CLIP}}(\theta) = \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} \hat{A}_t,\; \text{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right)$
14:   end for
15: end for
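
For concreteness, the following is a minimal PyTorch sketch of steps 6 and 8-13 of Algorithm 1. The helper names (gae_advantages, ppo_epoch_update), the assumption that the actor returns a torch.distributions object with a per-sample scalar log-probability, the requirement that rollout tensors are detached, and all default hyperparameter values are illustrative choices, not the authors' implementation.

# Minimal sketch of steps 6 and 8-13 of Algorithm 1 in PyTorch.
# Function names, shapes, and hyperparameter defaults are illustrative assumptions.
import torch


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Step 6: generalized advantage estimates A_hat_t and critic targets R_hat_t.
    # rewards, values: 1-D tensors of length T (assumed detached from the graph).
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values          # R_hat_t, used as the critic regression target
    return advantages, returns


def ppo_epoch_update(actor, critic, actor_opt, critic_opt,
                     states, actions, old_log_probs, advantages, returns,
                     batch_size=64, clip_eps=0.2, c_ent=0.01):
    # Steps 8-13: one epoch of minibatch critic regression and clipped actor updates.
    for idx in torch.randperm(states.shape[0]).split(batch_size):   # step 8: random minibatches
        s, a = states[idx], actions[idx]
        adv, ret, old_lp = advantages[idx], returns[idx], old_log_probs[idx]

        # Steps 9-10: L(phi) = (1/B) * sum_t (V_phi(s_t) - R_hat_t)^2
        value_loss = ((critic(s).squeeze(-1) - ret) ** 2).mean()
        critic_opt.zero_grad()
        value_loss.backward()
        critic_opt.step()

        # Steps 11-13: PPO-clip surrogate plus entropy bonus (maximized, hence the minus sign).
        dist = actor(s)   # assumed: returns a torch.distributions object with scalar log_prob per sample
        ratio = torch.exp(dist.log_prob(a) - old_lp)     # pi_theta(a|s) / pi_theta_k(a|s)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        actor_loss = -(torch.min(ratio * adv, clipped * adv)
                       + c_ent * dist.entropy()).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

In the outer loop of Algorithm 1, gae_advantages would be called once per iteration on the freshly collected rollout (step 6), and ppo_epoch_update would then be called E times on the same data (steps 7-14).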