Algorithm 1 Proximal Policy Optimization (PPO)

Require: Initialize the policy parameters θ and the value function parameters ϕ

  while not converged do
    for each training iteration do
      for each environment step do
        s_t ← current state
        a_t ← action sampled from policy π_θ
        r_t ← reward received
        s_{t+1} ← next state after executing a_t
        Store (s_t, a_t, r_t, s_{t+1}) in the buffer
      end for
      Compute the advantage estimates Â_t using Generalized Advantage Estimation (GAE)
      for each policy update step do
        L(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 - ϵ, 1 + ϵ) Â_t ) ]
        Update policy θ using gradient ascent on L(θ)
        L_V(ϕ) = (V_ϕ(s_t) - R_t)^2
        Update value function ϕ by minimizing L_V(ϕ)
      end for
    end for
  end while
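
To make the advantage-estimation step concrete, the following is a minimal NumPy sketch of GAE for a single rollout, assuming no episode terminations inside the rollout; the function name compute_gae and the defaults gamma = 0.99 and lam = 0.95 are illustrative choices, not values taken from the paper.

import numpy as np

def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    # rewards: r_t for t = 0..T-1
    # values:  V(s_t) for t = 0..T-1
    # next_value: bootstrap value V(s_T) for the state after the last step
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Walk backwards: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    # and A_hat_t = delta_t + gamma * lam * A_hat_{t+1}.
    for t in reversed(range(T)):
        v_next = next_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Returns R_t used as value-function targets.
    returns = advantages + np.asarray(values)
    return advantages, returns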
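
The inner policy-update step can be sketched in the same spirit. The PyTorch snippet below computes the clipped surrogate objective L(θ) and the squared-error value loss L_V(ϕ), then takes one gradient step on their sum; here r_t(θ) is the standard PPO probability ratio π_θ(a_t | s_t) / π_θ_old(a_t | s_t) (distinct from the reward r_t in the pseudocode), and policy, value_fn, and the hyperparameter values are hypothetical placeholders rather than the authors' implementation.

import torch

def ppo_update(policy, value_fn, optimizer, states, actions,
               old_log_probs, advantages, returns,
               clip_eps=0.2, vf_coef=0.5):
    # Probability ratio r_t(theta), computed in log space for numerical stability.
    dist = policy(states)          # assumed to return a torch.distributions object
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective L(theta); gradient ascent on L(theta)
    # is implemented as gradient descent on its negative.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss L_V(phi) = (V_phi(s_t) - R_t)^2, averaged over the batch.
    value_loss = (value_fn(states).squeeze(-1) - returns).pow(2).mean()

    loss = policy_loss + vf_coef * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()

In typical use, the parameters of both networks are placed in a single optimizer and this update is repeated for several epochs over minibatches drawn from the buffer, matching the inner "for each policy update step" loop above.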