2020 Dec 15;20(24):7176. doi: 10.3390/s20247176
Algorithm 1 PPO.
  • 1: Input: initial actor-network parameter $\theta_0$ and critic-network parameter $\phi_0$.
  • 2: for k = 1, 2, ... do
  • 3:   Run policy $\pi_{\theta_k}$ in the environment for $T$ timesteps.
  • 4:   Compute rewards-to-go $\hat{R}_t$.
  • 5:   Compute advantage estimates $\hat{A}_t$ based on the current critic network $V_{\phi_k}$.
  • 6:   Update the actor network by maximizing the clipped objective function in Equation (22) via gradient ascent with Adam [47]:
  • 7:     $\theta_{k+1} = \arg\max_{\theta} L_t^{\mathrm{CLIP+VF+S}}(\theta)$
  • 8:   Update the critic network via gradient descent on the squared error against the rewards-to-go:
  • 9:     $\phi_{k+1} = \arg\min_{\phi} \left( V_{\phi}(s_t) - \hat{R}_t \right)^2$
  • 10: end for
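The per-iteration quantities in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The discount factor `gamma`, the clipping range `eps`, and all function names are assumptions introduced here for illustration. The advantage is taken as rewards-to-go minus the critic value, the simplest common choice, since the algorithm box does not say which estimator is used. The surrogate covers only the standard PPO clipping term, $\min(r_t \hat{A}_t, \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\hat{A}_t)$, and leaves out the value-function and entropy terms of the combined objective.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards at each timestep of one trajectory.

    gamma is an assumed discount factor, not a value taken from the source.
    """
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def advantage_estimates(returns, values):
    """Assumed estimator: rewards-to-go minus the critic's value prediction."""
    return [r - v for r, v in zip(returns, values)]


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios holds pi_theta(a|s) / pi_theta_k(a|s) for each sample;
    eps is an assumed clipping range. The actor maximizes this value.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


def critic_loss(values, returns):
    """Mean squared error between critic predictions and rewards-to-go.

    The critic minimizes this value.
    """
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)


if __name__ == "__main__":
    rets = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
    advs = advantage_estimates(rets, [1.0, 1.0, 1.0])
    print(rets, advs)
    print(clipped_surrogate([1.5, 0.5], [1.0, -1.0]))
    print(critic_loss([1.0, 2.0], [0.0, 2.0]))
```

In a full training loop, `clipped_surrogate` would be maximized with respect to the actor parameters and `critic_loss` minimized with respect to the critic parameters, typically by an automatic-differentiation library and an optimizer such as Adam. The sketch only evaluates the two objectives on fixed numbers.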