Algorithm 1 Proximal Policy Optimization (PPO)

Require: initial policy parameters $\theta$ and value function parameters $\phi$
while not converged do
    for each training iteration do
        for each environment step $t$ do
            observe the current state $s_t$
            sample an action $a_t \sim \pi_\theta(\cdot \mid s_t)$
            execute $a_t$; receive the reward $r_t$ and the next state $s_{t+1}$
            store the transition $(s_t, a_t, r_t, s_{t+1})$ in the buffer
        end for
        compute the advantage estimates $\hat{A}_t$ using Generalized Advantage Estimation (GAE)
        for each policy update step do
            update the policy by gradient ascent on the clipped surrogate objective $L^{\mathrm{CLIP}}(\theta)$
            update the value function by minimizing the squared error $\big(V_\phi(s_t) - \hat{R}_t\big)^2$
        end for
    end for
end while
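To make the two update steps concrete, the following is a minimal Python sketch (assuming PyTorch) of the GAE computation and the PPO losses. It is not the implementation described here: the function names `compute_gae` and `ppo_losses`, and the hyperparameter values ($\gamma = 0.99$, $\lambda = 0.95$, clip range $\epsilon = 0.2$) are illustrative assumptions.

```python
# Sketch of the PPO inner loop pieces; assumes PyTorch.
# All hyperparameters below are common defaults, not values from this paper.
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one stored trajectory.

    rewards, dones: tensors of shape [T]; values: shape [T + 1],
    including a bootstrap value for the state after the last step.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future residuals
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # regression targets for V_phi
    return advantages, returns

def ppo_losses(new_log_probs, old_log_probs, advantages,
               value_preds, returns, clip_eps=0.2):
    """Clipped surrogate policy loss and squared-error value loss."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Gradient ascent on L^CLIP == gradient descent on its negative
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = (value_preds - returns).pow(2).mean()
    return policy_loss, value_loss
```

In a training loop, one would typically minimize `policy_loss + c * value_loss` (with a small coefficient such as $c = 0.5$) using a gradient-based optimizer over several epochs of minibatches drawn from the buffer, recomputing `new_log_probs` under the current policy at each step.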