| Algorithm 1: PPO, Actor–Critic Style [15] |
| for iteration = 1, 2, ... do |
| for actor = 1, 2, ..., N do |
| Interact with the environment using the policy |
| Feed the visited states to the critic network to compute state-value (baseline) estimates |
| Compute advantage estimates |
| end for |
| Optimize the surrogate objective L with respect to the policy parameters $\theta$ for K epochs |
| Update the policy |
| end for |
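
The loop in Algorithm 1 can be sketched on a toy problem. The following is a minimal illustration under stated assumptions, not the implementation used in [15] or in this work: the environment is a hypothetical one-step task with two actions (action 1 pays reward 1, action 0 pays 0), the actor is a two-logit softmax policy, the critic is a single scalar baseline, the advantage is taken as reward minus baseline, and the surrogate is assumed to be the clipped PPO objective with a hand-derived gradient. All names (`ppo_toy`, `softmax`, `eps`, `lr`) are illustrative.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def ppo_toy(iterations=50, N=4, T=32, K=4, eps=0.2, lr=0.5, seed=0):
    """Toy actor-critic PPO loop (assumed clipped surrogate, scalar critic)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)  # actor parameters: one logit per action
    v = 0.0              # critic: a single baseline value (toy assumption)

    for _ in range(iterations):
        theta_old = theta.copy()
        pi_old = softmax(theta_old)

        actions, rewards, advs = [], [], []
        for _actor in range(N):
            # Interact with the environment using the (old) policy.
            a = rng.choice(2, size=T, p=pi_old)
            r = (a == 1).astype(float)
            # Critic baseline, then advantage = reward - baseline.
            actions.append(a)
            rewards.append(r)
            advs.append(r - v)
        actions = np.concatenate(actions)
        rewards = np.concatenate(rewards)
        advs = np.concatenate(advs)

        # Optimize the clipped surrogate w.r.t. theta for K epochs.
        for _epoch in range(K):
            pi = softmax(theta)
            ratio = pi[actions] / pi_old[actions]
            clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
            # Gradient flows only where the unclipped term is the minimum.
            active = ratio * advs <= clipped * advs
            grad_logp = np.eye(2)[actions] - pi  # d log pi(a) / d theta
            grad = ((active * advs * ratio)[:, None] * grad_logp).mean(axis=0)
            theta = theta + lr * grad  # gradient ascent on the surrogate

        # Move the critic toward the observed mean reward.
        v += 0.5 * (rewards.mean() - v)

    return softmax(theta), v


if __name__ == "__main__":
    probs, baseline = ppo_toy()
    print("action probabilities:", probs, "baseline:", baseline)
```

In this sketch the clipping mask stops further updates on a sample once its probability ratio has left the interval [1 - eps, 1 + eps] in the direction that would increase the objective, which is what keeps each iteration's policy change small. A practical implementation would replace the scalar baseline with a critic network and the analytic gradient with automatic differentiation.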