|
Algorithm 3 PPO |
for each iteration do
for T time steps do
Run previous policy and retrieve (state, action, reward)
end for
Compute advantage estimates $\hat{A}_1, \ldots, \hat{A}_T$ of each taken action
Optimize the objective $L$ with respect to $\theta$ for $K$ epochs with minibatch size $M$
Update policy parameters: $\theta_{\text{old}} \leftarrow \theta$
end for
|
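The loop in Algorithm 3 can be illustrated with a small NumPy sketch. It is not the implementation behind the algorithm box, only a toy instance under several assumptions that do not come from the text: the objective L is taken to be the clipped surrogate objective, the environment is a stateless two-armed bandit, the policy is a softmax over two logits, and the advantage estimate is simply the reward minus the batch mean. The names `softmax`, `clipped_surrogate_grad`, `ppo`, and the hyperparameters `eps`, `lr`, `M` are hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()


def clipped_surrogate_grad(theta, theta_old, actions, adv, eps=0.2):
    """Gradient (w.r.t. theta) of the mean clipped surrogate objective.

    Assumes L = mean(min(r * A, clip(r, 1 - eps, 1 + eps) * A)),
    where r is the probability ratio pi_theta(a) / pi_theta_old(a).
    """
    pi = softmax(theta)
    pi_old = softmax(theta_old)
    ratio = pi[actions] / pi_old[actions]
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The gradient is nonzero only where the unclipped term is the minimum;
    # otherwise the clipped term is selected and is constant in theta.
    active = unclipped <= clipped
    # For a softmax policy, grad log pi(a) = onehot(a) - pi.
    grad_logp = np.eye(len(theta))[actions] - pi
    weights = np.where(active, ratio * adv, 0.0)
    return (weights[:, None] * grad_logp).mean(axis=0)


def ppo(n_iters=50, T=64, K=4, M=16, lr=0.5, seed=0):
    """Toy PPO loop on a two-armed bandit (assumed setup, arm 1 pays more)."""
    rng = np.random.default_rng(seed)
    mean_reward = np.array([0.0, 1.0])
    theta = np.zeros(2)
    for _ in range(n_iters):
        # Update policy parameters: theta_old <- theta
        # (done at the start of each iteration, equivalent to the end of the last).
        theta_old = theta.copy()
        # Run the previous policy for T time steps and record (action, reward).
        actions = rng.choice(2, size=T, p=softmax(theta_old))
        rewards = mean_reward[actions] + 0.1 * rng.standard_normal(T)
        # Advantage estimates: reward minus a mean baseline (an assumption).
        adv = rewards - rewards.mean()
        # Optimize L with respect to theta for K epochs with minibatch size M.
        for _ in range(K):
            perm = rng.permutation(T)
            for start in range(0, T, M):
                idx = perm[start:start + M]
                grad = clipped_surrogate_grad(theta, theta_old, actions[idx], adv[idx])
                theta = theta + lr * grad  # gradient ascent on L
    return theta


if __name__ == "__main__":
    print(softmax(ppo()))
```

Running the rollout under a frozen copy `theta_old` while `theta` is updated is what makes the probability ratio meaningful. Once the ratio leaves the interval [1 - eps, 1 + eps] in the direction favored by the advantage, the gradient for that sample vanishes, which limits how far a single iteration can move the policy.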