|
Algorithm 1 PPO.
Input: initial actor-network parameters and initial critic-network parameters.
for k = 1, 2, ... do
    Run the current policy in the environment for T timesteps.
    Compute the rewards-to-go.
    Compute advantage estimates based on the current critic network.
    Update the actor network by maximizing the clipped objective function, Equation (22), via gradient ascent with Adam [47].
    Update the critic network via gradient descent.
end for
|
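The quantities computed inside one iteration of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the discount factor `gamma`, the clipping range `eps`, and all function names are hypothetical; the advantage is taken as rewards-to-go minus the critic's predictions, which may differ from the estimator used here; the clipped objective follows the standard PPO form and is assumed to match Equation (22); and the critic loss is assumed to be a mean-squared error against the rewards-to-go.

```python
import numpy as np


def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards at each timestep of one trajectory.

    gamma is an assumed discount factor.
    """
    rtg = np.zeros(len(rewards), dtype=float)
    running = 0.0
    # Accumulate backwards so each entry includes all later rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def advantage_estimates(rtg, values):
    """Assumed simplest estimator: rewards-to-go minus critic predictions."""
    return rtg - values


def clipped_objective(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate, averaged over timesteps (maximized).

    ratio is the new-to-old policy probability ratio; eps is an assumed
    clipping range.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))


def critic_loss(values, rtg):
    """Assumed critic objective: mean-squared error to rewards-to-go (minimized)."""
    return np.mean((values - rtg) ** 2)


# Toy data for a single three-step trajectory (illustrative values only).
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([1.0, 1.0, 1.0])   # critic predictions
ratio = np.array([1.5, 1.0, 0.5])    # policy probability ratios

rtg = rewards_to_go(rewards, gamma=0.5)
adv = advantage_estimates(rtg, values)
print(rtg, adv, clipped_objective(ratio, adv), critic_loss(values, rtg))
```

In a full training loop, the actor parameters would be moved along the gradient of `clipped_objective` (for example with Adam) and the critic parameters against the gradient of `critic_loss`; those gradient steps require an automatic-differentiation library and are omitted from this sketch.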