Algorithm 1 Proximal policy optimization (PPO) for trajectory planning

Require: Start and end angles; hyperparameters of the actor network $\pi_\theta$ and the critic network $V_\phi$; learning rate $\alpha$; discount factor $\gamma$; generalized advantage estimation (GAE) parameter $\lambda$; clipping parameter $\epsilon$; entropy coefficient $c$; number of iterations $N$; number of epochs $E$; batch size $B$; episode length $T$; accuracy setting.

1:  for iteration $= 1, 2, \ldots, N$ do
2:      Collect $T$ time steps of transitions using the running policy $\pi_{\theta_{\text{old}}}$
3:      if … then
4:          …
5:      end if
6:      Compute advantage estimates $\hat{A}_t$ using GAE
7:      for epoch $= 1, 2, \ldots, E$ do
8:          Randomly sample a batch of $B$ time steps
9:          Update the critic network parameters $\phi$ by minimizing the loss
10:             $L_V(\phi) = \frac{1}{B} \sum_{t} \bigl( V_\phi(s_t) - \hat{R}_t \bigr)^2$, where $\hat{R}_t$ is the return target
11:         Update the actor network parameters $\theta$ by maximizing the PPO-clip objective with the entropy bonus
12:             $r_t(\theta) = \pi_\theta(a_t \mid s_t) \,/\, \pi_{\theta_{\text{old}}}(a_t \mid s_t)$
13:             $L_\pi(\theta) = \frac{1}{B} \sum_{t} \Bigl[ \min\bigl( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t \bigr) + c\,\mathcal{H}\bigl[\pi_\theta(\cdot \mid s_t)\bigr] \Bigr]$
14:     end for
15: end for
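For concreteness, the following is a minimal PyTorch sketch of the advantage estimation (step 6) and one minibatch update (steps 8-13). The function and variable names (`gae_advantages`, `ppo_update`, `actor`, `critic`), the Gaussian policy head, and the default hyperparameter values are illustrative assumptions and are not taken from Algorithm 1.

```python
# Minimal PPO inner-loop sketch: GAE advantages, critic MSE loss,
# and the clipped surrogate objective with an entropy bonus.
# Network architectures and hyperparameter values are assumptions.
import torch


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one collected trajectory (step 6)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values                              # critic targets R_t
    return advantages, returns


def ppo_update(actor, critic, actor_opt, critic_opt, batch,
               clip_eps=0.2, entropy_coef=0.01):
    """One minibatch update of the critic (steps 9-10) and actor (steps 11-13)."""
    states, actions, old_log_probs, advantages, returns = batch

    # Critic: minimize the squared error between V_phi(s_t) and the return target.
    values = critic(states).squeeze(-1)
    critic_loss = ((values - returns) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize the clipped surrogate plus an entropy bonus
    # (implemented by minimizing its negative). The actor is assumed to
    # return the mean and log-std of a Gaussian action distribution.
    mean, log_std = actor(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)               # r_t(theta)
    # Advantage normalization is a common practical addition, not part of Algorithm 1.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    entropy = dist.entropy().sum(-1)
    actor_loss = -(surrogate + entropy_coef * entropy).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```

In this sketch the clipping term bounds how far the updated policy can move from $\pi_{\theta_{\text{old}}}$ on each minibatch, while the entropy bonus discourages premature collapse to a deterministic policy; both correspond directly to the clipping parameter $\epsilon$ and entropy coefficient $c$ listed in the Require line.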