Algorithm 2. Proximal Policy Optimization (PPO) with Pre-Trained MN Reward
Input: pre-collected data set; load the parameters of the pre-trained MN
1: Initialize the PPO parameters θ and the policy π_θ
2: For each iteration i do
3:   While the robot does not reach the target position do
4:     Initialize the robot position
5:     Get the state s_t (including the laser-scan state and the relative position)
6:     Run π_θ to generate the action a_t
7:     Add s_t to the state queue
8:     Input the state queue into the MN
9:     Get the reward r_t from the MN and the environment
10:    Collect (s_t, a_t, r_t)
11:  Add the trajectory and the task result (label) to the MN's data set
12:  If i is a multiple of 3 then
13:    Update the parameters of the MN on its data set with a few epochs
14:  Compute the estimated advantage Â_t
15:  Update θ with K epochs
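To make the loop concrete, the following is a minimal PyTorch sketch of Algorithm 2. All interfaces are assumptions for illustration, not the paper's implementation: the Gym-style environment (env.reset/env.step), the MN wrapper (mn.reward, mn.add_episode, mn.update), the "reached_target" label flag, the state dimension (362, assuming a 360-beam laser scan plus a 2-D relative position), and the discounted Monte-Carlo return used as the advantage on line 14, since the algorithm does not name an estimator. The update on line 15 uses the standard PPO clipped surrogate L(θ) = E_t[min(ρ_t Â_t, clip(ρ_t, 1−ε, 1+ε) Â_t)] with ρ_t = π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t).

```python
# Minimal sketch of Algorithm 2. The env and MN APIs, the state dimension,
# and the Monte-Carlo advantage are assumptions, not the paper's code.
import torch
import torch.nn as nn
from collections import deque

class Policy(nn.Module):
    """Gaussian policy over the laser-scan + relative-position state."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, state):
        return torch.distributions.Normal(self.net(state), self.log_std.exp())

def train(env, mn, iterations, state_dim=362, action_dim=2,
          gamma=0.99, clip=0.2, K=10, queue_len=4):
    policy = Policy(state_dim, action_dim)                 # line 1
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    for it in range(iterations):                           # line 2
        states, actions, rewards, logps = [], [], [], []
        queue = deque(maxlen=queue_len)                    # recent-state queue
        s, done = env.reset(), False                       # line 4
        while not done:                                    # line 3
            s_t = torch.as_tensor(s, dtype=torch.float32)  # line 5
            d = policy.dist(s_t)                           # line 6: run pi_theta
            a = d.sample()
            queue.append(s_t)                              # line 7
            s, env_r, done, info = env.step(a.numpy())
            r = env_r + mn.reward(list(queue))             # lines 8-9: MN reward
            states.append(s_t); actions.append(a)          # line 10
            rewards.append(r); logps.append(d.log_prob(a).sum().detach())
        # line 11: store trajectory with the task outcome as its label
        # ("reached_target" is a hypothetical info flag)
        mn.add_episode(states, label=info.get("reached_target", False))
        if it % 3 == 0:                                    # line 12
            mn.update(epochs=2)                            # line 13: few epochs
        # line 14: discounted return as a stand-in advantage estimate
        adv, ret = [], 0.0
        for r in reversed(rewards):
            ret = r + gamma * ret
            adv.insert(0, ret)
        adv = torch.tensor(adv)
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)      # normalize
        S, A = torch.stack(states), torch.stack(actions)
        old_logp = torch.stack(logps)
        for _ in range(K):                                 # line 15: K epochs
            logp = policy.dist(S).log_prob(A).sum(-1)
            ratio = (logp - old_logp).exp()                # pi_theta / pi_old
            loss = -torch.min(ratio * adv,                 # clipped surrogate
                              ratio.clamp(1 - clip, 1 + clip) * adv).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return policy
```

In practice the advantage on line 14 would typically come from a learned value function (e.g., GAE); the discounted Monte-Carlo return above merely keeps the sketch self-contained.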