| Algorithm 2. Proximal Policy Optimization (PPO) with Pre-Trained MN Reward |
| Input: Pre-collected data set; load the parameters of the pre-trained MN |
| 1: Initialize the PPO parameters and policy |
| 2: For each iteration do |
| 3: While the robot has not reached the target position do |
| 4: Initialize the robot position |
| 5: Get the state (including the laser scan state and the relative position) |
| 6: Run the policy to generate an action |
| 7: Add to the queue |
| 8: Input the queue into MN |
| 9: Get the reward from MN and the environment |
| 10: Collect the state, action, and reward |
| 11: Add the collected data and task result (Label) to the MN’s data set |
| 12: If the iteration number is a multiple of 3 then |
| 13: Update the parameters of MN on its data set for a few epochs |
| 14: Compute the estimated advantage |
| 15: Update the policy parameters for K epochs |
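
The advantage estimate, the policy update objective, and the MN update schedule in the algorithm above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes generalized advantage estimation (GAE) and the standard clipped surrogate objective from PPO, neither of which the algorithm box specifies. The function names (`gae_advantages`, `clipped_surrogate`, `mn_update_due`) and the default values of `gamma`, `lam`, and `eps` are hypothetical. The rewards passed in are assumed to already combine the MN and environment terms, since how the two are combined is not stated.

```python
import numpy as np


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage estimates for one episode (assumption: GAE is used)."""
    rewards = np.asarray(rewards, dtype=float)
    # Append the bootstrap value of the state that follows the last step.
    values = np.append(np.asarray(values, dtype=float), last_value)
    adv = np.zeros_like(rewards)
    running = 0.0
    # Accumulate discounted TD residuals from the last step backwards.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def clipped_surrogate(ratio, adv, eps=0.2):
    """Standard PPO clipped objective (to be maximized over K epochs)."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))


def mn_update_due(iteration, period=3):
    """True when the counter is a multiple of `period` (3 in the algorithm)."""
    return iteration % period == 0
```

For example, with `gamma=1.0`, `lam=1.0`, rewards `[1, 1, 1]`, and all value estimates equal to zero, `gae_advantages` returns `[3, 2, 1]`, which is the undiscounted return-to-go at each step.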