Algorithm 2. Proximal Policy Optimization (PPO) with Pre-Trained MN Reward
Input: Pre-training data set L = {(x_i, y_i)}_{i=1}^{k}; load the parameters χ of the pre-trained MN
1: Initialize PPO parameters θ and policy π_θ
2: For each iteration I do
3:  While the robot has not reached the target position do
4:    Initialize the robot position
5:    Get state s_t (including laser scan state o_t and relative position d_t)
6:    Run π_θ to generate action a_t
7:    Add s_t to the queue (s_1, s_2, …, s_{t-1})_I
8:    Input (s_1, s_2, …, s_{t-1})_I into the MN
9:    Get reward r_t from the MN and the environment
10:    Collect {s_t, a_t, r_t}
11:  Add (s_1, s_2, …, s_{t-1})_I and the task result (label) to the MN's data set L
12:  If I is a multiple of 3 then
13:    Update the parameters χ of the MN on L for a few epochs
14:  Compute the estimated advantage A_t
15:  Update θ for K epochs
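
The sketch below illustrates how the steps of Algorithm 2 fit together in a training loop. It is a minimal Python sketch, not the paper's released code: the helper objects env (robot simulator), policy (PPO actor-critic with act, estimate_advantages and ppo_update methods), mn (the pre-trained Matching Network used as a learned reward model, with reward and fine_tune methods), and mn_dataset are hypothetical names assumed for illustration, as are the epoch counts.

MN_UPDATE_EVERY = 3   # step 12: fine-tune the MN every third iteration
K_EPOCHS = 10         # step 15: number of PPO update epochs (value assumed)

def train(env, policy, mn, mn_dataset, num_iterations):
    for i in range(1, num_iterations + 1):          # step 2
        trajectory = []                             # {s_t, a_t, r_t} tuples (step 10)
        state_queue = []                            # (s_1, ..., s_{t-1})_I fed to the MN (step 7)

        env.reset_robot_position()                  # step 4 (shown once per episode)
        state = env.observe()                       # step 5: laser scan o_t + relative position d_t
        reached_target = False
        while not reached_target:                   # step 3
            action = policy.act(state)              # step 6
            state_queue.append(state)               # step 7
            # Steps 8-9: the MN scores the partial state sequence, and the
            # environment contributes its own reward terms (e.g. collision / goal).
            reward = mn.reward(state_queue) + env.step_reward(action)
            trajectory.append((state, action, reward))          # step 10
            state, reached_target = env.step(action)

        # Step 11: store the state sequence and the task outcome (label)
        # as a new labelled example for the MN.
        mn_dataset.append((state_queue, env.task_result()))

        if i % MN_UPDATE_EVERY == 0:                # step 12
            mn.fine_tune(mn_dataset, epochs=2)      # step 13: "few epochs" (value assumed)

        advantages = policy.estimate_advantages(trajectory)            # step 14
        policy.ppo_update(trajectory, advantages, epochs=K_EPOCHS)     # step 15

Fine-tuning the MN only every third iteration presumably keeps the learned reward roughly stationary across each PPO update window, so the policy is not chasing a reward signal that changes at every step.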