Learning Reward Function with Matching Network for Mapless Navigation

Sensors (Basel). 2020 Jun 30;20(13):3664. doi: 10.3390/s20133664

Algorithm 2. Proximal Policy Optimization (PPO) with Pre-Trained MN Reward

Input: Pre-data set $L' = \{(x_i, y_i)\}_{i=1}^{k}$; load the parameter $\chi$ of the pre-trained MN

1: Initialize PPO parameter $\theta$ and policy $\pi_{\theta}$
2: For each iteration $I$
3:     While the robot does not reach the target position do
4:         Initialize the robot position
5:         Get state $s_t$ (including laser scan state $o_t$ and relative position $d_t$)
6:         Run $\pi_{\theta}$ to generate action $a_t$
7:         Add $s_t$ to the queue $\{(s_1, s_2, \dots, s_{t-1})\}_I$
8:         Input $\{(s_1, s_2, \dots, s_{t-1})\}_I$ into MN
9:         Get reward $r_t$ from MN and environment
10:        Collect $\{s_t, a_t, r_t\}$
11:        Add $\{(s_1, s_2, \dots, s_{t-1})\}_I$ and task result (label) to the MN's data set $L'$
12:    If $I$ is a multiple of 3 then
13:        Update parameter $\chi$ of MN by $L'$ with few epochs
14:    Compute estimated advantage $A_t$
15:    Update $\theta$ with $K$ epochs
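
The control flow of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: `mn_reward` is a hypothetical stand-in for the pre-trained MN, the MN and environment rewards are assumed to be summed, the advantage is assumed to be computed with generalized advantage estimation (the algorithm only says "estimated advantage"), and the toy one-dimensional environment, the step cap `max_steps`, and all function names are invented for the example. The MN and PPO parameter updates are left as placeholders.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]  # assumed layout: laser scan o_t followed by relative distance d_t


def mn_reward(queue: Sequence[State]) -> float:
    """Hypothetical stand-in for the MN: rewards a shrinking distance to the target."""
    if len(queue) < 2:
        return 0.0
    return queue[-2][-1] - queue[-1][-1]


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation (assumed estimator for A_t).

    `values` holds one more entry than `rewards` (the bootstrap value).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def run_episode(policy: Callable[[State], float],
                reset: Callable[[], State],
                step: Callable[[State, float], Tuple[State, float, bool]],
                max_steps: int = 50):
    """Steps 3-11: roll out until the target is reached (or an assumed step cap)."""
    state = reset()
    queue: List[State] = []
    transitions = []
    reached = False
    for _ in range(max_steps):
        action = policy(state)                    # step 6
        queue.append(state)                       # step 7
        next_state, env_r, reached = step(state, action)
        reward = mn_reward(queue) + env_r         # steps 8-9 (sum is an assumption)
        transitions.append((state, action, reward))  # step 10
        state = next_state
        if reached:
            break
    return queue, transitions, reached


def train(iterations: int, policy, reset, step):
    dataset = []      # plays the role of L': (state queue, task-result label) pairs
    mn_updates = 0
    for i in range(1, iterations + 1):
        queue, transitions, reached = run_episode(policy, reset, step)
        dataset.append((queue, int(reached)))     # step 11
        if i % 3 == 0:                            # step 12
            mn_updates += 1                       # placeholder for updating chi (step 13)
        rewards = [r for _, _, r in transitions]
        values = [0.0] * (len(rewards) + 1)       # placeholder critic output
        advantages = gae(rewards, values)         # step 14
        assert len(advantages) == len(rewards)
        # placeholder: K epochs of PPO updates on theta (step 15)
    return dataset, mn_updates


# Toy 1-D corridor: the state is only the remaining distance to the target.
def toy_reset() -> State:
    return (5.0,)


def toy_step(state: State, action: float) -> Tuple[State, float, bool]:
    dist = max(0.0, state[-1] - action)
    reached = dist == 0.0
    return (dist,), (1.0 if reached else 0.0), reached


def toy_policy(state: State) -> float:
    return 1.0  # always move one unit toward the target


if __name__ == "__main__":
    data, updates = train(6, toy_policy, toy_reset, toy_step)
    print(len(data), updates)
```

In this sketch the MN data set grows by one labelled queue per iteration, while the MN itself is refreshed only on every third iteration, mirroring steps 11 to 13.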