Algorithm 1: MDRLAT
1: Initialize the experience replay buffer $D$ with capacity $N$, the parameters $\theta$ of the online network, the parameters $\theta^{-} = \theta$ of the target network, and the frequency $N_t$ at which the target network is updated
2: for $t = 1, 2, \dots, T$ do
3:   Obtain $s_k = (l_k, d_k)$ from the environment; with probability $\varepsilon$ select a random action $a_k$, otherwise select $a_k = \arg\max_{a} Q(s_k, a; \theta)$
4:   Execute action $a_k$, transition to the next state $s_{k+1}$, and receive a reward $r_k$
5:   Store the transition $(s_k, a_k, r_k, s_{k+1})$ in $D$
6:   Randomly sample a batch of $N_B$ transitions $(s_t, a_t, r_t, s_{t+1})$ from $D$
7:   if the episode terminates at step $t+1$ then
8:    Set $y_t = r_t$
9:   else
10:    Set $y_t = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta);\, \theta^{-}\big)$
11:   end if
12:   Compute the loss $L_{DRL} = \big(y_t - Q(s_t, a_t; \theta)\big)^2$
13:   Feed $s_t$ into the online network to obtain the estimated velocities $\tilde{v}$, $\tilde{\omega}$
14:   Compute the loss $L_{AT} = \sum_{i=t-2}^{t} \big[ (\bar{v}_i - \tilde{v}_i)^2 + (\bar{\omega}_i - \tilde{\omega}_i)^2 \big]$
15:   Update the parameters $\theta$ of the online network
16:   if $t \bmod N_t = 0$ then
17:    Update the parameters of the target network: $\theta^{-} \leftarrow \theta$
18:   end if
19: end for
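The listing combines a Double DQN update (step 10 selects the greedy action with the online network $\theta$ and evaluates it with the target network $\theta^{-}$) with an auxiliary velocity-estimation loss $L_{AT}$. The following is a minimal PyTorch sketch of one such update step; the network architecture, state and action dimensions, equal weighting of $L_{DRL}$ and $L_{AT}$, and computing the auxiliary loss over the sampled batch rather than over steps $t-2$ to $t$ are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of an MDRLAT-style update: Double DQN target plus an
# auxiliary velocity-estimation head. Names and sizes are assumptions.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class QNet(nn.Module):
    """Online/target network: outputs Q-values and estimated (v, w)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)   # Q(s, a; theta)
        self.vel_head = nn.Linear(hidden, 2)         # estimated (v~, w~)

    def forward(self, s):
        h = self.body(s)
        return self.q_head(h), self.vel_head(h)


def select_action(net, state, epsilon, n_actions):
    """Epsilon-greedy action selection (step 3 of the listing)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q, _ = net(state.unsqueeze(0))
    return int(q.argmax(dim=1).item())


def update(online, target, optimizer, batch, gamma=0.99):
    """One gradient step on L_DRL + L_AT (steps 7-15 of the listing)."""
    s, a, r, s_next, done, v_ref = batch  # v_ref: reference (v, w) labels

    # Double DQN target: online net picks the argmax action,
    # target net evaluates it; terminal transitions use y = r.
    with torch.no_grad():
        q_next_online, _ = online(s_next)
        a_star = q_next_online.argmax(dim=1, keepdim=True)
        q_next_target, _ = target(s_next)
        y = r + gamma * (1.0 - done) * q_next_target.gather(1, a_star).squeeze(1)

    q, vel = online(s)
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)

    loss_drl = F.mse_loss(q_sa, y)      # (y_t - Q(s_t, a_t; theta))^2
    loss_at = F.mse_loss(vel, v_ref)    # auxiliary velocity-estimation loss
    loss = loss_drl + loss_at           # equal weighting assumed here

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Matching steps 16–17, the target network would then be synchronized every $N_t$ steps, e.g. with `target.load_state_dict(online.state_dict())`; in practice the two losses may also carry a weighting factor rather than the equal sum assumed above.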