Algorithm 1: MDRLAT
1: Initialize the experience replay buffer $D$ with capacity $N$, the parameters $\theta$ of the online network, the parameters $\theta^{-} = \theta$ of the target network, and the frequency $N_t$ at which the target network is updated
2: for $t = 1, 2, \dots, T$ do
3:   Obtain $s_k = (l_k, d_k)$ from the environment; with probability $\varepsilon$ select a random action $a_k$, otherwise select $a_k = \arg\max_{a} Q(s_k, a; \theta)$
4:   Execute action $a_k$, transition to the next state $s_{k+1}$, and receive a reward $r_k$
5:   Store the transition $(s_k, a_k, r_k, s_{k+1})$ in $D$
6:   Randomly sample a batch of $N_B$ transitions $(s_t, a_t, r_t, s_{t+1})$ from $D$
7:   if the episode terminates at step $t+1$ then
8:    Set $y_t = r_t$
9:   else
10:    Set $y_t = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta);\, \theta^{-}\big)$
11:   end if
12:   Compute the loss $L_{DRL} = \big(y_t - Q(s_t, a_t; \theta)\big)^2$
13:   Feed $s_t$ into the online network to obtain the estimated velocities $\tilde{v}$, $\tilde{\omega}$
14:   Compute the loss $L_{AT} = \sum_{i=t-2}^{t} \big[ (\bar{v}_i - \tilde{v}_i)^2 + (\bar{\omega}_i - \tilde{\omega}_i)^2 \big]$
15:   Update the parameters $\theta$ of the online network
16:   if $t \bmod N_t = 0$ then
17:    Update the parameters of the target network: $\theta^{-} \leftarrow \theta$
18:   end if
19: end for
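The listing combines a Double DQN update (step 10 selects the greedy action with the online network $\theta$ and evaluates it with the target network $\theta^{-}$) with an auxiliary velocity-estimation loss $L_{AT}$. The following is a minimal PyTorch sketch of one such update step; the network architecture, state and action dimensions, equal weighting of $L_{DRL}$ and $L_{AT}$, and computing the auxiliary loss over the sampled batch rather than over steps $t-2$ to $t$ are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of an MDRLAT-style update: Double DQN target plus an
# auxiliary velocity-estimation head. Names and sizes are assumptions.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class QNet(nn.Module):
    """Online/target network: outputs Q-values and estimated (v, w)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)   # Q(s, a; theta)
        self.vel_head = nn.Linear(hidden, 2)         # estimated (v~, w~)

    def forward(self, s):
        h = self.body(s)
        return self.q_head(h), self.vel_head(h)


def select_action(net, state, epsilon, n_actions):
    """Epsilon-greedy action selection (step 3 of the listing)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q, _ = net(state.unsqueeze(0))
    return int(q.argmax(dim=1).item())


def update(online, target, optimizer, batch, gamma=0.99):
    """One gradient step on L_DRL + L_AT (steps 7-15 of the listing)."""
    s, a, r, s_next, done, v_ref = batch  # v_ref: reference (v, w) labels

    # Double DQN target: online net picks the argmax action,
    # target net evaluates it; terminal transitions use y = r.
    with torch.no_grad():
        q_next_online, _ = online(s_next)
        a_star = q_next_online.argmax(dim=1, keepdim=True)
        q_next_target, _ = target(s_next)
        y = r + gamma * (1.0 - done) * q_next_target.gather(1, a_star).squeeze(1)

    q, vel = online(s)
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)

    loss_drl = F.mse_loss(q_sa, y)      # (y_t - Q(s_t, a_t; theta))^2
    loss_at = F.mse_loss(vel, v_ref)    # auxiliary velocity-estimation loss
    loss = loss_drl + loss_at           # equal weighting assumed here

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Matching steps 16–17, the target network would then be synchronized every $N_t$ steps, e.g. with `target.load_state_dict(online.state_dict())`; in practice the two losses may also carry a weighting factor rather than the equal sum assumed above.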