2022 Sep 15;22(18):7004. doi: 10.3390/s22187004
Algorithm 1: DQN-PTR to solve P3

01: Initialize the Q-Network Q with random weights θ

02: Initialize the Target Q-Network $\hat{Q}$ with weights $\theta^{-}=\theta$

03: Initialize replay memory D to capacity N

04: Initialize $\epsilon=0$, $\epsilon_{\text{increment}}=0.0001$, $\epsilon_{\max}=0.9999$

05: Initialize optimal trajectory $O$ to empty and maximum total return $R^{*}=0$

06: For episode=1,M do

07:  Initialize sum reward R=0

08:  Initialize state $s_0$, $\mu=\text{rand}[0,1]$

09:  For each step t do

10:   If episode $\le 100$ or $\mu>0.9$ then

11:    If $\text{rand}[0,1]\ge\epsilon$ then

12:     Select a random action $a_t$

13:    else

14:     Set $a_t=\arg\max_{a}Q(s_t,a;\theta)$

15:    end if

16:    Execute action $a_t$, observe next state $s_{t+1}$ and reward $r_t$ according to environment 1

17:   else

18:    Set $a_t$ according to $O[t]$

19:    Execute action $a_t$, observe next state $s_{t+1}$ and reward $r_t$ according to environment 2

20:   end if

21:   $R=R+r_t$

22:   Store transition $(s_t,a_t,r_t,s_{t+1})$ in $D$

23:   If episode terminates at step t+1 then

24:    If $R>R^{*}$ then

25:     Replace the trajectory in $O$ with $\{a_1,\dots,a_t\}$

26:     $R^{*}=R$

27:    else if $R=R^{*}$ and $t<\text{len}(O)$ then

28:     Replace the trajectory in $O$ with $\{a_1,\dots,a_t\}$

29:    end if

30:    break

31:   end if

32:   Sample random mini-batch of transitions $(s_j,a_j,r_j,s_{j+1})$ from $D$

33:   Set $y_j=\begin{cases} r_j, & \text{if episode terminates at step } j+1,\\ r_j+\gamma\max_{a}\hat{Q}(s_{j+1},a;\theta^{-}), & \text{otherwise}\end{cases}$

34:   Perform a gradient descent step on $(y_j-Q(s_j,a_j;\theta))^2$ with respect to the network parameters $\theta$

35:   If $\epsilon<\epsilon_{\max}$ then

36:    $\epsilon=\epsilon+\epsilon_{\text{increment}}$

37:   end if

38:   Every $C$ steps reset $\hat{Q}=Q$

39:  end for

40: end for
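
The decision rules in Algorithm 1 can be illustrated with a short Python sketch. This is not the authors' implementation: the Q-networks, the replay memory, and the two environments are omitted, and all function and parameter names (`explores`, `select_action`, `td_target`, `update_best`, `anneal`) are hypothetical. It assumes the comparison in step 10 reads "episode ≤ 100" and the one in step 11 reads "rand[0,1] ≥ ϵ", so that ϵ acts as the probability of choosing the greedy action and grows from 0 toward ϵmax.

```python
import random
from typing import List, Sequence, Tuple


def explores(episode: int, mu: float, warmup: int = 100, threshold: float = 0.9) -> bool:
    """Step 10 (assumed reading): explore in warm-up episodes or when mu > threshold."""
    return episode <= warmup or mu > threshold


def select_action(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Steps 11-15: epsilon is treated as the probability of acting greedily."""
    if rng.random() >= epsilon:
        return rng.randrange(len(q_values))  # random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # argmax over actions


def td_target(reward: float, next_q_values: Sequence[float], done: bool, gamma: float) -> float:
    """Step 33: bootstrap from target-network values unless the episode ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def update_best(
    best_actions: List[int], best_return: float, actions: List[int], total_return: float
) -> Tuple[List[int], float]:
    """Steps 24-29: keep the highest-return trajectory, breaking ties by shorter length."""
    if total_return > best_return:
        return list(actions), total_return
    if total_return == best_return and len(actions) < len(best_actions):
        return list(actions), best_return
    return best_actions, best_return


def anneal(epsilon: float, increment: float = 0.0001, epsilon_max: float = 0.9999) -> float:
    """Steps 35-37: raise epsilon by a fixed increment until it reaches epsilon_max."""
    return epsilon + increment if epsilon < epsilon_max else epsilon
```

When `explores` returns false, the agent would instead take the stored action `O[t]` (steps 18-19), which is how the best trajectory found so far is replayed into the memory `D`.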