2022 Nov 22;22(23):9036. doi: 10.3390/s22239036
Algorithm 1 DQN training procedure.
Input:  Total training episodes $\Gamma_{\max}$, experience pool capacity $D$, target network update frequency $F$, training batch size $B$, attenuation coefficient (discount factor) $\gamma$, greedy value $\varepsilon$ and maximum total penalty $R_{\min}$.
Output:  Target neural network $\hat{Q}$ with $\hat{\omega}$.
1: Initialization: set initial episode $\Gamma=0$, initial state $s_t$, total reward of each episode $R=0$, value neural network $Q$ with random weights $\omega$, target network $\hat{Q}$ with $\hat{\omega}=\omega$.
2: while $\Gamma<\Gamma_{\max}$ do
3: while $\|(x,y)-(x_d,y_d)\|>\epsilon$ and $R>R_{\min}$ do
4:  Select an action through $a_t=\arg\max_{a_t} Q(s_t,a_t;\omega)$.
5:  Execute $a_t$ and obtain $r_t$ and $s_{t+1}$.
6:  Store $(s_t,a_t,r_t,s_{t+1})$ into $D$.
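7:  Randomly sample a batch of $B$ transitions from $D$, and compute
$$y=\begin{cases} r_t, & \Gamma=\Gamma_{\max},\\ r_t+\gamma\max_{a_{t+1}}\hat{Q}(s_{t+1},a_{t+1};\hat{\omega}), & \text{else}. \end{cases} \tag{11}$$
8:  Compute the mean square error loss through
$$\mathrm{loss}=\left(y-Q(s_t,a_t;\omega)\right)^2. \tag{12}$$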
9:  Use gradient descent to update $\omega$.
10:  Every $F$ steps, update the target network: $\hat{Q}\leftarrow Q$, i.e., $\hat{\omega}\leftarrow\omega$.
11:  Perform $R=R+r_t$.
12: end while
13: Perform $\Gamma=\Gamma+1$ and $R=0$.
14: end while
15: return target neural network $\hat{Q}$ with $\hat{\omega}$.
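
The loop structure of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the paper's implementation and rests on several assumptions: a small lookup table stands in for the value and target neural networks, the environment is a hypothetical one-dimensional corridor with made-up rewards, Eq. (11) is read in the usual DQN way (the bootstrap term is dropped on a terminal transition), and exploration with probability `epsilon` is added because a greedy value appears among the inputs even though step 4 is written as a pure argmax. All function names and default values are illustrative.

```python
import random

def td_target(reward, next_q_hat, terminal, gamma=0.9):
    """Eq. (11), assumed reading: y = r on a terminal transition,
    otherwise y = r + gamma * max_a Q_hat(s', a)."""
    return reward if terminal else reward + gamma * max(next_q_hat)

def update(q, q_hat, batch, lr=0.5, gamma=0.9):
    """Steps 7-9 on one batch; returns the mean of the Eq. (12) losses."""
    total = 0.0
    for s, a, r, s_next, terminal in batch:
        y = td_target(r, q_hat[s_next], terminal, gamma)
        err = y - q[s][a]
        total += err ** 2
        # Descent step on (y - Q)^2 for a table entry; the factor 2 is folded into lr.
        q[s][a] += lr * err
    return total / len(batch)

def train(n_states=5, episodes=50, capacity=200, sync_every=10,
          batch_size=4, epsilon=0.1, r_min=-3.0, seed=0):
    """Hypothetical corridor task: start at state 0, goal at the last state.
    Actions: 0 = left, 1 = right. Rewards (+1 at goal, -0.1 per step) are made up."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]   # table standing in for Q(.;omega)
    q_hat = [row[:] for row in q]               # target copy, omega_hat = omega
    pool, steps, goal = [], 0, n_states - 1
    for _ in range(episodes):                   # step 2: loop over episodes
        s, total = 0, 0.0
        while s != goal and total >= r_min:     # step 3: not at goal, penalty budget left
            if rng.random() < epsilon:          # assumed exploration
                a = rng.randrange(2)
            else:                               # step 4: greedy action
                a = max((0, 1), key=lambda act: q[s][act])
            s_next = min(goal, s + 1) if a == 1 else max(0, s - 1)   # step 5
            terminal = s_next == goal
            r = 1.0 if terminal else -0.1
            pool.append((s, a, r, s_next, terminal))                 # step 6
            del pool[:-capacity]                # keep only the newest transitions
            batch = rng.sample(pool, min(batch_size, len(pool)))     # step 7
            update(q, q_hat, batch)             # steps 7-9
            steps += 1
            if steps % sync_every == 0:         # step 10: copy Q into Q_hat
                q_hat = [row[:] for row in q]
            total += r                          # step 11
            s = s_next
    return q_hat

if __name__ == "__main__":
    for state, row in enumerate(train()):
        print(state, row)
```

Keeping `q_hat` fixed between synchronizations is the point of step 10: the targets in Eq. (11) do not move while `q` is being updated, which is what the separate target network provides in the full neural-network version.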