Front Psychol. 2018 Jan 30;9:5. doi: 10.3389/fpsyg.2018.00005

Algorithm 6. Double deep Q-learning for optimal control

1:  Initialize experience replay memory D to capacity J
2:  Initialize action-value function Q with random weights θ
3:  Initialize target action-value function Qtr with weights θp = θ
4:  Set learning rate α = 0.001 and discount rate γ = 0.99
5:  Initialize ϵ = 0.99 for ϵ-greedy action selection.
6:  for each episode = 1 to Maxepisode do
7:      Observe current state s = {w, o}
8:      for each situation = 1 to P do
9:          Calculate Q using a perceptron network with two hidden layers, Q = net(s, a, θ)
10:         With probability ϵ select a random action a,
11:         otherwise select a = argmax_a Q
12:         Observe s′ and choose a′ based on the maximum Q̂-value, Q̂ = net(s′, a, θp)
13:         Calculate reward r as described in Algorithm 1
14:         Store transition (s, a, s′, a′, r) in D.
15:         Sample a minibatch of transitions (sj, aj, sj+1, aj+1, rj) from D
16:         Set Qtr = rj + γ max_a net(sj+1, a, θp)
17:         Perform a gradient descent step on (Qtr - Q(sj, aj, θ))² with respect to the network parameters θ
18:         After every C steps, reset the Qtr network to Q by setting θp = θ
19:         Calculate the Q-value for state s and chosen action a, Qup = net(s, a, θ)
20:         Construct Q-matrix for word-object pairs, QW:
21:         for each object jj = 1 to M do
22:               if label(o) = jj then
23:                  count ← count + 1
24:                  QWjj ← QWjj + (Qup - QWjj)/count
25:               end if
26:         end for
27:         Set s ← s′
28:      end for
29:      Decrease ϵ linearly.
30:  end for
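
The sketch below is a minimal PyTorch rendering of the loop above, included only to make the steps concrete. Everything environment-specific is an assumption for illustration: env_reset and env_step stand in for observing s = {w, o}, for the reward of Algorithm 1, and for the object label; STATE_DIM, N_ACTIONS, M, the hidden-layer width of 64, the minibatch size of 32, and the ϵ floor of 0.01 are placeholder values not taken from the paper. Step 16 is read as a max over the target network's outputs at sj+1, matching the reconstruction above.

```python
# Minimal sketch of Algorithm 6; environment and sizes are illustrative assumptions.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, M = 8, 4, 10            # assumed state size, action count, number of objects
J, C, P, MAX_EPISODE = 10_000, 100, 50, 200   # replay capacity, target-sync period, situations, episodes
ALPHA, GAMMA = 1e-3, 0.99                     # learning rate and discount rate (step 4)
BATCH = 32                                    # assumed minibatch size

def env_reset():
    """Hypothetical stand-in for observing the current state s = {w, o}."""
    return [random.random() for _ in range(STATE_DIM)]

def env_step(state, action):
    """Hypothetical stand-in returning (next state, reward per Algorithm 1, object label)."""
    return [random.random() for _ in range(STATE_DIM)], random.random(), random.randrange(M)

def make_net():
    # "Perceptron network with two hidden layers"; outputs Q(s, a, θ) for every action at once.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

q_net = make_net()                                   # online network, weights θ
target_net = make_net()                              # target network, weights θp
target_net.load_state_dict(q_net.state_dict())       # θp = θ (step 3)
optimizer = torch.optim.SGD(q_net.parameters(), lr=ALPHA)

D = deque(maxlen=J)                                  # experience replay memory (step 1)
QW, count = [0.0] * M, [0] * M                       # Q-matrix for word-object pairs (step 20)
eps, step = 0.99, 0

for episode in range(MAX_EPISODE):
    s = env_reset()
    for _ in range(P):
        # ϵ-greedy action selection (steps 10-11)
        if random.random() < eps:
            a = random.randrange(N_ACTIONS)
        else:
            with torch.no_grad():
                a = int(q_net(torch.tensor(s)).argmax())

        s_next, r, obj_label = env_step(s, a)
        with torch.no_grad():                        # a′ from the target network (step 12)
            a_next = int(target_net(torch.tensor(s_next)).argmax())
        D.append((s, a, s_next, a_next, r))          # step 14 (a_next is stored but the
                                                     # max-based target below does not need it)

        # One gradient step on (Qtr - Q(sj, aj, θ))² over a sampled minibatch (steps 15-17)
        batch = random.sample(list(D), min(BATCH, len(D)))
        sj, aj, sj1, aj1, rj = zip(*batch)
        sj, sj1 = torch.tensor(sj), torch.tensor(sj1)
        aj = torch.tensor(aj)
        rj = torch.tensor(rj, dtype=torch.float32)
        with torch.no_grad():
            q_tr = rj + GAMMA * target_net(sj1).max(dim=1).values   # step 16
        q_sa = q_net(sj).gather(1, aj.unsqueeze(1)).squeeze(1)
        loss = ((q_tr - q_sa) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        step += 1
        if step % C == 0:                            # step 18: θp = θ every C steps
            target_net.load_state_dict(q_net.state_dict())

        # Running-average Q-matrix over word-object pairs (steps 19-26)
        with torch.no_grad():
            q_up = float(q_net(torch.tensor(s))[a])
        count[obj_label] += 1
        QW[obj_label] += (q_up - QW[obj_label]) / count[obj_label]

        s = s_next
    eps = max(0.01, eps - 0.99 / MAX_EPISODE)        # decrease ϵ linearly (step 29)
```

The incremental update in the inner loop is just the running mean QWjj ← QWjj + (Qup - QWjj)/count, so QWjj tracks the average learned Q-value of each word-object pair without storing the full history.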