Front Psychol. 2018 Jan 30;9:5. doi: 10.3389/fpsyg.2018.00005

Algorithm 6. Double deep Q-learning for optimal control

1:  Initialize experience replay memory D to capacity J
2:  Initialize action-value function Q with random weights θ
3:  Initialize target action-value function Qtr with weights θp = θ
4:  Set learning rate α = 0.001 and discount rate γ = 0.99
5:  Initialize ϵ = 0.99 for ϵ-greedy action selection.
6:  for each episode = 1 to Maxepisode do
7:      Observe current state s = {w, o}
8:      for each situation = 1 to P do
9:          Calculate Q using a perceptron network with two hidden layers, Q = net(s, a, θ)
10:         With probability ϵ select a random action a,
11:         otherwise select a = argmax_a Q
12:         Observe s′ and choose a′ based on the maximum Q̂-value, Q̂ = net(s′, a, θp)
13:         Calculate reward r as described in Algorithm 1
14:         Store transition (s, a, s′, a′, r) in D.
15:         Sample a minibatch of transitions (sj, aj, sj+1, aj+1, rj) from D
16:         Set Qtr = rj + γ max_a net(sj+1, a, θp)
17:         Perform a gradient descent step on (Qtr - Q(sj, aj, θ))² with respect to the network parameters θ
18:         After every C steps, reset the Qtr network to Q by setting θp = θ
19:         Calculate the Q-value for state s and chosen action a, Qup = net(s, a, θ)
20:         Construct Q-matrix for word-object pairs, QW:
21:         for each object jj = 1 to M do
22:               if label(o) = jj then
23:                  count ← count + 1
24:                  QWjj ← QWjj + (Qup - QWjj)/count
25:               end if
26:         end for
27:         Set s ← s′
28:      end for
29:      Decrease ϵ linearly.
30:  end for
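
The sketch below is a minimal PyTorch rendering of the loop above, included only to make the steps concrete. Everything environment-specific is an assumption for illustration: env_reset and env_step stand in for observing s = {w, o}, for the reward of Algorithm 1, and for the object label; STATE_DIM, N_ACTIONS, M, the hidden-layer width of 64, the minibatch size of 32, and the ϵ floor of 0.01 are placeholder values not taken from the paper. Step 16 is read as a max over the target network's outputs at sj+1, matching the reconstruction above.

```python
# Minimal sketch of Algorithm 6; environment and sizes are illustrative assumptions.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, M = 8, 4, 10            # assumed state size, action count, number of objects
J, C, P, MAX_EPISODE = 10_000, 100, 50, 200   # replay capacity, target-sync period, situations, episodes
ALPHA, GAMMA = 1e-3, 0.99                     # learning rate and discount rate (step 4)
BATCH = 32                                    # assumed minibatch size

def env_reset():
    """Hypothetical stand-in for observing the current state s = {w, o}."""
    return [random.random() for _ in range(STATE_DIM)]

def env_step(state, action):
    """Hypothetical stand-in returning (next state, reward per Algorithm 1, object label)."""
    return [random.random() for _ in range(STATE_DIM)], random.random(), random.randrange(M)

def make_net():
    # "Perceptron network with two hidden layers"; outputs Q(s, a, θ) for every action at once.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

q_net = make_net()                                   # online network, weights θ
target_net = make_net()                              # target network, weights θp
target_net.load_state_dict(q_net.state_dict())       # θp = θ (step 3)
optimizer = torch.optim.SGD(q_net.parameters(), lr=ALPHA)

D = deque(maxlen=J)                                  # experience replay memory (step 1)
QW, count = [0.0] * M, [0] * M                       # Q-matrix for word-object pairs (step 20)
eps, step = 0.99, 0

for episode in range(MAX_EPISODE):
    s = env_reset()
    for _ in range(P):
        # ϵ-greedy action selection (steps 10-11)
        if random.random() < eps:
            a = random.randrange(N_ACTIONS)
        else:
            with torch.no_grad():
                a = int(q_net(torch.tensor(s)).argmax())

        s_next, r, obj_label = env_step(s, a)
        with torch.no_grad():                        # a′ from the target network (step 12)
            a_next = int(target_net(torch.tensor(s_next)).argmax())
        D.append((s, a, s_next, a_next, r))          # step 14 (a_next is stored but the
                                                     # max-based target below does not need it)

        # One gradient step on (Qtr - Q(sj, aj, θ))² over a sampled minibatch (steps 15-17)
        batch = random.sample(list(D), min(BATCH, len(D)))
        sj, aj, sj1, aj1, rj = zip(*batch)
        sj, sj1 = torch.tensor(sj), torch.tensor(sj1)
        aj = torch.tensor(aj)
        rj = torch.tensor(rj, dtype=torch.float32)
        with torch.no_grad():
            q_tr = rj + GAMMA * target_net(sj1).max(dim=1).values   # step 16
        q_sa = q_net(sj).gather(1, aj.unsqueeze(1)).squeeze(1)
        loss = ((q_tr - q_sa) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        step += 1
        if step % C == 0:                            # step 18: θp = θ every C steps
            target_net.load_state_dict(q_net.state_dict())

        # Running-average Q-matrix over word-object pairs (steps 19-26)
        with torch.no_grad():
            q_up = float(q_net(torch.tensor(s))[a])
        count[obj_label] += 1
        QW[obj_label] += (q_up - QW[obj_label]) / count[obj_label]

        s = s_next
    eps = max(0.01, eps - 0.99 / MAX_EPISODE)        # decrease ϵ linearly (step 29)
```

The incremental update in the inner loop is just the running mean QWjj ← QWjj + (Qup - QWjj)/count, so QWjj tracks the average learned Q-value of each word-object pair without storing the full history.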