1: Initialize experience replay memory D to capacity J
2: Initialize action-value function Q with random weights θ
3: Initialize target action-value function Qtr with weights θ′ = θ
4: Set learning rate α = 0.001 and discount rate γ = 0.99
5: Initialize ϵ = 0.99 for ϵ-greedy action selection
6: for each episode = 1 to MaxEpisode do
7: Observe current state s = {w, o}
8: for each situation = 1 to P do
9: Calculate Q using a perceptron network with two hidden layers, Q = net(s, a, θ)
10: With probability ϵ select a random action a,
11: otherwise select a = argmaxa Q
12: Observe s′ and choose a′ with the maximum Q-value, a′ = argmaxa′ Q(s′, a′, θ)
13: Calculate reward r as described in Algorithm 1
14: Store transition (s, a, s′, a′, r) in D
15: Sample a minibatch of transitions (sj, aj, sj+1, aj+1, rj) from D
16: Set yj = rj + γ Qtr(sj+1, aj+1, θ′)
17: Perform a gradient descent step on (yj − Q(sj, aj, θ))² with respect to the network parameters θ
18: After every C steps, reset the Qtr network to Q by setting θ′ = θ
19: Calculate the Q-value for state s and chosen action a with Qup = net(s, a, θ)
20: Construct the Q-matrix for word-object pairs, QW:
21: for each object jj = 1 to M do
22: if label(o) = jj then
23: count ← count + 1
24: Update QW(w, jj) using Qup
25: end if
26: end for
27: Set s ← s′
28: end for
29: Decrease ϵ linearly
30: end for
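The following is a minimal PyTorch sketch of this training loop, given only as an illustration: the one-hot word/object state encoding, the network sizes, the action space (one action per candidate object), the placeholder reward standing in for Algorithm 1, and the running-average update of QW are all assumptions made for the example, not the paper's implementation.

# Minimal sketch of the training loop above (assumed setup: one-hot word/object states,
# a placeholder reward in place of Algorithm 1, and a running-average update for QW).
import random
from collections import deque

import torch
import torch.nn as nn

N_WORDS, N_OBJECTS = 10, 5                 # hypothetical vocabulary and object counts
N_ACTIONS = N_OBJECTS                      # hypothetical: one action per candidate object
STATE_DIM = N_WORDS + N_OBJECTS            # s = {w, o} as concatenated one-hot vectors

def make_net():
    # Perceptron with two hidden layers; outputs Q(s, a, θ) for every action at once
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

def sample_state():
    # Placeholder for observing s = {w, o}: a random word-object situation
    w, o = random.randrange(N_WORDS), random.randrange(N_OBJECTS)
    s = torch.zeros(STATE_DIM)
    s[w], s[N_WORDS + o] = 1.0, 1.0
    return w, o, s

def reward_fn(o, a):
    # Placeholder for the reward computed by Algorithm 1
    return 1.0 if a == o else -0.1

q_net = make_net()                                          # Q with weights θ
target_net = make_net()                                     # Qtr with weights θ′
target_net.load_state_dict(q_net.state_dict())              # θ′ = θ
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.001)   # α = 0.001
gamma, eps = 0.99, 0.99                                     # γ and initial ϵ
D = deque(maxlen=10_000)                                    # replay memory D with capacity J
QW = torch.zeros(N_WORDS, N_OBJECTS)                        # Q-matrix for word-object pairs
counts = torch.zeros(N_WORDS, N_OBJECTS)

MAX_EPISODES, P, BATCH, C = 50, 20, 32, 100                 # illustrative loop sizes
step = 0
for episode in range(MAX_EPISODES):
    w, o, s = sample_state()
    for situation in range(P):
        # ϵ-greedy selection over Q = net(s, a, θ)
        with torch.no_grad():
            a = random.randrange(N_ACTIONS) if random.random() < eps else int(q_net(s).argmax())
        w2, o2, s2 = sample_state()                  # observe s′
        with torch.no_grad():
            a2 = int(q_net(s2).argmax())             # a′ with the maximum Q-value
        r = reward_fn(o, a)
        D.append((s, a, s2, a2, r))                  # store (s, a, s′, a′, r)

        if len(D) >= BATCH:
            sj, aj, sj1, aj1, rj = zip(*random.sample(D, BATCH))
            sj, sj1 = torch.stack(sj), torch.stack(sj1)
            aj, aj1, rj = torch.tensor(aj), torch.tensor(aj1), torch.tensor(rj)
            with torch.no_grad():                    # target yj = rj + γ Qtr(sj+1, aj+1, θ′)
                yj = rj + gamma * target_net(sj1).gather(1, aj1.unsqueeze(1)).squeeze(1)
            qj = q_net(sj).gather(1, aj.unsqueeze(1)).squeeze(1)
            loss = ((yj - qj) ** 2).mean()           # (yj − Q(sj, aj, θ))²
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        step += 1
        if step % C == 0:                            # every C steps, θ′ ← θ
            target_net.load_state_dict(q_net.state_dict())

        # accumulate QW(w, o) from Qup = net(s, a, θ); the running average is an assumption
        with torch.no_grad():
            q_up = float(q_net(s)[a])
        counts[w, o] += 1
        QW[w, o] += (q_up - QW[w, o]) / counts[w, o]

        w, o, s = w2, o2, s2                         # s ← s′
    eps = max(0.05, eps - 0.9 / MAX_EPISODES)        # decrease ϵ linearly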