Sensors. 2019 Feb 24;19(4):960. doi: 10.3390/s19040960
Algorithm A1 Double Q-learning with CNN.
1:  Initialize reward function Q with weights θ
2:  Initialize target reward function Q^ with weights θ^
3:  Initialize replay memory D
4:  for episode = 1, M do
5:      Initialize states ϕ1 = (s1, n1, m1)
6:      for t = 1, T do
7:          Select action at using an ϵ-greedy policy based on Q
8:          Execute action at, obtain reward rt and state st+1
9:          Update states ϕt+1 = (st+1, nt+1, mt+1)
10:         Store transition {ϕt, at, rt, ϕt+1} in D
11:         Sample a random minibatch of transitions {ϕj, aj, rj, ϕj+1} from D
12:         Set yj = rj if the episode terminates at step j+1; otherwise yj = rj + γQ^(ϕj+1, argmaxa Q(ϕj+1, a))
13:         Perform a gradient descent step on HuberLoss(yj, Q(ϕj, aj; θ)) with respect to θ
14:         Every C steps reset Q^ = Q
15:         Terminate the episode if the source is found
16:     end for
17: end for
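
For concreteness, the following is a minimal PyTorch sketch of the core updates in Algorithm A1: ϵ-greedy action selection (step 7), the Double Q-learning target yj with a Huber-loss gradient step (steps 11–13), and the periodic target reset (step 14). The CNN architecture, the tensor encoding of the states ϕ = (s, n, m), and all hyperparameters (γ, ϵ, minibatch size, C) are illustrative assumptions, not the paper's actual implementation.

    # Minimal Double Q-learning sketch; network and shapes are placeholders.
    import random

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class QNetwork(nn.Module):
        # Placeholder CNN standing in for the paper's Q network (steps 1-2).
        def __init__(self, in_channels: int, n_actions: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(32, n_actions)

        def forward(self, x):
            return self.head(self.features(x))


    def select_action(q_net, phi, n_actions, epsilon):
        # Step 7: epsilon-greedy action selection based on Q.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        with torch.no_grad():
            return int(q_net(phi.unsqueeze(0)).argmax(dim=1).item())


    def double_q_step(q_net, target_net, optimizer, batch, gamma):
        # Steps 11-13: Double Q-learning target yj and one Huber-loss gradient step.
        phi, a, r, phi_next, done = batch  # stacked tensors from the sampled minibatch

        with torch.no_grad():
            # The greedy action is chosen by the online network Q ...
            next_a = q_net(phi_next).argmax(dim=1, keepdim=True)
            # ... while its value is read from the target network Q^.
            next_q = target_net(phi_next).gather(1, next_a).squeeze(1)

        # yj = rj on terminal transitions, otherwise rj + gamma * Q^(phi_{j+1}, argmax_a Q(phi_{j+1}, a)).
        y = r + gamma * next_q * (1.0 - done)

        q_sa = q_net(phi).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.smooth_l1_loss(q_sa, y)  # Huber loss of step 13

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()


    def sync_target(q_net, target_net):
        # Step 14: every C steps reset Q^ = Q.
        target_net.load_state_dict(q_net.state_dict())

In this sketch the replay memory D (steps 3, 10, 11) would typically be a bounded buffer, e.g. a collections.deque with a fixed maxlen, sampled uniformly to form the minibatch; its capacity and the minibatch size are not specified in the listing above.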