Algorithm A1 Double Q-learning with CNN.
1:  Initialize action-value function $Q$ with weights $\theta$
2:  Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
3:  Initialize replay memory $D$
4:  for episode $= 1, M$ do
5:      Initialize state $s_1$
6:      for $t = 1, T$ do
7:          Select action $a_t$ using an $\epsilon$-greedy policy based on $Q$
8:          Execute action $a_t$, obtain reward $r_t$ and state $s_{t+1}$
9:          Update the state $s_t \rightarrow s_{t+1}$
10:         Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
11:         Sample a random minibatch of transitions $(s_j, a_j, r_j, s_{j+1})$ from $D$
12:         Set $y_j = r_j$ if the episode terminates at step $j+1$; otherwise set $y_j = r_j + \gamma\, \hat{Q}\big(s_{j+1}, \arg\max_a Q(s_{j+1}, a; \theta); \theta^-\big)$
13:         Perform a gradient descent step on $\mathrm{HuberLoss}\big(y_j, Q(s_j, a_j; \theta)\big)$ with respect to $\theta$
14:         Every $C$ steps reset $\hat{Q} = Q$
15:         Terminate the episode if the source is found
16:     end for
17: end for
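For concreteness, the listing below is a minimal PyTorch sketch of this training loop. The CNN architecture, hyperparameters, and the environment interface (`env.reset()` / `env.step()` returning `(state, reward, done)`) are illustrative assumptions, not taken from the paper; only the Double Q-learning target (step 12), the Huber loss update (step 13), and the periodic target reset (step 14) follow the algorithm above.

```python
# Minimal Double Q-learning sketch with a CNN (PyTorch).
# QNet, train, and the env interface are hypothetical names for illustration.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Small CNN mapping an image state to one Q-value per action (assumed shape)."""
    def __init__(self, in_channels, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),  # fixed feature size for any input resolution
        )
        self.head = nn.Linear(32 * 7 * 7, n_actions)

    def forward(self, x):
        return self.head(self.conv(x).flatten(1))

def train(env, n_actions, in_channels=4, M=500, T=1000,
          gamma=0.99, eps=0.1, batch_size=32, C=1000):
    q = QNet(in_channels, n_actions)             # 1: online network Q, weights theta
    q_target = QNet(in_channels, n_actions)      # 2: target network Q^, weights theta^-
    q_target.load_state_dict(q.state_dict())
    opt = torch.optim.Adam(q.parameters(), lr=1e-4)
    D = deque(maxlen=100_000)                    # 3: replay memory D
    step = 0
    for episode in range(M):                     # 4
        s = torch.as_tensor(env.reset(), dtype=torch.float32)  # 5: initial state
        for t in range(T):                       # 6
            # 7: epsilon-greedy action selection based on Q
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = q(s.unsqueeze(0)).argmax(1).item()
            s2, r, done = env.step(a)            # 8: execute a_t, get r_t and s_{t+1}
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            D.append((s, a, r, s2, done))        # 10: store transition in D
            s = s2                               # 9: update the state
            step += 1
            if len(D) >= batch_size:
                # 11: sample a random minibatch of transitions from D
                ss, aa, rr, ss2, dd = zip(*random.sample(D, batch_size))
                ss, ss2 = torch.stack(ss), torch.stack(ss2)
                aa = torch.tensor(aa)
                rr = torch.tensor(rr, dtype=torch.float32)
                dd = torch.tensor(dd, dtype=torch.float32)
                # 12: Double Q-learning target -- the online net selects the
                # action, the target net evaluates it; terminal states get y = r
                with torch.no_grad():
                    a_star = q(ss2).argmax(1, keepdim=True)
                    y = rr + gamma * (1 - dd) * q_target(ss2).gather(1, a_star).squeeze(1)
                # 13: gradient descent step on the Huber loss w.r.t. theta
                loss = F.smooth_l1_loss(q(ss).gather(1, aa.unsqueeze(1)).squeeze(1), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
            if step % C == 0:                    # 14: reset Q^ = Q every C steps
                q_target.load_state_dict(q.state_dict())
            if done:                             # 15: terminate if the source is found
                break
```

`F.smooth_l1_loss` is PyTorch's Huber loss (quadratic near zero, linear for large errors), which is what makes step 13 robust to occasional large temporal-difference errors.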