|
Algorithm 1: Training process for reward learning
|
Input: number of full episodes, timesteps, fixed parameters, target firing rate, regularization hyper-parameters, bandwidth, predicted value function, and sum of future rewards.
Output: total loss.
1. Parameters setting: , , , .
2. for n in batch size N:
3.     Set
4.     if the iteration number is 0:
5.
6.     else:
       where #{·} indicates counting the samples that satisfy the condition
7.
8.     =
9. end for
10. for k in K:
11.     for t in T:
12.     end for
13. end for
14. Calculate the total loss:
15. return
|
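The two parts of Algorithm 1 that survive in words are the counting operator #{·} and the nested loops over K episodes and T timesteps. The following is a minimal sketch of both, not the authors' implementation. The condition inside `count_within` (a sample lying within one bandwidth of a center) and the squared-error form of `value_loss` are assumptions made for illustration, and both function names are hypothetical. The regularization terms and the exact form of the total loss are not reproduced here because they are not recoverable from the listing.

```python
def count_within(samples, center, bandwidth):
    """Counting operator #{.}: number of samples that satisfy a condition.

    Assumed condition (not stated in the source): |x - center| <= bandwidth.
    """
    return sum(1 for x in samples if abs(x - center) <= bandwidth)


def value_loss(predicted, returns):
    """Accumulate an error over K episodes and T timesteps.

    predicted[k][t]: predicted value function at episode k, timestep t.
    returns[k][t]:   sum of future rewards at episode k, timestep t.
    The mean squared error used here is an assumed form of the loss.
    """
    K = len(predicted)
    T = len(predicted[0])
    total = 0.0
    for k in range(K):          # mirrors "for k in K"
        for t in range(T):      # mirrors "for t in T"
            total += (predicted[k][t] - returns[k][t]) ** 2
    return total / (K * T)


if __name__ == "__main__":
    # Three of the four samples lie within 0.75 of the center 0.5.
    print(count_within([0.0, 0.5, 1.0, 2.0], center=0.5, bandwidth=0.75))
    # Squared errors 0, 1, 4, 9 averaged over K*T = 4 entries.
    print(value_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]))
```

Under these assumptions, the total loss in step 14 would add the regularization terms, weighted by their hyper-parameters, to the value returned by `value_loss`.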