Algorithm 1 Training process of the improved D3QN
1: Input: total iteration rounds T, attenuation factor γ, and exploration rate ε.
2: Initialize the evaluation network parameters θ, the target network parameters θ⁻, the update frequency x, and the replay buffer D;
3: for t = 1 to T do
4:   With probability ε, randomly select an action from the optimized exploration space e; otherwise, select the action given by the greedy policy;
5:   Execute the selected action; the agent observes the new state s′ and the reward r;
6:   Store the transition (s, a, r, s′) in D;
7:   Randomly sample a batch of transitions (s, a, r, s′) from D and compute the Q-values of the evaluation network;
8:   Compute the loss and update the parameters θ;
9:   Every x steps, reset θ⁻ ← θ, update the target Q-values, and reduce the value of ε;
10:  s ← s′;
11: end for
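The loop above can be sketched in runnable form. This is a minimal illustration only: the toy environment, the state/action sizes, the restricted exploration space, and the tabular arrays standing in for the evaluation and target networks are all assumptions for demonstration, not the paper's actual setup. It does, however, follow the same step order: ε-greedy selection within a restricted space, replay storage, a double-Q update of the evaluation parameters, and a periodic target sync with ε decay.

```python
import random
import numpy as np

random.seed(0)

# Hypothetical toy environment: 5 states, 3 actions; action 2 moves forward,
# reward 1.0 on reaching the last state. A stand-in for the real task.
N_STATES, N_ACTIONS = 5, 3

def env_step(s, a):
    s_next = min(s + (1 if a == 2 else 0), N_STATES - 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r

# Steps 1-2: inputs and initialization (symbol names mirror Algorithm 1).
T = 500                # total iteration rounds
gamma = 0.9            # attenuation (discount) factor
eps = 1.0              # exploration rate epsilon
eps_decay = 0.99       # per-sync reduction of epsilon (step 9)
x = 20                 # target-network update frequency
alpha = 0.1            # learning rate (assumed)
D = []                 # replay buffer
Q_eval = np.zeros((N_STATES, N_ACTIONS))   # evaluation "network" (tabular stand-in)
Q_target = Q_eval.copy()                   # target "network"
explore_space = [1, 2]                     # optimized exploration space e (assumed subset)

s = 0
for t in range(1, T + 1):
    # Step 4: with probability eps, explore only within the restricted space e.
    if random.random() < eps:
        a = random.choice(explore_space)
    else:
        a = int(np.argmax(Q_eval[s]))
    # Step 5: act and observe the new state and reward.
    s_next, r = env_step(s, a)
    # Step 6: store the transition in D.
    D.append((s, a, r, s_next))
    # Steps 7-8: sample a batch and apply a double-DQN update: the evaluation
    # net picks the next action, the target net values it.
    batch = random.sample(D, min(len(D), 8))
    for (bs, ba, br, bs2) in batch:
        a_star = int(np.argmax(Q_eval[bs2]))
        target = br + gamma * Q_target[bs2, a_star]
        Q_eval[bs, ba] += alpha * (target - Q_eval[bs, ba])
    # Step 9: every x steps, reset the target parameters and decay eps.
    if t % x == 0:
        Q_target = Q_eval.copy()
        eps *= eps_decay
    # Step 10: advance the state (restart when the terminal state is reached).
    s = 0 if s_next == N_STATES - 1 else s_next
```

After training, `Q_eval[0]` should rank the forward action highest, and `eps` will have decayed once per target sync; the dueling-architecture value/advantage split of the full D3QN is omitted here for brevity.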