|
Algorithm 1 The UAV Maneuver Decision-Making Algorithm for Airdrop Task. |
Input:
The hyperparameters of the training networks: the minibatch size k and the networks' learning rate;
The hyperparameters of the policy update: the policy's learning rate, the learning period K, the memory capacity N, and the "soft" updating rate;
The hyperparameters of sampling: the availability exponent of prioritized experience replay (PER) and the importance-sampling (IS) exponent;
The control parameters of the simulation: the maximum number of periods M and the maximum number of steps per period T.
Output:
The actor network and its target network;
The critic network and its target network.
1: Initialize the actor and critic networks and their target networks.
2: for each period from 1 to M do
3:     Reset the environment and read the initial state.
4:     Output the initial action according to Equation (18).
5:     for each step from 1 to T do
6:         Observe the current state and reward of the environment and calculate the current action according to Equation (18).
7:         Save the current transition into the experience memory D.
8:         if the learning condition is met then
9:             Reset the accumulated IS-weighted gradient of the critic network.
10:            for each sample from 1 to k do
11:                Sample a training transition according to Equation (27).
12:                Calculate its IS weight according to Equation (31).
13:                Calculate the TD-error of the training transition according to Equation (22) and update its priority according to Equation (28).
14:                Accumulate the weighted gradient according to Equation (30).
15:            end for
16:            Update the parameters of the critic network according to the accumulated gradient with its learning rate.
17:            Update the parameters of the actor network according to Equation (26).
18:            Update the parameters of both target networks according to Equation (32).
19:        end if
20:    end for
21: end for
|
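The sampling and target-update steps of Algorithm 1 (steps 11 to 13 and step 18) can be sketched in a few lines of Python. This is a minimal illustration only, and it assumes the standard proportional form of PER: sampling probability proportional to priority raised to an exponent `alpha`, IS weights of the form (N·P(i))^(-beta) normalized by the largest weight in the batch, priority equal to the absolute TD-error plus a small constant, and a "soft" target update with rate `tau`. Equations (27), (28), (31), and (32) may differ in detail from these forms. All function and parameter names (`sampling_probabilities`, `importance_weights`, `priority_from_td_error`, `soft_update`, `sample_minibatch`, `alpha`, `beta`, `tau`, `eps`) are hypothetical, and network parameters are represented as plain lists of floats.

```python
import random


def sampling_probabilities(priorities, alpha):
    """Assumed proportional PER: P(i) = p_i^alpha / sum_j p_j^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, indices, beta):
    """Assumed IS weights: w_i = (N * P(i))^(-beta), divided by the batch maximum."""
    n = len(probs)
    raw = [(n * probs[i]) ** (-beta) for i in indices]
    largest = max(raw)
    return [w / largest for w in raw]


def priority_from_td_error(td_error, eps=1e-6):
    """Assumed priority rule: |TD-error| plus a small constant so it stays positive."""
    return abs(td_error) + eps


def soft_update(target, online, tau):
    """Assumed soft update: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


def sample_minibatch(priorities, k, alpha, beta, rng):
    """Draw k indices (with replacement) by priority and return them with IS weights."""
    probs = sampling_probabilities(priorities, alpha)
    indices = rng.choices(range(len(priorities)), weights=probs, k=k)
    weights = importance_weights(probs, indices, beta)
    return indices, weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical TD-errors for four stored transitions.
    priorities = [priority_from_td_error(d) for d in [0.5, -2.0, 0.1, 1.0]]
    indices, weights = sample_minibatch(priorities, k=3, alpha=0.6, beta=0.4, rng=rng)
    print(indices, weights)
    print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

In this sketch, setting `alpha` to 0 recovers uniform sampling, and setting `beta` to 0 turns off the IS correction (all weights become 1). In a full training loop, each IS weight would scale the corresponding sample's contribution to the critic gradient before the accumulated gradient is applied.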