Algorithm 1 DDPG-ID Algorithm
1:  Randomly initialize the online Q network Q(s, a | θ^Q) with weights θ^Q
2:  Randomly initialize the online policy network μ(s | θ^μ) with weights θ^μ
3:  Initialize the target Q network Q′ by θ^Q′ ← θ^Q
4:  Initialize the target policy network μ′ by θ^μ′ ← θ^μ
5:  Initialize the experience replay buffer R
6:  Load the simplified micropositioner dynamic model
7:  for episode = 1, MaxEpisode do
8:      Initialize a noise process N for exploration
9:      Initialize the ASMDO and the ID compensator
10:     Randomly initialize the micropositioner states
11:     Receive the initial observation state s_1
12:     for step t = 1, T do
13:         Select action a_t = μ(s_t | θ^μ) + N_t
14:         Use a_t to run the micropositioner system model
15:         Process the errors with the integral differential (ID) compensator
16:         Receive reward r_t and new state s_{t+1}
17:         Store transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer R
18:         Randomly sample a minibatch of M transitions (s_i, a_i, r_i, s_{i+1}) from R
19:         Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)
20:         Minimize the loss L = (1/M) Σ_i (y_i − Q(s_i, a_i | θ^Q))² to update the online Q network
21:         Use the sampled policy gradient to update the online policy network:
            ∇_{θ^μ} J ≈ (1/M) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
22:         Update the target networks: θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′, θ^μ′ ← τ θ^μ + (1 − τ) θ^μ′
23:     end for
24: end for
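The training loop above can be sketched in a few dozen lines. The sketch below is a minimal, illustrative DDPG loop, not the paper's method: the 1-D toy plant, reward, gains, and hyperparameters are assumptions, the actor and critic are linear so the gradients can be written by hand, and the ASMDO and ID-compensator steps (lines 9 and 15) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear actor a = w_mu * s and linear critic Q(s, a) = w_s*s + w_a*a
# (toy stand-ins for the online networks of lines 1-2).
w_mu = rng.normal(scale=0.1)              # online policy weight
w_q = rng.normal(scale=0.1, size=2)       # online Q weights [w_s, w_a]
w_mu_t, w_q_t = w_mu, w_q.copy()          # target networks (lines 3-4)

buffer = []                               # experience replay buffer (line 5)
gamma, tau, lr_q, lr_mu, M = 0.95, 0.01, 1e-2, 1e-3, 32

def step_env(s, a):
    """Assumed 1-D toy plant: the action nudges the positioning error."""
    s_next = 0.9 * s + 0.1 * np.clip(a, -1.0, 1.0)
    return s_next, -s_next ** 2           # reward penalizes remaining error

for episode in range(50):                            # line 7
    s = rng.uniform(-1.0, 1.0)                       # random initial state (line 10)
    for t in range(30):                              # line 12
        a = w_mu * s + rng.normal(scale=0.1)         # action + exploration noise (line 13)
        s_next, r = step_env(s, a)                   # lines 14, 16
        buffer.append((s, a, r, s_next))             # line 17
        s = s_next
        if len(buffer) < M:
            continue
        idx = rng.choice(len(buffer), size=M, replace=False)
        S, A, R, S2 = map(np.array, zip(*[buffer[i] for i in idx]))  # line 18
        # Targets y_i computed with the *target* networks (line 19).
        y = R + gamma * (w_q_t[0] * S2 + w_q_t[1] * (w_mu_t * S2))
        # Critic update: gradient of the mean squared TD error (line 20).
        td = (w_q[0] * S + w_q[1] * A) - y
        w_q -= lr_q * np.array([(td * S).mean(), (td * A).mean()])
        # Actor update via the sampled policy gradient (line 21):
        # dQ/da = w_q[1], dmu/dw_mu = s  =>  dJ/dw_mu = mean(w_q[1] * s).
        w_mu += lr_mu * (w_q[1] * S).mean()
        # Soft target-network updates (line 22).
        w_q_t = tau * w_q + (1 - tau) * w_q_t
        w_mu_t = tau * w_mu + (1 - tau) * w_mu_t
```

With nonlinear networks the hand-written gradients would be replaced by automatic differentiation, but the structure (replay sampling, target-network bootstrapping, and soft updates) is the same as in the listing.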