Micromachines. 2022 Mar 17;13(3):458. doi: 10.3390/mi13030458
Algorithm 1 DDPG-ID Algorithm.
 1:  Randomly initialize the online Q network with weights w_Q
 2:  Randomly initialize the online policy network with weights w_μ
 3:  Initialize the target Q network by w_Q′ ← w_Q
 4:  Initialize the target policy network by w_μ′ ← w_μ
 5:  Initialize the experience replay buffer Ψ
 6:  Load the simplified micropositioner dynamic model
 7:  for episode = 1, MaxEpisode do
 8:      Initialize a noise process N for exploration
 9:      Initialize the ASMDO and ID compensator
10:      Randomly initialize the micropositioner states
11:      Receive the initial observation state s_1
12:      for step = 1, T do
13:          Select action a_t = π_O(s_t) + D̂_t + N_t
14:          Use a_t to run the micropositioner system model
15:          Process errors with the integral differential compensator
16:          Receive reward r_t and new state s_{t+1}
17:          Store the transition (s_t, a_t, r_t, s_{t+1}) in replay buffer Ψ
18:          Randomly sample a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) from Ψ
19:          Set Q_T = r_j + γ Q_T(s_{j+1}, π_T(s_{j+1}, w_μ′), w_Q′)
20:          Minimize the loss L(w_Q) = (1/M) Σ_{j=1}^{M} (Q_T − Q_O(s_j, a_j, w_Q))² to update the online Q network
21:          Update the online policy network with the sampled policy gradient:
                 ∇_{w_μ} J = (1/M) Σ_j ∇_{a_j} Q_O(s_j, a_j, w_Q) ∇_{w_μ} π_O(s_j, w_μ)
22:          Update the target networks: w_Q′ ← τ w_Q + (1−τ) w_Q′, w_μ′ ← τ w_μ + (1−τ) w_μ′
23:      end for
24:  end for
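The inner-loop updates of the algorithm (steps 19–22: target bootstrapping, critic regression, policy gradient, and Polyak target update) can be sketched in a minimal form. The snippet below uses tiny linear stand-ins for the actor and critic; the dimensions, learning rates, and linear models are illustrative assumptions, not the paper's neural-network implementation, and the micropositioner-specific terms (ASMDO, the D̂_t disturbance estimate, the ID compensator) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear stand-ins for the networks (assumed for illustration only).
STATE_DIM, ACTION_DIM = 3, 1
w_Q = rng.normal(size=STATE_DIM + ACTION_DIM)    # online critic weights w_Q
w_mu = rng.normal(size=(ACTION_DIM, STATE_DIM))  # online actor weights w_mu
w_Q_t = w_Q.copy()                               # target critic weights w_Q'
w_mu_t = w_mu.copy()                             # target actor weights w_mu'

def Q(w, s, a):
    """Linear critic: Q(s, a) = w · [s, a]."""
    return np.concatenate([s, a], axis=-1) @ w

def pi(w, s):
    """Linear actor: a = w s."""
    return s @ w.T

GAMMA, TAU, LR = 0.99, 0.005, 1e-2  # discount, Polyak rate, step size (assumed)

def ddpg_update(batch):
    """One DDPG update from a minibatch (s, a, r, s_next), as in steps 19-22."""
    global w_Q, w_mu, w_Q_t, w_mu_t
    s, a, r, s_next = batch
    # Step 19: bootstrap the target with the TARGET actor and critic.
    a_next = pi(w_mu_t, s_next)
    q_target = r + GAMMA * Q(w_Q_t, s_next, a_next)
    # Step 20: one gradient step on the mean-squared TD error.
    feats = np.concatenate([s, a], axis=-1)
    td = q_target - feats @ w_Q
    w_Q = w_Q + LR * feats.T @ td / len(td)
    # Step 21: deterministic policy gradient — for a linear critic,
    # dQ/da is just the action part of w_Q, and d(pi)/d(w_mu) is s.
    dQ_da = w_Q[STATE_DIM:]
    w_mu = w_mu + LR * np.outer(dQ_da, s.mean(axis=0))
    # Step 22: Polyak (soft) update of both target networks.
    w_Q_t = TAU * w_Q + (1 - TAU) * w_Q_t
    w_mu_t = TAU * w_mu + (1 - TAU) * w_mu_t
    return float(np.mean(td ** 2))

# One synthetic minibatch of M = 32 transitions.
M = 32
batch = (rng.normal(size=(M, STATE_DIM)),
         rng.normal(size=(M, ACTION_DIM)),
         rng.normal(size=M),
         rng.normal(size=(M, STATE_DIM)))
loss_before = ddpg_update(batch)
loss_after = ddpg_update(batch)
```

Because the target networks move only at rate τ, the regression target in step 19 is nearly fixed between consecutive updates, so repeated gradient steps on the same minibatch shrink the TD error — the stabilizing design choice behind DDPG's twin target networks.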