Micromachines. 2022 Mar 17;13(3):458. doi: 10.3390/mi13030458

Adaptive Sliding Mode Disturbance Observer and Deep Reinforcement Learning Based Motion Control for Micropositioners

Shiyun Liang 1, Ruidong Xi 1, Xiao Xiao 2, Zhixin Yang 1,*
Editor: Duc Truong Pham
PMCID: PMC8955352  PMID: 35334749

Abstract

The motion control of high-precision electromechanical systems, such as micropositioners, is challenging in terms of the inherent high nonlinearity, the sensitivity to external interference, and the complexity of accurately identifying the model parameters. To cope with these problems, this work investigates a disturbance observer-based deep reinforcement learning control strategy to realize high robustness and precise tracking performance. Reinforcement learning has shown great potential as an optimal control scheme; however, its application in micropositioning systems is still rare. Therefore, a deep deterministic policy gradient (DDPG) embedded with an integral differential compensator (ID) is utilized in this work, which not only decreases the state error but also improves the transient response speed. In addition, an adaptive sliding mode disturbance observer (ASMDO) is proposed to further eliminate the collective effect caused by the lumped disturbances. The micropositioner controlled by the proposed algorithm can track the target path precisely, with less than 1 μm error in both simulations and actual experiments, which demonstrates the excellent performance and accuracy improvement of the controller.

Keywords: micropositioners, reinforcement learning, disturbance observer, deep deterministic policy gradient

1. Introduction

Micropositioning technologies based on smart materials have gained much attention in precision industries for numerous potential applications in optical steering, micro-assembly, nano-inscribing, cell manipulation, etc. [1,2,3,4,5,6,7]. One of the greatest challenges in this research field is the uncertainty produced by various factors, such as the dynamic model, environmental temperature, sensor performance, and the actuators' nonlinear characteristics [8,9], which makes the control of micropositioning systems a demanding problem.

To address the uncertainty problem, different kinds of control approaches have been developed, such as the PID control method [10], sliding mode control [11,12], and adaptive control [13]. In addition, many researchers have integrated these control strategies to further improve the control performance. Montalvo et al. proposed a scalable field-programmable gate array-based motion control system with a parabolic velocity profile [14]. A new seven-segment profile algorithm was developed by García-Martínez et al. to improve the performance of the motion controller [15]. Combined with the backstepping strategy, Fei et al. proposed an adaptive fuzzy sliding mode controller in [16]. Based on the radial basis function neural network (RBFNN) and sliding mode control (SMC), Ruan et al. developed an RBFNN-SMC for nonlinear electromechanical actuator systems [17]. Gharib et al. designed a PID controller with a feedback linearization technique for path tracking control of a micropositioner [18]. Nevertheless, the performance and robustness of such model-based control strategies are still limited by the precision of the dynamics model. On the other hand, a sophisticated system model frequently leads to a complex control strategy. Although many researchers have considered uncertainties and disturbances, it is still difficult to model the system precisely and comprehensively.

As the rapid development of artificial intelligence in recent years has profoundly impacted the traditional control field, learning-based and data-driven approaches, especially reinforcement learning (RL) and neural networks, have become a promising research topic. Different from traditional control strategies that need to make assumptions based on the dynamics model [19,20], reinforcement learning can directly learn the policy by interacting with the system. Back in 2005, Adda et al. presented a reinforcement learning algorithm for learning control of stochastic micromanipulation systems [21]. Li et al. designed a state–action–reward–state–action (SARSA) method using linear function approximation to generate an optimal path for a micropositioner [22]. However, reinforcement learning algorithms such as Q-learning [23] and SARSA [24], as utilized in the aforementioned works, are unable to deal with complex dynamics problems, especially continuous state–action space problems. With the spectacular improvement enjoyed by deep reinforcement learning (DRL), primarily driven by deep neural networks (DNN) [25], DRL algorithms such as the deep Q network (DQN) [26], policy gradient (PG) [27], deterministic policy gradient (DPG) [28], and deep deterministic policy gradient (DDPG) [29], with the ability to approximate the value function, have played an important role in continuous control tasks.

Latifi et al. introduced a model-free neural fitted Q iteration control method for micromanipulation devices, in which a DNN is adopted to represent the Q-value function [30]. Leinen et al. introduced the experience replay concept of DQN and the neural network approximation of the value function into the SARSA algorithm for the control of a scanning probe microscope [31]. Both simulation and real experimental results have shown that RL algorithms based on neural networks can achieve better performance than traditional control methods to some extent; however, due to the collective effects of disturbances generated by the nonlinear system and deviations in the value functions [29,32,33], RL control methods can induce significant inaccuracies in tracking control tasks [34]. To improve the anti-disturbance capability and control accuracy, disturbance rejection control [35], time-delay estimation based control [36], and disturbance observer-based controllers [37,38] have been proposed successively. To deal with this issue, a deep reinforcement learning controller integrated with an adaptive sliding mode disturbance observer (ASMDO) is developed in this work. Previous research on trajectory tracking control with DRL has shown that apparent state errors always exist [39,40,41,42]. One of the main reasons is the inaccurate estimation of the action value function in the DRL structure. As indicated in [43], even in elementary control tasks, accurate action values cannot be attained from the same action value function; therefore, in this work, the DDPG algorithm is extended with an integral differential compensator (DDPG-ID) to cope with this situation. In addition, a comparison of the reinforcement learning control method with common state-of-the-art control methods is listed in Table 1, which shows the pros and cons of these different methods.

Table 1.

Comparison of different control algorithms.

PID control
  Advantages: simple design structure; easy to implement
  Disadvantages: mainly applicable to linear systems; requires full-state feedback; lacks adaptivity

SMC control
  Advantages: simple design structure; easy to implement; high robustness
  Disadvantages: excessive chattering effect; lacks adaptivity

Adaptive control
  Advantages: lower initial cost; lower cost of redundancy; high reliability and performance
  Disadvantages: stability is not treated rigorously; high-gain observers are needed; slow convergence

Backstepping control
  Advantages: global stability; simple design structure; easy to integrate
  Disadvantages: low anti-interference ability; sensitive to system models; lacks adaptivity

RL control
  Advantages: no need for an accurate model; improved control performance; high adaptivity
  Disadvantages: poor anti-interference ability; prone to state errors

In this study, deep reinforcement learning is leveraged to build a novel optimal control scheme for complex systems. An anti-disturbance, stable, and precise control strategy is proposed for the trajectory tracking task of the micropositioner system. The contributions of this work are presented as follows:

  • (1)

    A DDPG-ID algorithm based on deep reinforcement learning is introduced as the basic motion controller of the micropositioner system, which avoids the dependence of traditional control strategies on the accuracy and comprehensiveness of the dynamic model;

  • (2)

    To eliminate the collective effect caused by the lumped disturbances from the micropositioner system and inaccurate estimation of the value function in deep reinforcement learning, an adaptive sliding mode disturbance observer (ASMDO) is proposed;

  • (3)

    An integral differential compensator is introduced in DDPG-ID to compensate for the feedback state of the system, which improves the accuracy and response time of the controller, and further improves the robustness of the controller subject to external disturbances.

The manuscript is structured as follows. Section 2 presents the system description of the micropositioner. In Section 3, we develop a deep reinforcement learning control method combined with the ASMDO and the compensator, and the parameters of the DNNs are illustrated. Then, the simulation setup and tracking results are given in Section 4.1. To further evaluate the performance of the proposed control strategy on the micropositioner, tracking experiments are presented in Section 4.2. Lastly, conclusions are given in Section 5.

2. System Description

The basic structure of the micropositioner is shown in Figure 1; it consists of a base, a platform, and a kinematic device. The kinematic device is composed of an armature, an electromagnetic actuator (EMA), and a chain mechanism driven by the actuator. As shown in Figure 1, the structure contains mutually perpendicular compliant chains actuated by the EMA. The movement of the chain mechanism is determined by the working air gap y. The EMA generates the magnetic force T_m, which can be approximated as:

T_m = k\left(\frac{I_c}{y+p}\right)^2 (1)

where k and p are constant parameters related to the electromagnetic actuator, I_c is the excitation current, and y is the working air gap between the armature and the EMA. Then, the electrical model of the system can be given as:

V_i = R I_c + \frac{d}{dt}\left(H I_c\right) (2)

where V_i is the input voltage of the EMA, R is the resistance of the coil, and H denotes the coil inductance, which can be given as:

H = H_1 + \frac{p H_0}{y+p} (3)

where H_1 is the coil inductance when the air gap is infinite, and H_0 is the incremental inductance when the gap is zero. The motion equation of the micropositioner can be expressed as:

m\frac{d^2 y}{dt^2} = \iota(\alpha_0 - y) - T_m (4)

where ι is the stiffness along the motion direction of the system, and α_0 is the initial air gap.

Figure 1. The diagrammatic model of EMA actuated micropositioner. (a) The front view of micropositioner. (b) The end view of micropositioner. (c) The vertical view of micropositioner.

According to Equations (1)–(4), we define x_1 = y, x_2 = \dot{y}, and x_3 = I_c as the state variables and u = V_i as the control input. Then, the dynamics model of the electromagnetic actuator can be written as:

\dot{x}_1 = x_2
\dot{x}_2 = \frac{\iota}{m}(\alpha_0 - x_1) - \frac{k}{m}\left(\frac{x_3}{x_1+p}\right)^2
\dot{x}_3 = \frac{1}{H}\left(-R x_3 + \frac{H_0 p\, x_2 x_3}{(x_1+p)^2} + u\right) (5)

Defining the variables z_1 = x_1, z_2 = x_2, and z_3 = \frac{\iota}{m}(\alpha_0 - x_1) - \frac{k}{m}\left(\frac{x_3}{x_1+p}\right)^2, we have

\dot{z}_1 = z_2
\dot{z}_2 = z_3
\dot{z}_3 = f(x) + g(x)u (6)

where f(x) = -\frac{\iota x_2}{m} + \frac{2k x_3^2}{m(x_1+p)^2}\left(\frac{H(x_1+p) - p H_0}{H(x_1+p)^2}x_2 + \frac{R}{H}\right), g(x) = -\frac{2k x_3}{H m (x_1+p)^2}, and z_1 is the system output.
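For illustration, the nominal dynamics (5) translate directly into a state-derivative function. The following Python/NumPy sketch is only illustrative; the parameter values (iota, k, p, m, H0, H1, R, alpha0) are placeholders loosely inspired by Table 4 and must be replaced by the identified values of a specific device.

import numpy as np

# Placeholder parameters (illustrative only; substitute identified device values)
iota, k, p, m = 1.8e5, 8.8e-5, 1.1e-5, 0.0272    # stiffness, force constant, pole constant, mass
H0, H1, R, alpha0 = 0.67, 13.21, 43.66, 1.0e-3   # inductances (H), coil resistance (ohm), assumed initial air gap (m)

def ema_dynamics(x, u):
    """Right-hand side of Equation (5): x = [air gap y, velocity, coil current], u = input voltage."""
    x1, x2, x3 = x
    H = H1 + p * H0 / (x1 + p)                    # coil inductance, Equation (3)
    Tm = k * (x3 / (x1 + p)) ** 2                 # magnetic force, Equation (1)
    dx1 = x2
    dx2 = iota / m * (alpha0 - x1) - Tm / m       # motion equation, Equation (4)
    dx3 = (-R * x3 + H0 * p * x2 * x3 / (x1 + p) ** 2 + u) / H
    return np.array([dx1, dx2, dx3])

# One explicit Euler integration step with sampling time Ts
Ts = 0.01
x = np.array([1.0e-3, 0.0, 0.1])                  # illustrative initial state
x = x + Ts * ema_dynamics(x, u=5.0)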

In realistic engineering applications, there always exist uncertainties in the system, so Equation (6) can be rewritten as:

\dot{z}_i = z_{i+1}, \quad i = 1, 2
\dot{z}_3 = f_0(x) + g_0(x)u + \left(\Delta f(x) + \Delta g(x)u\right) + d (7)

where f_0(x) and g_0(x) denote the nominal parts of the micropositioner system, Δf(x) and Δg(x) denote the modeling uncertainties, and d denotes the external disturbance. Then, defining D = (\Delta f(x) + \Delta g(x)u) + d, we have

\dot{z}_i = z_{i+1}, \quad i = 1, 2
\dot{z}_3 = f_0(x) + g_0(x)u + D (8)

where D is the lumped system disturbance. The following assumption is adopted [44]:

Assumption 1.

The lumped disturbance D is bounded, with its upper bound less than a fixed parameter β_1, and the derivative of D is unknown but bounded.

Remark 1.

Assumption 1 is reasonable since all micropositioner platforms are accurately designed with identified parameters, and all disturbances remain within a controllable domain.

3. Design of ASMDO and DDPG-ID Algorithm

In this section, the adaptive sliding mode disturbance observer (ASMDO) is introduced based on the dynamics of the micropositioner. Then, the DDPG-ID control method and pseudocode are given.

3.1. Design of Adaptive Sliding Mode Disturbance Observer

To develop the ASMDO, a virtual dynamic system is first designed as

\dot{\eta}_i = \eta_{i+1}, \quad i = 1, 2
\dot{\eta}_3 = f(z) + g(z)u + \hat{D} + \rho (9)

where η_i (i = 1, 2, 3) are auxiliary variables, \hat{D} is the estimate of the lumped disturbance, and ρ denotes the sliding mode term introduced below.

Define a sliding variable S = \sigma_3 + k_2\sigma_2 + k_1\sigma_1, where \sigma_i = x_i - \eta_i (i = 1, 2, 3), and k_1 and k_2 are positive design parameters. Then, the sliding mode term ρ is designed as

\rho = \lambda_1 S + k_2\sigma_3 + k_1\sigma_2 + \lambda_2\,\mathrm{sgn}(S) (10)

where λ_1 and λ_2 are positive design parameters with \lambda_2 \ge \beta_1.

Choosing an unknown constant β_2 to represent the upper bound of \dot{D}, the ASMDO is proposed as:

\dot{\hat{D}} = k\left(\dot{x}_3 - f_0(z) - g_0(z)u - \hat{D}\right) + (\hat{\beta}_2 + \lambda_3)\,\mathrm{sgn}(\rho) (11)

where k and λ_3 are positive design parameters and \hat{\beta}_2 is the estimate of β_2, given by \dot{\hat{\beta}}_2 = -\delta_0\hat{\beta}_2 + |\rho|, where δ_0 is a small positive number.

Then, the output \hat{D} of the ASMDO is used as a compensation term in the control input to eliminate the uncertainties generated by the system and the external disturbances.
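As a minimal sketch of how the observer (9)–(11) could be discretized, the Python class below applies an explicit Euler rule at each sampling instant. It is written in terms of the transformed state z (i.e., σ_i = z_i − η_i and ż_3 in place of ẋ_3), which is one reading of the mixed x/z notation above, and all gains are placeholder values chosen only for illustration.

import numpy as np

class ASMDO:
    """Adaptive sliding mode disturbance observer, Equations (9)-(11), Euler-discretized."""
    def __init__(self, Ts=0.01, k_o=50.0, k1=10.0, k2=20.0,
                 lam1=5.0, lam2=0.5, lam3=0.1, delta0=0.01):
        self.Ts, self.k_o, self.k1, self.k2 = Ts, k_o, k1, k2
        self.lam1, self.lam2, self.lam3, self.delta0 = lam1, lam2, lam3, delta0
        self.eta = np.zeros(3)          # auxiliary states eta_1..eta_3
        self.D_hat = 0.0                # lumped disturbance estimate
        self.beta2_hat = 0.0            # estimate of the bound on dD/dt

    def step(self, z, dz3, f0, g0, u):
        sigma = z - self.eta                                         # sigma_i = z_i - eta_i
        S = sigma[2] + self.k2 * sigma[1] + self.k1 * sigma[0]       # sliding variable
        rho = (self.lam1 * S + self.k2 * sigma[2] + self.k1 * sigma[1]
               + self.lam2 * np.sign(S))                             # sliding mode term, Equation (10)
        eta_dot = np.array([self.eta[1], self.eta[2], f0 + g0 * u + self.D_hat + rho])
        self.eta = self.eta + self.Ts * eta_dot                      # virtual dynamics, Equation (9)
        D_hat_dot = (self.k_o * (dz3 - f0 - g0 * u - self.D_hat)
                     + (self.beta2_hat + self.lam3) * np.sign(rho))  # adaptive law, Equation (11)
        self.D_hat = self.D_hat + self.Ts * D_hat_dot
        self.beta2_hat = self.beta2_hat + self.Ts * (-self.delta0 * self.beta2_hat + abs(rho))
        return self.D_hat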

Remark 2.

Choosing V_1 = \frac{1}{2}S^2 and V_2 = \frac{1}{2}\left(\tilde{D}^2 + \tilde{\beta}_2^2\right), where \tilde{D} = D - \hat{D} and \tilde{\beta}_2 = \beta_2 - \hat{\beta}_2, as two Lyapunov functions and differentiating V_1 and V_2 with respect to time, it is straightforward to prove that both S and \tilde{D} converge exponentially to the equilibrium point; the proof is therefore omitted.

3.2. Design of DDPG-ID Algorithm for Micropositioner

The goal of reinforcement learning is to obtain a policy for the agent that maximizes the cumulative reward through interactions with the environment. The environment is usually formalized as a Markov decision process (MDP) described by a four-tuple (S, A, P, R), where S, A, P, and R represent the state space of the environment, the set of actions, the state transition probability function, and the reward function, respectively. At each time step t, the agent in the current state s_t \in S takes an action a_t \in A from the policy \pi(a_t|s_t); the agent then acquires a reward r_t \sim R(s_t, a_t) and enters the next state s_{t+1} according to the state transition probability function P(s_{t+1}|s_t, a_t). Based on the Markov property, the Bellman equation of the action–value function Q^\pi(s_t, a_t), which is used for calculating the future expected reward, can be given as:

Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[r_t + \gamma Q^\pi(s_{t+1}, a_{t+1})\right] (12)

where \gamma \in [0, 1] denotes the discount factor.

In the trajectory tracking control task of the micropositioner, the state s_t is a state array describing the air gap y of the micropositioner at time t, and the action a_t is the voltage u applied by the controller to the micropositioner. As shown in Figure 2, DDPG is an actor–critic algorithm, which has an actor and a critic. The actor is responsible for generating actions and interacting with the environment, while the critic evaluates the performance of the actor and guides the action in the next state.

Figure 2. The structure diagram of DDPG-ID algorithm.

The action–value function and the policy are parameterized by DNNs to handle the continuous states and actions of the micropositioner, with Q(s_t, a_t, w^Q) \approx Q^\pi(s_t, a_t) and \pi_{w^\mu}(a_t|s_t) \approx \pi(a_t|s_t), where w^Q and w^\mu are the parameters of the neural networks for the action–value function and the policy function, respectively. With the policy represented by a neural network approximation, the gradient-based update of the network is used to seek the optimal policy π.

DDPG-ID uses a deterministic policy π(s_t, w^μ) rather than a traditional stochastic policy π_{w^μ}(a_t|s_t), where the output of the policy is the action a_t with the highest probability for the current state s_t, i.e., π(s_t, w^μ) = a_t. The policy gradient is given as

\nabla_{w^\mu} J(\pi) = \mathbb{E}_{s \sim \rho^\pi}\left[\nabla_{w^\mu}\pi(s, w^\mu)\,\nabla_a Q(s, a, w^Q)\big|_{a = \pi(s, w^\mu)}\right] (13)

where J(\pi) = \mathbb{E}_\pi\left[\sum_{t=1}^{T}\gamma^{(t-1)} r_t\right] is the expectation of the discounted cumulative reward, T denotes the final time of a whole episode, and \rho^\pi is the distribution of states under the deterministic policy. The value function Q(s_t, a_t, w^Q) is updated by calculating the temporal-difference error (TD-error), which can be defined as

e_{TD} = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})) - Q(s_t, a_t) (14)

where e_{TD} is the TD-error and r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})) represents the TD target value. By minimizing the TD-error, the network parameters are updated through gradient backpropagation.

To avoid the convergence problem of a single network caused by the correlation between the TD target value and the current value [45,46], a target Q network Q^T(s_{t+1}, a_{t+1}, w^{Q'}) is introduced to calculate the network portion of the TD target value, and an online Q network Q^O(s_t, a_t, w^Q) is used to calculate the current value in the critic. These two DNNs have the same structure. The actor also has an online policy network π^O(s_t, w^μ) to generate the current action and a target policy network π^T(s_t, w^{μ'}) to provide the target action a_{t+1}. Here, w^{μ'} and w^{Q'} represent the parameters of the target policy and target Q networks, respectively.

To improve the stability and efficiency of RL training, the experience replay technique is utilized in this work, which saves the transition (s_t, a_t, r_t, s_{t+1}) into the experience replay buffer Ψ at each interaction with the environment for subsequent updates. At each training step, a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) is sampled from the experience replay buffer to calculate the gradients and update the neural networks.
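The experience replay mechanism described above can be sketched in a few lines of Python; the buffer capacity and minibatch size M mirror the hyperparameters later listed in Table 5.

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay buffer storing transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, M=64):
        # Randomly draw a minibatch of M transitions for the gradient updates
        batch = random.sample(self.buffer, M)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states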

An integral differential compensator is developed within the deep reinforcement learning structure to improve the accuracy and responsiveness of the tracking task, as shown in Figure 2. The integral portion of the state is utilized to adjust the control input continuously, which eventually reduces the tracking error. The differential portion is included to reduce system oscillation and speed up stabilization. The proposed compensator is designed as follows:

s_{ID}^t = y_e^t + \alpha\sum_{n=1}^{t} y_e^n + \beta\left(y_e^t - y_e^{t-1}\right) (15)

where s_{ID}^t represents the compensator error at time t, y_e^t = \left\|y_d^t - \hat{y}^t\right\|_2 is the error between the desired trajectory y_d^t and the measured air gap \hat{y}^t at time t, α is the integral gain, and β is the differential gain.

Then, the state s_t at time t can be described as:

s_t = \left[\,s_{ID}^t,\ \hat{y}^t,\ \dot{\hat{y}}^t,\ y_d^t,\ \dot{y}_d^t\,\right]^{T} (16)

where \dot{\hat{y}}^t and \dot{y}_d^t represent the derivatives of \hat{y}^t and y_d^t.
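A minimal sketch of the compensator (15) and the state vector (16) is given below. The integral and differential gains follow Table 5, while approximating the derivatives by backward finite differences is an implementation assumption not specified in the text.

class IDCompensator:
    """Integral differential compensator and state assembly, Equations (15)-(16)."""
    def __init__(self, alpha=0.01, beta=0.001, Ts=0.01):
        self.alpha, self.beta, self.Ts = alpha, beta, Ts
        self.err_sum = 0.0       # running sum of errors (integral part)
        self.err_prev = 0.0      # previous error (differential part)
        self.y_prev = 0.0
        self.yd_prev = 0.0

    def state(self, y_hat, y_d):
        y_e = abs(y_d - y_hat)                        # tracking error magnitude
        self.err_sum += y_e
        s_id = y_e + self.alpha * self.err_sum + self.beta * (y_e - self.err_prev)
        # Finite-difference derivatives of the measured and desired trajectories (assumption)
        y_hat_dot = (y_hat - self.y_prev) / self.Ts
        y_d_dot = (y_d - self.yd_prev) / self.Ts
        self.err_prev, self.y_prev, self.yd_prev = y_e, y_hat, y_d
        return [s_id, y_hat, y_hat_dot, y_d, y_d_dot]  # state s_t, Equation (16)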

The reward function r_t is designed to measure the tracking error:

r_t =
  -4,  y_e^t > 0.005
  +5,  0.003 < y_e^t \le 0.005
  +10, 0.001 < y_e^t \le 0.003
  +18, y_e^t \le 0.001 (17)
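The piecewise reward (17) maps directly to code; a minimal sketch:

def reward(y_e):
    """Piecewise reward of Equation (17); y_e is the tracking error magnitude."""
    if y_e > 0.005:
        return -4.0
    elif y_e > 0.003:     # 0.003 < y_e <= 0.005
        return 5.0
    elif y_e > 0.001:     # 0.001 < y_e <= 0.003
        return 10.0
    return 18.0           # y_e <= 0.001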

As shown in Figure 3, the adaptive sliding mode disturbance observer (ASMDO) is embedded into the DDPG-ID between the actor and the micropositioner system environment. The action a_t applied to the environment is expressed as

a_t = \pi^O(s_t, w^\mu) + \hat{D}_t + N_t (18)

where w^μ denotes the parameters of the online policy network π^O, \hat{D}_t is the disturbance estimate of the micropositioner system at time t, and N_t is Gaussian noise for action exploration.

Figure 3. System signal flow chart.

3.2.1. Critic Update

After sampling M transitions (s_j, a_j, r_j, s_{j+1}) from the experience replay buffer Ψ, the Q values are calculated. The online Q network is responsible for calculating the current Q value as follows:

Q^O(s_j, a_j, w^Q) = w^Q\phi(s_j, a_j) (19)

where \phi(s_j, a_j) represents the input of the online Q network, which is a feature vector consisting of the state s_j and the action a_j.

The target Q network Q^T is defined as:

Q^T\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'}), w^{Q'}\right) = w^{Q'}\phi\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'})\right) (20)

where \phi\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'})\right) is the input of the target Q network, which is a feature vector consisting of the state s_{j+1} and the target policy network output \pi^T(s_{j+1}, w^{\mu'}).

For the target policy network π^T, the equation is:

\pi^T(s_{j+1}, w^{\mu'}) = w^{\mu'} s_{j+1} (21)

Then, we rewrite the target Q value Q^T as:

Q^T = r_j + \gamma Q^T\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'}), w^{Q'}\right) (22)

where r_j is the reward from the selected samples.

Since M transitions (s_j, a_j, r_j, s_{j+1}) are sampled from the experience buffer Ψ, the loss function for updating the critic is shown in Equation (23):

L(w^Q) = \frac{1}{M}\sum_{j=1}^{M}\left(Q^T - Q^O(s_j, a_j, w^Q)\right)^2 (23)

where L(w^Q) is the loss value of the critic.

In order to smooth the target network update process, a soft update is applied instead of periodically copying the parameters:

w^{Q'} \leftarrow \tau w^Q + (1 - \tau)w^{Q'} (24)

where τ is the update factor, usually a small constant.
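Putting Equations (22)–(24) together, one critic update can be sketched as follows. The controller in this paper was implemented in MATLAB/Simulink, so the PyTorch code below is only an illustrative reimplementation; critic, critic_target, actor_target, and critic_opt are assumed to be predefined network and optimizer objects.

import torch
import torch.nn.functional as F

def critic_update(batch, critic, critic_target, actor_target, critic_opt,
                  gamma=0.99, tau=0.05):
    """One critic update: TD target (22), MSE loss (23), soft update (24)."""
    s, a, r, s_next = batch   # tensors of shape (M, ...) sampled from the replay buffer

    with torch.no_grad():
        a_next = actor_target(s_next)                          # target policy action
        q_target = r + gamma * critic_target(s_next, a_next)   # Equation (22)

    q_online = critic(s, a)
    loss = F.mse_loss(q_online, q_target)                      # Equation (23)

    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()

    # Soft update of the target critic, Equation (24)
    for p_t, p in zip(critic_target.parameters(), critic.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)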

The diagram of the Q network is shown in Figure 4, which is a parallel neural network. The Q network includes both a state portion and an action portion, and the output value of the Q network depends on both the state and the action. The state portion consists of a state input layer, three fully connected layers, and two ReLU layers inserted between the fully connected layers. The action portion contains an action input layer and one fully connected layer. The outputs of these two portions are combined and fed into the common part of the network, which contains a ReLU layer and one output layer.

Figure 4. The diagram of Q network.

The parameters of each layer in the Q network are shown in Table 2.

Table 2.

Q network parameters.

Network Layer Name Number of Nodes
StateLayer 5
CriticStateFC1 120
CriticStateFC2 60
CriticStateFC3 60
ActionInput 1
CriticActionFC1 60
addLayer 2
CriticOutput 1
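Following Figure 4 and the layer sizes of Table 2, the parallel critic network can be sketched in PyTorch as shown below; the layer names in Table 2 suggest a MATLAB implementation, so this is only an equivalent illustration, and the element-wise addition is assumed to realize the addLayer.

import torch.nn as nn

class Critic(nn.Module):
    """Parallel Q network of Figure 4 with the layer sizes of Table 2."""
    def __init__(self, state_dim=5, action_dim=1):
        super().__init__()
        # State branch: three fully connected layers with ReLU layers in between
        self.state_branch = nn.Sequential(
            nn.Linear(state_dim, 120), nn.ReLU(),   # CriticStateFC1
            nn.Linear(120, 60), nn.ReLU(),          # CriticStateFC2
            nn.Linear(60, 60),                      # CriticStateFC3
        )
        # Action branch: one fully connected layer
        self.action_branch = nn.Linear(action_dim, 60)   # CriticActionFC1
        # Common part: ReLU followed by the scalar Q output
        self.common = nn.Sequential(nn.ReLU(), nn.Linear(60, 1))   # CriticOutput

    def forward(self, state, action):
        # The two branches are merged (addLayer) before entering the common part
        merged = self.state_branch(state) + self.action_branch(action)
        return self.common(merged)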

3.2.2. Actor Update

The output of the online policy network is

\pi^O = w^\mu s_j (25)

Because a deterministic policy is used, the calculation of the policy gradient involves no integral over the action a; compared with the stochastic policy case, it instead involves the derivative of the value function Q^O with respect to the action a. The gradient formula can be rewritten as follows:

\nabla_{w^\mu} J \approx \frac{1}{M}\sum_{j}^{M}\left(\nabla_{a_j} Q^O(s_j, a_j, w^Q)\,\nabla_{w^\mu}\pi^O(s_j, w^\mu)\right) (26)

where the weights w^μ are updated with the gradient backpropagation method. The target policy network is also updated with the soft update pattern as follows:

w^{\mu'} \leftarrow \tau w^\mu + (1 - \tau)w^{\mu'} (27)

where τ is the update factor, usually a small constant.
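In practice, the sampled gradient (26) is usually realized by minimizing the loss −(1/M)Σ_j Q^O(s_j, π^O(s_j, w^μ)), whose gradient with respect to w^μ equals Equation (26) by the chain rule. A PyTorch sketch under the same assumptions as the critic sketch above:

def actor_update(states, actor, actor_target, critic, actor_opt, tau=0.05):
    """One actor update: sampled policy gradient (26) and soft update (27)."""
    # Minimizing -Q ascends the sampled deterministic policy gradient of Equation (26)
    loss = -critic(states, actor(states)).mean()

    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

    # Soft update of the target policy network, Equation (27)
    for p_t, p in zip(actor_target.parameters(), actor.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)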

Figure 5 shows the diagram of the policy network in this paper, which contains a state input layer, a fully connected layer, a tanh layer, and an output layer. The parameters of each layer in the policy network are shown in Table 3.

Figure 5. The diagram of policy network.

Table 3.

Policy network parameters.

Network Layer Name Number of Nodes
StateLayer 5
ActorFC1 30
ActorOutput 1
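A matching PyTorch sketch of the policy network of Figure 5 and Table 3 (again only an illustrative reimplementation):

import torch.nn as nn

class Actor(nn.Module):
    """Policy network of Figure 5 with the layer sizes of Table 3."""
    def __init__(self, state_dim=5, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 30),   # ActorFC1
            nn.Tanh(),                  # tanh layer
            nn.Linear(30, action_dim),  # ActorOutput
        )

    def forward(self, state):
        return self.net(state)

In a complete implementation, the raw output would additionally be scaled or clipped to the admissible voltage range (roughly 0–11 V according to Section 4), and the disturbance compensation \hat{D}_t and the exploration noise N_t of Equation (18) would be added outside the network.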

The pseudocode of Algorithm 1 is given as follows:

Algorithm 1 DDPG-ID Algorithm.
1:  Randomly initialize the online Q network with weights w^Q
2:  Randomly initialize the online policy network with weights w^μ
3:  Initialize the target Q network by w^{Q'} ← w^Q
4:  Initialize the target policy network by w^{μ'} ← w^μ
5:  Initialize the experience replay buffer Ψ
6:  Load the simplified micropositioner dynamic model
7:  for episode = 1, MaxEpisode do
8:      Initialize a noise process N for exploration
9:      Initialize the ASMDO and the ID compensator
10:     Randomly initialize the micropositioner states
11:     Receive the initial observation state s_1
12:     for step = 1, T do
13:         Select action a_t = π^O(s_t) + \hat{D}_t + N_t
14:         Use a_t to run the micropositioner system model
15:         Process the errors with the integral differential compensator
16:         Receive reward r_t and new state s_{t+1}
17:         Store transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer Ψ
18:         Randomly sample a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) from Ψ
19:         Set Q^T = r_j + γ Q^T(s_{j+1}, π^T(s_{j+1}, w^{μ'}), w^{Q'})
20:         Minimize the loss L(w^Q) = \frac{1}{M}\sum_{j=1}^{M}(Q^T − Q^O(s_j, a_j, w^Q))^2 to update the online Q network
21:         Update the online policy network with the sampled policy gradient:
            ∇_{w^μ} J = \frac{1}{M}\sum_{j}^{M}(∇_{a_j} Q^O(s_j, a_j, w^Q) ∇_{w^μ} π^O(s_j, w^μ))
22:         Update the target networks: w^{Q'} ← τ w^Q + (1 − τ) w^{Q'}, w^{μ'} ← τ w^μ + (1 − τ) w^{μ'}
23:     end for
24: end for

4. Simulation and Experimental Results

In this section, two kinds of periodic external disturbances were added to verify the practicability of the proposed ASMDO, and three distinct desired trajectories were utilized to evaluate the performance of the proposed deep reinforcement learning control strategy. A traditional DDPG algorithm and a well-tuned PID strategy were adopted for comparison. To further verify the spatial performance of the proposed algorithm, two different trajectories were introduced in the experiments.

4.1. Simulation Results

The parametric equations of the two kinds of periodic external disturbances are defined as d_1 = 0.1\sin(2\pi t) + 0.1\sin(0.5\pi t + \frac{\pi}{3}) and d_2 = 0.1 + 0.1\sin(0.5\pi t + \frac{\pi}{3}). Based on the micropositioner model proposed in [44], the effectiveness of the observer is presented in Figure 6 and Figure 7.
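For reference, the two disturbance signals translate directly into code (a trivial NumPy sketch):

import numpy as np

def d1(t):
    # d1 = 0.1 sin(2*pi*t) + 0.1 sin(0.5*pi*t + pi/3)
    return 0.1 * np.sin(2 * np.pi * t) + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)

def d2(t):
    # d2 = 0.1 + 0.1 sin(0.5*pi*t + pi/3)
    return 0.1 + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)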

Figure 6. Observation result of ASMDO with d2. (a) Observing result based on the ASMDO. (b) Observing error based on the ASMDO.

Figure 7. Observation result of ASMDO with d1. (a) Observing result based on the ASMDO. (b) Observing error based on the ASMDO.

The disturbance estimation results of the proposed ASMDO are presented in Figure 6a and Figure 7a; it can be seen that the observer tracks the given disturbances rapidly. The estimation errors are less than 0.01 mm in Figure 6b and Figure 7b, which shows the effectiveness of the ASMDO as interference compensation.

The dynamics model of the micropositioner is given in Section 2, and its basic system model parameters are taken from our previous research [44,47], as shown in Table 4. The DDPG algorithm is configured with the same neural network structure and training parameters as DDPG-ID in this paper. The training parameters of DDPG-ID and DDPG are shown in Table 5.

Table 4.

Parameters of the micropositioner model.

Notation Value Unit
L1 13.21 H
L0 0.67 H
a 1.11 × 10^−5 m
R 43.66 Ω
c 8.83 × 10^−5 N·m^2·A^−2
k 1.803 × 10^5 N·m^−1
m 0.0272 kg

Table 5.

Training parameters of DDPG-ID and DDPG.

Hyperparameters Value
Learning rate for actor φ1 0.001
Learning rate for critic φ2 0.001
Discount factor γ 0.99
Initial exploration ε 1
Experience replay buffer size ψ 100,000
Minibatch size M 64
Max episode ϖ 1500
Soft update factor τ 0.05
Max exploration steps T 250 (25 s)
Time step Ts 0.01 s
Integral gain α 0.01
Differential gain β 0.001
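For convenience, the hyperparameters of Table 5 can be collected in a single Python configuration dictionary (values copied from the table; the variable name is arbitrary):

ddpg_id_config = {
    "actor_lr": 1e-3,          # learning rate for actor
    "critic_lr": 1e-3,         # learning rate for critic
    "gamma": 0.99,             # discount factor
    "initial_exploration": 1.0,
    "buffer_size": 100_000,    # experience replay buffer size
    "minibatch_size": 64,
    "max_episodes": 1500,
    "tau": 0.05,               # soft update factor
    "max_steps": 250,          # 25 s per episode
    "Ts": 0.01,                # time step (s)
    "alpha": 0.01,             # integral gain
    "beta": 0.001,             # differential gain
}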

The first desired trajectory designed for tracking control simulation is a waved signal. According to the initial conditions, the parametric equation of the waved trajectory is defined as:

y_d(t) = 0.985 - 0.015\sin\left(\frac{\pi t}{4} - \frac{\pi}{2}\right) (28)

The training processes of DDPG-ID and DDPG are both run on the same model with stochastically initialized micropositioner states. During the training evaluation, a larger episode reward indicates a more accurate, lower-error control policy. As shown in Figure 8, DDPG-ID reaches the maximum reward score within fewer episodes than DDPG, which reveals that the DDPG-ID algorithm converges faster than the DDPG algorithm. Comparing Figure 8a with Figure 8b, the average reward of the DDPG-ID training process is larger than that of DDPG in the steady state, which further indicates that the policy learned by the DDPG-ID algorithm has better performance. The trained algorithms are then employed for the tracking control simulation experiments of the micropositioner system.

Figure 8. The training rewards of two RL schemes. (a) The training rewards generated by DDPG-ID. (b) The training rewards generated by DDPG.

The tracking results of the waved trajectory are shown in Figure 9. The RMSE, MAX, and MEAN values of the tracking errors for the three control methods are provided in Table 6. In terms of tracking accuracy, the trained DDPG-ID controller performs better than DDPG and PID, with a smaller state error and a smoother tracking trajectory. The tracking error of the DDPG-ID algorithm ranges from −8 × 10^−4 to 9 × 10^−4 mm, which is almost half that of the DDPG policy. Meanwhile, the DDPG controller has a smaller tracking error than PID. A large oscillation is induced by the PID controller, which would affect the hardware to a certain extent in actual operation. This oscillating input signal is much larger than a normal control input signal, which typically ranges from 0 to 11 V. Based on the characteristics of reinforcement learning, it is hard for a well-trained policy to generate such a shock signal.

Figure 9. Tracking results comparison of the waved trajectory. (a) Tracking results comparison based on three control schemes. (b) Tracking error comparison based on three control schemes. (c) Control input comparison based on three control schemes.

Table 6.

Tracking errors comparison of different controllers in the waved trajectory.

Controller RMSE (mm) MAX (mm) MEAN (mm)
DDPG-ID 3.658 × 10^−4 4.758 × 10^−4 1.003 × 10^−4
DDPG 1.093 × 10^−3 2.615 × 10^−3 4.414 × 10^−4
PID 1.654 × 10^−3 3.144 × 10^−4 3.104 × 10^−4

The tracking results for a periodic trajectory are illustrated in Figure 10, and the tracking error comparison of the three control methods is given in Table 7. The parametric equation of the periodic trajectory is defined as

y_d(t) = 0.981 - 0.015\sin\left(\frac{\pi t}{4} - \frac{\pi}{2}\right) + 0.008\sin\left(\frac{\pi t}{2} - \frac{\pi}{16}\right). (29)

As can be seen in these figures, the tracking error of DDPG-ID for the periodic trajectory is still smaller than that of the other controllers, ranging from −1.6 × 10^−4 to 9 × 10^−4 mm. Similar to the previous waved trajectory, the control input based on DDPG shows better performance in terms of oscillations.

Figure 10. Tracking results comparison of the periodic trajectory. (a) Tracking results comparison based on three control schemes. (b) Tracking error comparison based on three control schemes. (c) Control input comparison based on three control schemes.

Table 7.

Tracking errors comparison of different controllers in the periodic trajectory.

Controller RMSE (mm) MAX (mm) MEAN (mm)
DDPG-ID 4.272 × 10^−4 8.471 × 10^−4 5.404 × 10^−5
DDPG 1.545 × 10^−3 3.102 × 10^−3 1.610 × 10^−4
PID 1.923 × 10^−3 3.376 × 10^−3 3.311 × 10^−4

To further demonstrate the universality of the DDPG-ID policy, a periodic step trajectory is also utilized for comparison. A step signal with a period of 8 s is designed as the desired trajectory, as shown in Figure 11a. The well-tuned PID controller was also tested on this step trajectory; since intense oscillations emerge and its performance is extremely poor, the PID results are not shown in this paper.

Figure 11. Tracking results comparison of the step trajectory. (a) Tracking results comparison based on two control schemes. (b) Tracking error comparison based on two control schemes. (c) Control input comparison based on two control schemes.

According to Figure 11, the tracking result of the DDPG-ID algorithm remains stable, with the tracking error bounded between −2 × 10^−4 and 9 × 10^−4 mm, which is still about half of that of DDPG. Due to the characteristics of the step signal, the state error becomes large during each step transition. The errors of DDPG-ID and DDPG are observed to drop quickly after each step transition. It can be seen from Table 8 that the errors of the DDPG-ID algorithm are substantially smaller than those of the DDPG algorithm. As for the control inputs, the value of DDPG still fluctuates considerably even after the state has converged.

Table 8.

Tracking errors comparison of different controllers in the step trajectory.

Controller RMSE (mm) MAX (mm) MEAN (mm)
DDPG-ID 4.612 × 10^−3 0.02953 6.938 × 10^−4
DDPG 5.279 × 10^−3 0.02986 1.437 × 10^−3

According to the above simulation results, it can be concluded that the control policy of DDPG-ID successfully deals with the collective effect caused by disturbances and the inaccurate value estimation of deep reinforcement learning, compared with DDPG. The comparison results also demonstrate the excellent control performance of the policy learned by the DDPG-ID algorithm.

4.2. Experimental Results

The speed, acceleration, and direction of the designed trajectories vary with time, which makes the experimental results more trustworthy. In each test, the EMA in the micropositioner is regulated to track the desired path of the working air gap.

As shown in Figure 12, a laser displacement sensor is utilized to detect the motion states. The DDPG-ID algorithm was then executed on a SimLab board through MATLAB/Simulink. The EMA controls the movement of the chain mechanism by executing the control signal from the analog output port of the SimLab board, while the analog input port of the SimLab board is connected to the signal output of the laser displacement sensor.

Figure 12. The schematic diagram of experiment system.

Figure 13 shows the tracking experiment results for the waved trajectory. The micropositioner first reaches the starting point along a straight track at a speed of 5.6 μm/s. At time 5 s, it begins to track the desired waved trajectory for three periods, where the waved trajectory is described as y_d(t) = 28 + 25\sin\left(\frac{\pi t}{10} + \frac{\pi}{2}\right). The tracking error fluctuates within ±1.5 μm, as demonstrated in Figure 13b. Except for several particular points in time, the tracking errors remain within ±1 μm.

Figure 13. Tracking results of the waved trajectory. (a) Tracking result of desired trajectory. (b) Tracking error of desired trajectory.

Another periodic trajectory tracking experiment was also executed. As shown in Figure 14, the desired periodic trajectory starts at time 5 s and is defined as y_d(t) = 35 - 25\sin\left(\frac{\pi t}{7.5} - \frac{2\pi}{3}\right) - 5\sin\left(\frac{\pi t}{15} + \frac{\pi}{6}\right). The tracking error of the periodic trajectory still remains within ±1.5 μm.

Figure 14. Tracking results of the periodic trajectory. (a) Tracking result of desired trajectory. (b) Tracking error of desired trajectory.

The experimental results show that the proposed DDPG-ID algorithm is able to closely track above two trajectories. Compared with the simulation results, the tracking error does not increase significantly, and it can be maintained between −1 μm and +1 μm.

5. Conclusions and Future Works

In this paper, a composite controller is developed based on an adaptive sliding mode disturbance observer and a deep reinforcement learning control scheme. A deep deterministic policy gradient is utilized to obtain the optimal control performance. To improve the tracking accuracy and transient response time, an integral differential compensator is applied during the learning process in the actor–critic framework. An adaptive sliding mode disturbance observer is developed to further reduce the influence of modeling uncertainty, external disturbances, and the effect of the inaccurate value function. In comparison with the existing DDPG and the most commonly used PID controller, the simulated trajectory tracking results indicate the satisfactory performance and precision of the control policy based on the DDPG-ID algorithm. The tracking errors are less than 1 μm, which shows the significant tracking efficiency of the proposed methods. The experimental results also indicate the high accuracy and strong anti-interference capability of the proposed deep reinforcement learning control scheme. In future work, to further improve the tracking performance and realize micro-manipulation tasks, specific operation experiments will be performed, such as cell manipulation and micro-assembly.

Abbreviations

PID Proportional–integral–derivative control
RBFNN Radial basis function neural network
RL Reinforcement learning
SARSA State-Action-Reward-State-Action
Q The Value of Action in reinforcement learning
DRL Deep reinforcement learning
DNN Deep neural networks
DQN Deep Q network
PG Policy gradient
DDPG Deep deterministic policy gradient
ID Integral differential compensator
Tm The magnetic force
y The working air gap in micropositioner
Ic The excitation current in micropositioner
EMA The electromagnetic actuator
Vi The input voltage of the electromagnetic actuator
R The resistance of the coil in micropositioner
H The coil inductance in micropositioner
u The control input
D The lumped system disturbance
ASMDO Adaptive Sliding Mode Disturbance Observer
st The state at time t in reinforcement learning
at The action at time t in reinforcement learning
rt The reward at time t in reinforcement learning
ReLU Rectified linear unit activation function
tanh Hyperbolic tangent activation function

Author Contributions

Writing—original draft preparation, S.L., R.X., X.X. and Z.Y.; writing—review and editing, S.L. and R.X.; data collection, S.L. and R.X.; visualization, S.L., R.X., X.X. and Z.Y.; supervision, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the Science and Technology Development Fund, Macau SAR (Grant No. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023), in part by the Ministry of Science and Technology of China (Grant No. 2019YFB1600700), in part by the Guangdong Science and Technology Department (Grant No. 2018B030324002 and 2020B1515130001), in part by the Zhuhai Science and Technology Innovation Bureau (Grant no. ZH22017002200001PWC), Jiangsu Science and Technology Department (Grant No. BZ2021061), and in part by the University of Macau (Grant No. MYRG2020-00253-FST).

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Català-Castro F., Martín-Badosa E. Positioning Accuracy in Holographic Optical Traps. Micromachines. 2021;12:559. doi: 10.3390/mi12050559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Bettahar H., Clévy C., Courjal N., Lutz P. Force-Position Photo-Robotic Approach for the High-Accurate Micro-Assembly of Photonic Devices. IEEE Robot. Autom. Lett. 2020;5:6396–6402. doi: 10.1109/LRA.2020.3014634. [DOI] [Google Scholar]
  • 3.Cox L.M., Martinez A.M., Blevins A.K., Sowan N., Ding Y., Bowman C.N. Nanoimprint lithography: Emergent materials and methods of actuation. Nano Today. 2020;31:100838. doi: 10.1016/j.nantod.2019.100838. [DOI] [Google Scholar]
  • 4.Dai C., Zhang Z., Lu Y., Shan G., Wang X., Zhao Q., Ru C., Sun Y. Robotic manipulation of deformable cells for orientation control. IEEE Trans. Robot. 2019;36:271–283. doi: 10.1109/TRO.2019.2946746. [DOI] [Google Scholar]
  • 5.Zhang P., Yang Z. A robust adaboost. rt based ensemble extreme learning machine. Math. Probl. Eng. 2015;2015:260970. doi: 10.1155/2015/260970. [DOI] [Google Scholar]
  • 6.Yang Z., Wong P., Vong C., Zong J., Liang J. Simultaneous-fault diagnosis of gas turbine generator systems using a pairwise-coupled probabilistic classifier. Math. Probl. Eng. 2013;2013:827128. doi: 10.1155/2013/827128. [DOI] [Google Scholar]
  • 7.Wang D., Zhou L., Yang Z., Cui Y., Wang L., Jiang J., Guo L. A new testing method for the dielectric response of oil-immersed transformer. IEEE Trans. Ind. Electron. 2019;67:10833–10843. doi: 10.1109/tie.2019.2959500. [DOI] [Google Scholar]
  • 8.Roshandel N., Soleymanzadeh D., Ghafarirad H., Koupaei A.S. A modified sensorless position estimation approach for piezoelectric bending actuators. Mech. Syst. Signal Process. 2021;149:107231. doi: 10.1016/j.ymssp.2020.107231. [DOI] [Google Scholar]
  • 9.Ding B., Yang Z.X., Xiao X., Zhang G. Design of reconfigurable planar micro-positioning stages based on function modules. IEEE Access. 2019;7:15102–15112. doi: 10.1109/ACCESS.2019.2894619. [DOI] [Google Scholar]
  • 10.García-Martínez J.R., Cruz-Miguel E.E., Carrillo-Serrano R.V., Mendoza-Mondragón F., Toledano-Ayala M., Rodríguez-Reséndiz J. A PID-type fuzzy logic controller-based approach for motion control applications. Sensors. 2020;20:5323. doi: 10.3390/s20185323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Salehi Kolahi M.R., Gharib M.R., Heydari A. Design of a non-singular fast terminal sliding mode control for second-order nonlinear systems with compound disturbance. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2021;235:7343–7352. doi: 10.1177/09544062211032990. [DOI] [Google Scholar]
  • 12.Nguyen M.H., Dao H.V., Ahn K.K. Adaptive Robust Position Control of Electro-Hydraulic Servo Systems with Large Uncertainties and Disturbances. Appl. Sci. 2022;12:794. doi: 10.3390/app12020794. [DOI] [Google Scholar]
  • 13.Cruz-Miguel E.E., García-Martínez J.R., Rodríguez-Reséndiz J., Carrillo-Serrano R.V. A new methodology for a retrofitted self-tuned controller with open-source fpga. Sensors. 2020;20:6155. doi: 10.3390/s20216155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Montalvo V., Estévez-Bén A.A., Rodríguez-Reséndiz J., Macias-Bobadilla G., Mendiola-Santíbañez J.D., Camarillo-Gómez K.A. FPGA-Based Architecture for Sensing Power Consumption on Parabolic and Trapezoidal Motion Profiles. Electronics. 2020;9:1301. doi: 10.3390/electronics9081301. [DOI] [Google Scholar]
  • 15.García-Martínez J.R., Rodríguez-Reséndiz J., Cruz-Miguel E.E. A new seven-segment profile algorithm for an open source architecture in a hybrid electronic platform. Electronics. 2019;8:652. doi: 10.3390/electronics8060652. [DOI] [Google Scholar]
  • 16.Fei J., Fang Y., Yuan Z. Adaptive Fuzzy Sliding Mode Control for a Micro Gyroscope with Backstepping Controller. Micromachines. 2020;11:968. doi: 10.3390/mi11110968. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Ruan W., Dong Q., Zhang X., Li Z. Friction Compensation Control of Electromechanical Actuator Based on Neural Network Adaptive Sliding Mode. Sensors. 2021;21:1508. doi: 10.3390/s21041508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Gharib M.R., Koochi A., Ghorbani M. Path tracking control of electromechanical micro-positioner by considering control effort of the system. Proc. Inst. Mech. Eng. Part I J. Syst. Control Eng. 2021;235:984–991. doi: 10.1177/0959651820953275. [DOI] [Google Scholar]
  • 19.Han M., Tian Y., Zhang L., Wang J., Pan W. Reinforcement learning control of constrained dynamic systems with uniformly ultimate boundedness stability guarantee. Automatica. 2021;129:109689. doi: 10.1016/j.automatica.2021.109689. [DOI] [Google Scholar]
  • 20.de Orio R.L., Ender J., Fiorentini S., Goes W., Selberherr S., Sverdlov V. Optimization of a spin-orbit torque switching scheme based on micromagnetic simulations and reinforcement learning. Micromachines. 2021;12:443. doi: 10.3390/mi12040443. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Adda C., Laurent G.J., Le Fort-Piat N. Learning to control a real micropositioning system in the STM-Q framework; Proceedings of the 2005 IEEE International Conference on Robotics and Automation; Barcelona, Spain. 18–22 April 2005; pp. 4569–4574. [Google Scholar]
  • 22.Li J., Li Z., Chen J. Reinforcement learning based precise positioning method for a millimeters-sized omnidirectional mobile microrobot; Proceedings of the International Conference on Intelligent Robotics and Applications; Wuhan, China. 15–17 October 2008; pp. 943–952. [Google Scholar]
  • 23.Shi H., Shi L., Sun G., Hwang K.S. Adaptive Image-Based Visual Servoing for Hovering Control of Quad-Rotor. IEEE Trans. Cogn. Dev. Syst. 2019;12:417–426. doi: 10.1109/TCDS.2019.2908923. [DOI] [Google Scholar]
  • 24.Zheng N., Ma Q., Jin M., Zhang S., Guan N., Yang Q., Dai J. Abdominal-waving control of tethered bumblebees based on sarsa with transformed reward. IEEE Trans. Cybern. 2018;49:3064–3073. doi: 10.1109/TCYB.2018.2838595. [DOI] [PubMed] [Google Scholar]
  • 25.Tang L., Yang Z.X., Jia K. Canonical correlation analysis regularization: An effective deep multiview learning baseline for RGB-D object recognition. IEEE Trans. Cogn. Dev. Syst. 2018;11:107–118. doi: 10.1109/TCDS.2018.2866587. [DOI] [Google Scholar]
  • 26.Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M. Playing atari with deep reinforcement learning. arXiv. 2013. arXiv:1312.5602. [Google Scholar]
  • 27.Sutton R.S., McAllester D.A., Singh S.P., Mansour Y. Policy gradient methods for reinforcement learning with function approximation; Proceedings of the Advances in Neural Information Processing Systems; Denver, CO, USA. 27–30 November 2000; pp. 1057–1063. [Google Scholar]
  • 28.Silver D., Lever G., Heess N., Degris T., Wierstra D., Riedmiller M. Deterministic policy gradient algorithms; Proceedings of the International Conference on Machine Learning (PMLR); Bejing, China. 22–24 June 2014; pp. 387–395. [Google Scholar]
  • 29.Lillicrap T.P., Hunt J.J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D. Continuous control with deep reinforcement learning. arXiv. 2015. arXiv:1509.02971. [Google Scholar]
  • 30.Latifi K., Kopitca A., Zhou Q. Model-free control for dynamic-field acoustic manipulation using reinforcement learning. IEEE Access. 2020;8:20597–20606. doi: 10.1109/ACCESS.2020.2969277. [DOI] [Google Scholar]
  • 31.Leinen P., Esders M., Schütt K.T., Wagner C., Müller K.R., Tautz F.S. Autonomous robotic nanofabrication with reinforcement learning. Sci. Adv. 2020;6:eabb6987. doi: 10.1126/sciadv.abb6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Mnih V., Kavukcuoglu K., Silver D., Rusu A.A., Veness J., Bellemare M.G., Graves A., Riedmiller M., Fidjeland A.K., Ostrovski G., et al. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236. [DOI] [PubMed] [Google Scholar]
  • 33.Zeng Y., Wang G., Xu B. A basal ganglia network centric reinforcement learning model and its application in unmanned aerial vehicle. IEEE Trans. Cogn. Dev. Syst. 2017;10:290–303. doi: 10.1109/TCDS.2017.2649564. [DOI] [Google Scholar]
  • 34.Guo X., Yan W., Cui R. Event-triggered reinforcement learning-based adaptive tracking control for completely unknown continuous-time nonlinear systems. IEEE Trans. Cybern. 2019;50:3231–3242. doi: 10.1109/TCYB.2019.2903108. [DOI] [PubMed] [Google Scholar]
  • 35.Zhang J., Shi P., Xia Y., Yang H., Wang S. Composite disturbance rejection control for Markovian Jump systems with external disturbances. Automatica. 2020;118:109019. doi: 10.1016/j.automatica.2020.109019. [DOI] [Google Scholar]
  • 36.Ahmed S., Wang H., Tian Y. Adaptive high-order terminal sliding mode control based on time delay estimation for the robotic manipulators with backlash hysteresis. IEEE Trans. Syst. Man Cybern. Syst. 2019;51:1128–1137. doi: 10.1109/TSMC.2019.2895588. [DOI] [Google Scholar]
  • 37.Chen M., Xiong S., Wu Q. Tracking flight control of quadrotor based on disturbance observer. IEEE Trans. Syst. Man Cybern. Syst. 2019;51:1414–1423. doi: 10.1109/TSMC.2019.2896891. [DOI] [Google Scholar]
  • 38.Zhao Z., He X., Ahn C.K. Boundary disturbance observer-based control of a vibrating single-link flexible manipulator. IEEE Trans. Syst. Man Cybern. Syst. 2019;51:2382–2390. doi: 10.1109/TSMC.2019.2912900. [DOI] [Google Scholar]
  • 39.Alibekov E., Kubalík J., Babuška R. Policy derivation methods for critic-only reinforcement learning in continuous spaces. Eng. Appl. Artif. Intell. 2018;69:178–187. doi: 10.1016/j.engappai.2017.12.004. [DOI] [Google Scholar]
  • 40.Hasselt H. Double Q-learning. Adv. Neural Inf. Process. Syst. 2010;23:2613–2621. [Google Scholar]
  • 41.Zhang S., Sun C., Feng Z., Hu G. Trajectory-Tracking Control of Robotic Systems via Deep Reinforcement Learning; Proceedings of the 2019 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM); Bangkok, Thailand. 18–20 November 2019; pp. 386–391. [Google Scholar]
  • 42.Kiumarsi B., Vamvoudakis K.G., Modares H., Lewis F.L. Optimal and autonomous control using reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2017;29:2042–2062. doi: 10.1109/TNNLS.2017.2773458. [DOI] [PubMed] [Google Scholar]
  • 43.Yang X., Zhang H., Wang Z. Policy Gradient Reinforcement Learning for Parameterized Continuous-Time Optimal Control; Proceedings of the 2021 33rd Chinese Control and Decision Conference (CCDC); Kunming, China. 22–24 May 2021; pp. 59–64. [Google Scholar]
  • 44.Xiao X., Xi R., Li Y., Tang Y., Ding B., Ren H., Meng M.Q.H. Design and control of a novel electromagnetic actuated 3-DoFs micropositioner. Microsyst. Technol. 2021;27:1–10. doi: 10.1007/s00542-020-05163-3. [DOI] [Google Scholar]
  • 45.Tommasino P., Caligiore D., Mirolli M., Baldassarre G. A reinforcement learning architecture that transfers knowledge between skills when solving multiple tasks. IEEE Trans. Cogn. Dev. Syst. 2016;11:292–317. [Google Scholar]
  • 46.Srikant R., Ying L. Finite-time error bounds for linear stochastic approximation andtd learning; Proceedings of the Conference on Learning Theory (PMLR); Phoenix, AZ, USA. 25–28 June 2019; pp. 2803–2830. [Google Scholar]
  • 47.Feng Z., Ming M., Ling J., Xiao X., Yang Z.X., Wan F. Fractional delay filter based repetitive control for precision tracking: Design and application to a piezoelectric nanopositioning stage. Mech. Syst. Signal Process. 2022;164:108249. doi: 10.1016/j.ymssp.2021.108249. [DOI] [Google Scholar]
