Micromachines. 2022 Mar 17;13(3):458. doi: 10.3390/mi13030458

Adaptive Sliding Mode Disturbance Observer and Deep Reinforcement Learning Based Motion Control for Micropositioners

Shiyun Liang 1, Ruidong Xi 1, Xiao Xiao 2, Zhixin Yang 1,*
Editor: Duc Truong Pham
PMCID: PMC8955352  PMID: 35334749

Abstract

The motion control of high-precision electromechanical systems, such as micropositioners, is challenging in terms of the inherent high nonlinearity, the sensitivity to external interference, and the complexity of accurately identifying the model parameters. To cope with these problems, this work investigates a disturbance observer-based deep reinforcement learning control strategy to realize high robustness and precise tracking performance. Reinforcement learning has shown great potential as an optimal control scheme; however, its application in micropositioning systems is still rare. Therefore, a deep deterministic policy gradient (DDPG) embedded with an integral differential compensator (ID) is utilized in this work, which not only decreases the state error but also improves the transient response speed. In addition, an adaptive sliding mode disturbance observer (ASMDO) is proposed to further eliminate the collective effect caused by the lumped disturbances. The micropositioner controlled by the proposed algorithm can track the target path precisely, with less than 1 μm error in both simulations and actual experiments, which demonstrates the excellent performance and accuracy improvement of the controller.

Keywords: micropositioners, reinforcement learning, disturbance observer, deep deterministic policy gradient

1. Introduction

Micropositioning technologies based on smart materials have gained much attention in precision industries for numerous potential applications in optical steering, micro-assembly, nano-inscribing, cell manipulation, etc. [1,2,3,4,5,6,7]. One of the greatest challenges in this research field is the uncertainty produced by various factors, such as the dynamic model, environmental temperature, sensor performance, and the actuators' nonlinear characteristics [8,9], which makes the control of micropositioning systems a demanding problem.

To address the uncertainty problem, different kinds of control approaches have been developed, such as the PID control method [10], sliding mode control [11,12], and adaptive control [13]. In addition, many researchers have integrated these control strategies to further improve the control performance. Montalvo et al. proposed a scalable field-programmable gate array-based motion control system with a parabolic velocity profile [14]. A new seven-segment profile algorithm was developed by García-Martínez et al. to improve the performance of the motion controller [15]. Combined with the backstepping strategy, Fei et al. proposed an adaptive fuzzy sliding mode controller in [16]. Based on the radial basis function neural network (RBFNN) and sliding mode control (SMC), Ruan et al. developed an RBFNN-SMC for nonlinear electromechanical actuator systems [17]. Gharib et al. designed a PID controller with a feedback linearization technique for path tracking control of a micropositioner [18]. Nevertheless, the performance and robustness of such model-based control strategies are still limited by the precision of the dynamics model. On the other hand, a sophisticated system model frequently leads to a complex control strategy. Although many researchers have considered uncertainties and disturbances, it is still difficult to model the system precisely and comprehensively.

As the rapid development of artificial intelligence in recent years has profoundly impacted the traditional control field, learning-based and data-driven approaches, especially reinforcement learning (RL) and neural networks, have become a promising research topic. Different from traditional control strategies that need to make assumptions based on the dynamics model [19,20], reinforcement learning can directly learn the policy by interacting with the system. Back in 2005, Adda et al. presented a reinforcement learning algorithm for learning control of stochastic micromanipulation systems [21]. Li et al. designed a state–action–reward–state–action (SARSA) method using linear function approximation to generate an optimal path for a micropositioner [22]. However, reinforcement learning algorithms such as Q-learning [23] and SARSA [24], as utilized in the aforementioned works, are unable to deal with complex dynamics problems, especially continuous state–action space problems. With the spectacular improvement enjoyed by deep reinforcement learning (DRL), primarily driven by deep neural networks (DNN) [25], DRL algorithms such as the deep Q network (DQN) [26], policy gradient (PG) [27], deterministic policy gradient (DPG) [28], and deep deterministic policy gradient (DDPG) [29], with the ability to approximate the value function, have played an important role in continuous control tasks.

Latifi et al. introduced a model-free neural fitted Q iteration control method for micromanipulation devices, in which a DNN is adopted to represent the Q-value function [30]. Leinen et al. introduced the experience replay concept of DQN and the neural network approximation of the value function into the SARSA algorithm for the control of a scanning probe microscope [31]. Both simulation and real experimental results have shown that RL algorithms based on neural networks can achieve better performance than traditional control methods to some extent; however, due to the collective effects of disturbances generated by the nonlinear system and deviations in the value functions [29,32,33], RL control methods can induce significant inaccuracies in tracking control tasks [34]. To improve the anti-disturbance capability and control accuracy, disturbance rejection control [35], time-delay estimation based control [36], and disturbance observer-based controllers [37,38] have been proposed successively. To deal with this issue, a deep reinforcement learning controller integrated with an adaptive sliding mode disturbance observer (ASMDO) is developed in this work. Previous research on trajectory tracking control with DRL has shown that apparent state errors always exist [39,40,41,42]. One of the main reasons is the inaccurate estimation of the action value function in the DRL structure. As indicated in [43], even in elementary control tasks, accurate action values cannot be attained from the same action value function; therefore, in this work, the DDPG algorithm is extended with an integral differential compensator (DDPG-ID) to cope with this situation. In addition, a comparison of the reinforcement learning control method with common state-of-the-art control methods is listed in Table 1, which shows the pros and cons of these different methods.

Table 1.

Comparison of different control algorithms.

PID control
  Advantages: simple design structure; easy to implement
  Disadvantages: mainly applicable to linear systems; requires full-state feedback; lacks adaptivity

SMC control
  Advantages: simple design structure; easy to implement; high robustness
  Disadvantages: excessive chattering effect; lacks adaptivity

Adaptive control
  Advantages: lower initial cost; lower cost of redundancy; high reliability and performance
  Disadvantages: stability is not treated rigorously; high-gain observers are needed; slow convergence

Backstepping control
  Advantages: global stability; simple design structure; easy to integrate
  Disadvantages: low anti-interference ability; sensitive to system models; lacks adaptivity

RL control
  Advantages: no need for an accurate model; improved control performance; high adaptivity
  Disadvantages: poor anti-interference ability; prone to state errors

In this study, deep reinforcement learning is leveraged to build a novel optimal control scheme for complex systems. An anti-disturbance, stable, and precise control strategy is proposed for the trajectory tracking task of the micropositioner system. The contributions of this work are presented as follows:

  • (1)

    A DDPG-ID algorithm based on deep reinforcement learning is introduced as the basic motion controller of the micropositioner system, which avoids the dependence of traditional control strategies on the accuracy and comprehensiveness of the dynamic model;

  • (2)

    To eliminate the collective effect caused by the lumped disturbances from the micropositioner system and inaccurate estimation of the value function in deep reinforcement learning, an adaptive sliding mode disturbance observer (ASMDO) is proposed;

  • (3)

    An integral differential compensator is introduced in DDPG-ID to compensate for the feedback state of the system, which improves the accuracy and response time of the controller, and further improves the robustness of the controller subject to external disturbances.

The manuscript is structured as follows. Section 2 presents the system description of the micropositioner. In Section 3, we develop a deep reinforcement learning control method combined with the ASMDO and the compensator, and the parameters of the DNNs are illustrated. Then, the simulation setup and tracking results are given in Section 4.1. To further evaluate the performance of the proposed control strategy on the micropositioner, tracking experiments are presented in Section 4.2. Lastly, conclusions are given in Section 5.

2. System Description

The basic structure of the micropositioner is shown in Figure 1; it consists of a base, a platform, and a kinematic device. The kinematic device is composed of an armature, an electromagnetic actuator (EMA), and a chain mechanism driven by the actuator. As shown in Figure 1, the structure contains mutually perpendicular compliant chains actuated by the EMA. The movement of the chain mechanism is determined by the working air gap y. The EMA generates the magnetic force T_m, which can be approximated as:

T_m = k\left(\frac{I_c}{y+p}\right)^2 (1)

where k and p are constant parameters related to the electromagnetic actuator, I_c is the excitation current, and y is the working air gap between the armature and the EMA. Then, the electrical model of the system can be given as:

V_i = R I_c + \frac{d}{dt}\left(H I_c\right) (2)

where V_i is the input voltage of the EMA, R is the resistance of the coil, and H denotes the coil inductance, which can be given as:

H = H_1 + \frac{p H_0}{y+p} (3)

where H_1 is the coil inductance when the air gap is infinite, and H_0 is the incremental inductance when the gap is zero. The motion equation of the micropositioner can be expressed as:

m\frac{d^2 y}{dt^2} = \iota(\alpha_0 - y) - T_m (4)

where ι is the stiffness along the motion direction of the system, and α_0 is the initial air gap.

Figure 1. The diagrammatic model of EMA actuated micropositioner. (a) The front view of micropositioner. (b) The end view of micropositioner. (c) The vertical view of micropositioner.

According to Equations (1)–(4), we define x_1 = y, x_2 = \dot{y}, and x_3 = I_c as the state variables and u = V_i as the control input. Then, the dynamics model of the electromagnetic actuator can be written as:

\dot{x}_1 = x_2
\dot{x}_2 = \frac{\iota}{m}(\alpha_0 - x_1) - \frac{k}{m}\left(\frac{x_3}{x_1+p}\right)^2
\dot{x}_3 = \frac{1}{H}\left(-R x_3 + \frac{H_0 p\, x_2 x_3}{(x_1+p)^2} + u\right) (5)

Defining the variables z_1 = x_1, z_2 = x_2, and z_3 = \frac{\iota}{m}(\alpha_0 - x_1) - \frac{k}{m}\left(\frac{x_3}{x_1+p}\right)^2, we have

\dot{z}_1 = z_2
\dot{z}_2 = z_3
\dot{z}_3 = f(x) + g(x)u (6)

where f(x) = -\frac{\iota x_2}{m} + \frac{2k x_3^2}{m(x_1+p)^2}\left(\frac{H(x_1+p) - p H_0}{H(x_1+p)^2}x_2 + \frac{R}{H}\right), g(x) = -\frac{2k x_3}{H m (x_1+p)^2}, and z_1 is the system output.
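For illustration, the nominal dynamics (5) translate directly into a state-derivative function. The following Python/NumPy sketch is only illustrative; the parameter values (iota, k, p, m, H0, H1, R, alpha0) are placeholders loosely inspired by Table 4 and must be replaced by the identified values of a specific device.

import numpy as np

# Placeholder parameters (illustrative only; substitute identified device values)
iota, k, p, m = 1.8e5, 8.8e-5, 1.1e-5, 0.0272    # stiffness, force constant, pole constant, mass
H0, H1, R, alpha0 = 0.67, 13.21, 43.66, 1.0e-3   # inductances (H), coil resistance (ohm), assumed initial air gap (m)

def ema_dynamics(x, u):
    """Right-hand side of Equation (5): x = [air gap y, velocity, coil current], u = input voltage."""
    x1, x2, x3 = x
    H = H1 + p * H0 / (x1 + p)                    # coil inductance, Equation (3)
    Tm = k * (x3 / (x1 + p)) ** 2                 # magnetic force, Equation (1)
    dx1 = x2
    dx2 = iota / m * (alpha0 - x1) - Tm / m       # motion equation, Equation (4)
    dx3 = (-R * x3 + H0 * p * x2 * x3 / (x1 + p) ** 2 + u) / H
    return np.array([dx1, dx2, dx3])

# One explicit Euler integration step with sampling time Ts
Ts = 0.01
x = np.array([1.0e-3, 0.0, 0.1])                  # illustrative initial state
x = x + Ts * ema_dynamics(x, u=5.0)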

In realistic engineering applications, there always exist uncertainties in the system, so Equation (6) can be rewritten as:

\dot{z}_i = z_{i+1}, \quad i = 1, 2
\dot{z}_3 = f_0(x) + g_0(x)u + \left(\Delta f(x) + \Delta g(x)u\right) + d (7)

where f_0(x) and g_0(x) denote the nominal parts of the micropositioner system, Δf(x) and Δg(x) denote the modeling uncertainties, and d denotes the external disturbance. Then, defining D = (\Delta f(x) + \Delta g(x)u) + d, we have

\dot{z}_i = z_{i+1}, \quad i = 1, 2
\dot{z}_3 = f_0(x) + g_0(x)u + D (8)

where D is the lumped system disturbance. The following assumption is adopted [44]:

Assumption 1.

The lumped disturbance D is bounded, with its upper bound less than a fixed parameter β_1, and the derivative of D is unknown but bounded.

Remark 1.

Assumption 1 is reasonable since all micropositioner platforms are accurately designed with identified parameters, and all disturbances remain within a controllable domain.

3. Design of ASMDO and DDPG-ID Algorithm

In this section, the adaptive sliding mode disturbance observer (ASMDO) is introduced based on the dynamics of the micropositioner. Then, the DDPG-ID control method and pseudocode are given.

3.1. Design of Adaptive Sliding Mode Disturbance Observer

To develop the ASMDO, a virtual dynamic system is first designed as

\dot{\eta}_i = \eta_{i+1}, \quad i = 1, 2
\dot{\eta}_3 = f(z) + g(z)u + \hat{D} + \rho (9)

where η_i (i = 1, 2, 3) are auxiliary variables, \hat{D} is the estimate of the lumped disturbance, and ρ denotes the sliding mode term introduced below.

Define a sliding variable S = \sigma_3 + k_2\sigma_2 + k_1\sigma_1, where \sigma_i = x_i - \eta_i (i = 1, 2, 3), and k_1 and k_2 are positive design parameters. Then, the sliding mode term ρ is designed as

\rho = \lambda_1 S + k_2\sigma_3 + k_1\sigma_2 + \lambda_2\,\mathrm{sgn}(S) (10)

where λ_1 and λ_2 are positive design parameters with \lambda_2 \ge \beta_1.

Choosing an unknown constant β_2 to represent the upper bound of \dot{D}, the ASMDO is proposed as:

\dot{\hat{D}} = k\left(\dot{x}_3 - f_0(z) - g_0(z)u - \hat{D}\right) + (\hat{\beta}_2 + \lambda_3)\,\mathrm{sgn}(\rho) (11)

where k and λ_3 are positive design parameters and \hat{\beta}_2 is the estimate of β_2, given by \dot{\hat{\beta}}_2 = -\delta_0\hat{\beta}_2 + |\rho|, where δ_0 is a small positive number.

Then, the output \hat{D} of the ASMDO is used as a compensation term in the control input to eliminate the uncertainties generated by the system and the external disturbances.
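As a minimal sketch of how the observer (9)–(11) could be discretized, the Python class below applies an explicit Euler rule at each sampling instant. It is written in terms of the transformed state z (i.e., σ_i = z_i − η_i and ż_3 in place of ẋ_3), which is one reading of the mixed x/z notation above, and all gains are placeholder values chosen only for illustration.

import numpy as np

class ASMDO:
    """Adaptive sliding mode disturbance observer, Equations (9)-(11), Euler-discretized."""
    def __init__(self, Ts=0.01, k_o=50.0, k1=10.0, k2=20.0,
                 lam1=5.0, lam2=0.5, lam3=0.1, delta0=0.01):
        self.Ts, self.k_o, self.k1, self.k2 = Ts, k_o, k1, k2
        self.lam1, self.lam2, self.lam3, self.delta0 = lam1, lam2, lam3, delta0
        self.eta = np.zeros(3)          # auxiliary states eta_1..eta_3
        self.D_hat = 0.0                # lumped disturbance estimate
        self.beta2_hat = 0.0            # estimate of the bound on dD/dt

    def step(self, z, dz3, f0, g0, u):
        sigma = z - self.eta                                         # sigma_i = z_i - eta_i
        S = sigma[2] + self.k2 * sigma[1] + self.k1 * sigma[0]       # sliding variable
        rho = (self.lam1 * S + self.k2 * sigma[2] + self.k1 * sigma[1]
               + self.lam2 * np.sign(S))                             # sliding mode term, Equation (10)
        eta_dot = np.array([self.eta[1], self.eta[2], f0 + g0 * u + self.D_hat + rho])
        self.eta = self.eta + self.Ts * eta_dot                      # virtual dynamics, Equation (9)
        D_hat_dot = (self.k_o * (dz3 - f0 - g0 * u - self.D_hat)
                     + (self.beta2_hat + self.lam3) * np.sign(rho))  # adaptive law, Equation (11)
        self.D_hat = self.D_hat + self.Ts * D_hat_dot
        self.beta2_hat = self.beta2_hat + self.Ts * (-self.delta0 * self.beta2_hat + abs(rho))
        return self.D_hat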

Remark 2.

Choosing V_1 = \frac{1}{2}S^2 and V_2 = \frac{1}{2}\left(\tilde{D}^2 + \tilde{\beta}_2^2\right), where \tilde{D} = D - \hat{D} and \tilde{\beta}_2 = \beta_2 - \hat{\beta}_2, as two Lyapunov functions and differentiating V_1 and V_2 with respect to time, it is straightforward to prove that both S and \tilde{D} converge exponentially to the equilibrium point; the proof is therefore omitted.

3.2. Design of DDPG-ID Algorithm for Micropositioner

The goal of reinforcement learning is to obtain a policy for the agent that maximizes the cumulative reward through interactions with the environment. The environment is usually formalized as a Markov decision process (MDP) described by a four-tuple (S, A, P, R), where S, A, P, and R represent the state space of the environment, the set of actions, the state transition probability function, and the reward function, respectively. At each time step t, the agent in the current state s_t \in S takes an action a_t \in A from the policy \pi(a_t|s_t); the agent then acquires a reward r_t \sim R(s_t, a_t) and enters the next state s_{t+1} according to the state transition probability function P(s_{t+1}|s_t, a_t). Based on the Markov property, the Bellman equation of the action–value function Q^\pi(s_t, a_t), which is used for calculating the future expected reward, can be given as:

Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[r_t + \gamma Q^\pi(s_{t+1}, a_{t+1})\right] (12)

where \gamma \in [0, 1] denotes the discount factor.

In the trajectory tracking control task of the micropositioner, the state s_t is a state array describing the air gap y of the micropositioner at time t, and the action a_t is the voltage u applied by the controller to the micropositioner. As shown in Figure 2, DDPG is an actor–critic algorithm, which has an actor and a critic. The actor is responsible for generating actions and interacting with the environment, while the critic evaluates the performance of the actor and guides the action in the next state.

Figure 2. The structure diagram of DDPG-ID algorithm.

The action–value function and the policy are parameterized by DNNs to handle the continuous states and actions of the micropositioner, with Q(s_t, a_t, w^Q) \approx Q^\pi(s_t, a_t) and \pi_{w^\mu}(a_t|s_t) \approx \pi(a_t|s_t), where w^Q and w^\mu are the parameters of the neural networks for the action–value function and the policy function, respectively. With the policy represented by a neural network approximation, the gradient-based update of the network is used to seek the optimal policy π.

DDPG-ID uses a deterministic policy π(s_t, w^μ) rather than a traditional stochastic policy π_{w^μ}(a_t|s_t), where the output of the policy is the action a_t with the highest probability for the current state s_t, i.e., π(s_t, w^μ) = a_t. The policy gradient is given as

\nabla_{w^\mu} J(\pi) = \mathbb{E}_{s \sim \rho^\pi}\left[\nabla_{w^\mu}\pi(s, w^\mu)\,\nabla_a Q(s, a, w^Q)\big|_{a = \pi(s, w^\mu)}\right] (13)

where J(\pi) = \mathbb{E}_\pi\left[\sum_{t=1}^{T}\gamma^{(t-1)} r_t\right] is the expectation of the discounted cumulative reward, T denotes the final time of a whole episode, and \rho^\pi is the distribution of states under the deterministic policy. The value function Q(s_t, a_t, w^Q) is updated by calculating the temporal-difference error (TD-error), which can be defined as

e_{TD} = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})) - Q(s_t, a_t) (14)

where e_{TD} is the TD-error and r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})) represents the TD target value. By minimizing the TD-error, the network parameters are updated through gradient backpropagation.

To avoid the convergence problem of a single network caused by the correlation between the TD target value and the current value [45,46], a target Q network Q^T(s_{t+1}, a_{t+1}, w^{Q'}) is introduced to calculate the network portion of the TD target value, and an online Q network Q^O(s_t, a_t, w^Q) is used to calculate the current value in the critic. These two DNNs have the same structure. The actor also has an online policy network π^O(s_t, w^μ) to generate the current action and a target policy network π^T(s_t, w^{μ'}) to provide the target action a_{t+1}. Here, w^{μ'} and w^{Q'} represent the parameters of the target policy and target Q networks, respectively.

To improve the stability and efficiency of RL training, the experience replay technique is utilized in this work, which saves the transition (s_t, a_t, r_t, s_{t+1}) into the experience replay buffer Ψ at each interaction with the environment for subsequent updates. At each training step, a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) is sampled from the experience replay buffer to calculate the gradients and update the neural networks.
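The experience replay mechanism described above can be sketched in a few lines of Python; the buffer capacity and minibatch size M mirror the hyperparameters later listed in Table 5.

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay buffer storing transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, M=64):
        # Randomly draw a minibatch of M transitions for the gradient updates
        batch = random.sample(self.buffer, M)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states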

An integral differential compensator is developed within the deep reinforcement learning structure to improve the accuracy and responsiveness of the tracking task, as shown in Figure 2. The integral portion of the state is utilized to adjust the control input continuously, which eventually reduces the tracking error. The differential portion is included to reduce system oscillation and speed up stabilization. The proposed compensator is designed as follows:

s_{ID}^t = y_e^t + \alpha\sum_{n=1}^{t} y_e^n + \beta\left(y_e^t - y_e^{t-1}\right) (15)

where s_{ID}^t represents the compensator error at time t, y_e^t = \left\|y_d^t - \hat{y}^t\right\|_2 is the error between the desired trajectory y_d^t and the measured air gap \hat{y}^t at time t, α is the integral gain, and β is the differential gain.

Then, the state s_t at time t can be described as:

s_t = \left[\,s_{ID}^t,\ \hat{y}^t,\ \dot{\hat{y}}^t,\ y_d^t,\ \dot{y}_d^t\,\right]^{T} (16)

where \dot{\hat{y}}^t and \dot{y}_d^t represent the derivatives of \hat{y}^t and y_d^t.
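A minimal sketch of the compensator (15) and the state vector (16) is given below. The integral and differential gains follow Table 5, while approximating the derivatives by backward finite differences is an implementation assumption not specified in the text.

class IDCompensator:
    """Integral differential compensator and state assembly, Equations (15)-(16)."""
    def __init__(self, alpha=0.01, beta=0.001, Ts=0.01):
        self.alpha, self.beta, self.Ts = alpha, beta, Ts
        self.err_sum = 0.0       # running sum of errors (integral part)
        self.err_prev = 0.0      # previous error (differential part)
        self.y_prev = 0.0
        self.yd_prev = 0.0

    def state(self, y_hat, y_d):
        y_e = abs(y_d - y_hat)                        # tracking error magnitude
        self.err_sum += y_e
        s_id = y_e + self.alpha * self.err_sum + self.beta * (y_e - self.err_prev)
        # Finite-difference derivatives of the measured and desired trajectories (assumption)
        y_hat_dot = (y_hat - self.y_prev) / self.Ts
        y_d_dot = (y_d - self.yd_prev) / self.Ts
        self.err_prev, self.y_prev, self.yd_prev = y_e, y_hat, y_d
        return [s_id, y_hat, y_hat_dot, y_d, y_d_dot]  # state s_t, Equation (16)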

The reward function r_t is designed to measure the tracking error:

r_t =
  -4,  y_e^t > 0.005
  +5,  0.003 < y_e^t \le 0.005
  +10, 0.001 < y_e^t \le 0.003
  +18, y_e^t \le 0.001 (17)
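The piecewise reward (17) maps directly to code; a minimal sketch:

def reward(y_e):
    """Piecewise reward of Equation (17); y_e is the tracking error magnitude."""
    if y_e > 0.005:
        return -4.0
    elif y_e > 0.003:     # 0.003 < y_e <= 0.005
        return 5.0
    elif y_e > 0.001:     # 0.001 < y_e <= 0.003
        return 10.0
    return 18.0           # y_e <= 0.001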

As shown in Figure 3, the adaptive sliding mode disturbance observer (ASMDO) is embedded into the DDPG-ID between the actor and the micropositioner system environment. The action a_t applied to the environment is expressed as

a_t = \pi^O(s_t, w^\mu) + \hat{D}_t + N_t (18)

where w^μ denotes the parameters of the online policy network π^O, \hat{D}_t is the disturbance estimate of the micropositioner system at time t, and N_t is Gaussian noise for action exploration.

Figure 3. System signal flow chart.

3.2.1. Critic Update

After sampling M transitions (s_j, a_j, r_j, s_{j+1}) from the experience replay buffer Ψ, the Q values are calculated. The online Q network is responsible for calculating the current Q value as follows:

Q^O(s_j, a_j, w^Q) = w^Q\phi(s_j, a_j) (19)

where \phi(s_j, a_j) represents the input of the online Q network, which is a feature vector consisting of the state s_j and the action a_j.

The target Q network Q^T is defined as:

Q^T\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'}), w^{Q'}\right) = w^{Q'}\phi\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'})\right) (20)

where \phi\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'})\right) is the input of the target Q network, which is a feature vector consisting of the state s_{j+1} and the target policy network output \pi^T(s_{j+1}, w^{\mu'}).

For the target policy network π^T, the equation is:

\pi^T(s_{j+1}, w^{\mu'}) = w^{\mu'} s_{j+1} (21)

Then, we rewrite the target Q value Q^T as:

Q^T = r_j + \gamma Q^T\left(s_{j+1}, \pi^T(s_{j+1}, w^{\mu'}), w^{Q'}\right) (22)

where r_j is the reward from the selected samples.

Since M transitions (s_j, a_j, r_j, s_{j+1}) are sampled from the experience buffer Ψ, the loss function for updating the critic is shown in Equation (23):

L(w^Q) = \frac{1}{M}\sum_{j=1}^{M}\left(Q^T - Q^O(s_j, a_j, w^Q)\right)^2 (23)

where L(w^Q) is the loss value of the critic.

In order to smooth the target network update process, a soft update is applied instead of periodically copying the parameters:

w^{Q'} \leftarrow \tau w^Q + (1 - \tau)w^{Q'} (24)

where τ is the update factor, usually a small constant.
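Putting Equations (22)–(24) together, one critic update can be sketched as follows. The controller in this paper was implemented in MATLAB/Simulink, so the PyTorch code below is only an illustrative reimplementation; critic, critic_target, actor_target, and critic_opt are assumed to be predefined network and optimizer objects.

import torch
import torch.nn.functional as F

def critic_update(batch, critic, critic_target, actor_target, critic_opt,
                  gamma=0.99, tau=0.05):
    """One critic update: TD target (22), MSE loss (23), soft update (24)."""
    s, a, r, s_next = batch   # tensors of shape (M, ...) sampled from the replay buffer

    with torch.no_grad():
        a_next = actor_target(s_next)                          # target policy action
        q_target = r + gamma * critic_target(s_next, a_next)   # Equation (22)

    q_online = critic(s, a)
    loss = F.mse_loss(q_online, q_target)                      # Equation (23)

    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()

    # Soft update of the target critic, Equation (24)
    for p_t, p in zip(critic_target.parameters(), critic.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)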

The diagram of the Q network is shown in Figure 4, which is a parallel neural network. The Q network includes both a state portion and an action portion, and the output value of the Q network depends on both the state and the action. The state portion consists of a state input layer, three fully connected layers, and two ReLU layers inserted between the fully connected layers. The action portion contains an action input layer and one fully connected layer. The outputs of these two portions are combined and fed into the common part of the network, which contains a ReLU layer and one output layer.

Figure 4. The diagram of Q network.

The parameters of each layer in the Q network are shown in Table 2.

Table 2.

Q network parameters.

Network Layer Name Number of Nodes
StateLayer 5
CriticStateFC1 120
CriticStateFC2 60
CriticStateFC3 60
ActionInput 1
CriticActionFC1 60
addLayer 2
CriticOutput 1
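Following Figure 4 and the layer sizes of Table 2, the parallel critic network can be sketched in PyTorch as shown below; the layer names in Table 2 suggest a MATLAB implementation, so this is only an equivalent illustration, and the element-wise addition is assumed to realize the addLayer.

import torch.nn as nn

class Critic(nn.Module):
    """Parallel Q network of Figure 4 with the layer sizes of Table 2."""
    def __init__(self, state_dim=5, action_dim=1):
        super().__init__()
        # State branch: three fully connected layers with ReLU layers in between
        self.state_branch = nn.Sequential(
            nn.Linear(state_dim, 120), nn.ReLU(),   # CriticStateFC1
            nn.Linear(120, 60), nn.ReLU(),          # CriticStateFC2
            nn.Linear(60, 60),                      # CriticStateFC3
        )
        # Action branch: one fully connected layer
        self.action_branch = nn.Linear(action_dim, 60)   # CriticActionFC1
        # Common part: ReLU followed by the scalar Q output
        self.common = nn.Sequential(nn.ReLU(), nn.Linear(60, 1))   # CriticOutput

    def forward(self, state, action):
        # The two branches are merged (addLayer) before entering the common part
        merged = self.state_branch(state) + self.action_branch(action)
        return self.common(merged)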

3.2.2. Actor Update

The output of the online policy network is

\pi^O = w^\mu s_j (25)

Because a deterministic policy is used, the calculation of the policy gradient involves no integral over the action a; compared with the stochastic policy case, it instead involves the derivative of the value function Q^O with respect to the action a. The gradient formula can be rewritten as follows:

\nabla_{w^\mu} J \approx \frac{1}{M}\sum_{j}^{M}\left(\nabla_{a_j} Q^O(s_j, a_j, w^Q)\,\nabla_{w^\mu}\pi^O(s_j, w^\mu)\right) (26)

where the weights w^μ are updated with the gradient backpropagation method. The target policy network is also updated with the soft update pattern as follows:

w^{\mu'} \leftarrow \tau w^\mu + (1 - \tau)w^{\mu'} (27)

where τ is the update factor, usually a small constant.
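In practice, the sampled gradient (26) is usually realized by minimizing the loss −(1/M)Σ_j Q^O(s_j, π^O(s_j, w^μ)), whose gradient with respect to w^μ equals Equation (26) by the chain rule. A PyTorch sketch under the same assumptions as the critic sketch above:

def actor_update(states, actor, actor_target, critic, actor_opt, tau=0.05):
    """One actor update: sampled policy gradient (26) and soft update (27)."""
    # Minimizing -Q ascends the sampled deterministic policy gradient of Equation (26)
    loss = -critic(states, actor(states)).mean()

    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

    # Soft update of the target policy network, Equation (27)
    for p_t, p in zip(actor_target.parameters(), actor.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)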

Figure 5 shows the diagram of the policy network in this paper, which contains a state input layer, a fully connected layer, a tanh layer, and an output layer. The parameters of each layer in the policy network are shown in Table 3.

Figure 5. The diagram of policy network.

Table 3.

Policy network parameters.

Network Layer Name Number of Nodes
StateLayer 5
ActorFC1 30
ActorOutput 1
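A matching PyTorch sketch of the policy network of Figure 5 and Table 3 (again only an illustrative reimplementation):

import torch.nn as nn

class Actor(nn.Module):
    """Policy network of Figure 5 with the layer sizes of Table 3."""
    def __init__(self, state_dim=5, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 30),   # ActorFC1
            nn.Tanh(),                  # tanh layer
            nn.Linear(30, action_dim),  # ActorOutput
        )

    def forward(self, state):
        return self.net(state)

In a complete implementation, the raw output would additionally be scaled or clipped to the admissible voltage range (roughly 0–11 V according to Section 4), and the disturbance compensation \hat{D}_t and the exploration noise N_t of Equation (18) would be added outside the network.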

The pseudocode of Algorithm 1 is given as follows:

Algorithm 1 DDPG-ID Algorithm.
1:  Randomly initialize the online Q network with weights w^Q
2:  Randomly initialize the online policy network with weights w^μ
3:  Initialize the target Q network by w^{Q'} ← w^Q
4:  Initialize the target policy network by w^{μ'} ← w^μ
5:  Initialize the experience replay buffer Ψ
6:  Load the simplified micropositioner dynamic model
7:  for episode = 1, MaxEpisode do
8:      Initialize a noise process N for exploration
9:      Initialize the ASMDO and the ID compensator
10:     Randomly initialize the micropositioner states
11:     Receive the initial observation state s_1
12:     for step = 1, T do
13:         Select action a_t = π^O(s_t) + \hat{D}_t + N_t
14:         Use a_t to run the micropositioner system model
15:         Process the errors with the integral differential compensator
16:         Receive reward r_t and new state s_{t+1}
17:         Store transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer Ψ
18:         Randomly sample a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) from Ψ
19:         Set Q^T = r_j + γ Q^T(s_{j+1}, π^T(s_{j+1}, w^{μ'}), w^{Q'})
20:         Minimize the loss L(w^Q) = \frac{1}{M}\sum_{j=1}^{M}(Q^T − Q^O(s_j, a_j, w^Q))^2 to update the online Q network
21:         Update the online policy network with the sampled policy gradient:
            ∇_{w^μ} J = \frac{1}{M}\sum_{j}^{M}(∇_{a_j} Q^O(s_j, a_j, w^Q) ∇_{w^μ} π^O(s_j, w^μ))
22:         Update the target networks: w^{Q'} ← τ w^Q + (1 − τ) w^{Q'}, w^{μ'} ← τ w^μ + (1 − τ) w^{μ'}
23:     end for
24: end for

4. Simulation and Experimental Results

In this section, two kinds of periodic external disturbances were added to verify the practicability of the proposed ASMDO, and three distinct desired trajectories were utilized to evaluate the performance of the proposed deep reinforcement learning control strategy. A traditional DDPG algorithm and a well-tuned PID strategy were adopted for comparison. To further verify the spatial performance of the proposed algorithm, two different trajectories were introduced in the experiments.

4.1. Simulation Results

The parametric equations of the two kinds of periodic external disturbances are defined as d_1 = 0.1\sin(2\pi t) + 0.1\sin(0.5\pi t + \frac{\pi}{3}) and d_2 = 0.1 + 0.1\sin(0.5\pi t + \frac{\pi}{3}). Based on the micropositioner model proposed in [44], the effectiveness of the observer is presented in Figure 6 and Figure 7.
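For reference, the two disturbance signals translate directly into code (a trivial NumPy sketch):

import numpy as np

def d1(t):
    # d1 = 0.1 sin(2*pi*t) + 0.1 sin(0.5*pi*t + pi/3)
    return 0.1 * np.sin(2 * np.pi * t) + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)

def d2(t):
    # d2 = 0.1 + 0.1 sin(0.5*pi*t + pi/3)
    return 0.1 + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)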

Figure 6. Observation result of ASMDO with d2. (a) Observing result based on the ASMDO. (b) Observing error based on the ASMDO.

Figure 7. Observation result of ASMDO with d1. (a) Observing result based on the ASMDO. (b) Observing error based on the ASMDO.

The disturbance estimation results of the proposed ASMDO are presented in Figure 6a and Figure 7a; it can be seen that the observer tracks the given disturbances rapidly. The estimation errors are less than 0.01 mm in Figure 6b and Figure 7b, which shows the effectiveness of the ASMDO as interference compensation.

The dynamics model of the micropositioner is given in Section 2, and its basic system model parameters are taken from our previous research [44,47], as shown in Table 4. The DDPG algorithm is configured with the same neural network structure and training parameters as DDPG-ID in this paper. The training parameters of DDPG-ID and DDPG are shown in Table 5.

Table 4.

Parameters of the micropositioner model.

Notation Value Unit
L1 13.21 H
L0 0.67 H
a 1.11 × 10^−5 m
R 43.66 Ω
c 8.83 × 10^−5 N·m^2·A^−2
k 1.803 × 10^5 N·m^−1
m 0.0272 kg

Table 5.

Training parameters of DDPG-ID and DDPG.

Hyperparameters Value
Learning rate for actor φ1 0.001
Learning rate for critic φ2 0.001
Discount factor γ 0.99
Initial exploration ε 1
Experience replay buffer size ψ 100,000
Minibatch size M 64
Max episode ϖ 1500
Soft update factor τ 0.05
Max exploration steps T 250 (25 s)
Time step Ts 0.01 s
Integral gain α 0.01
Differential gain β 0.001
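For convenience, the hyperparameters of Table 5 can be collected in a single Python configuration dictionary (values copied from the table; the variable name is arbitrary):

ddpg_id_config = {
    "actor_lr": 1e-3,          # learning rate for actor
    "critic_lr": 1e-3,         # learning rate for critic
    "gamma": 0.99,             # discount factor
    "initial_exploration": 1.0,
    "buffer_size": 100_000,    # experience replay buffer size
    "minibatch_size": 64,
    "max_episodes": 1500,
    "tau": 0.05,               # soft update factor
    "max_steps": 250,          # 25 s per episode
    "Ts": 0.01,                # time step (s)
    "alpha": 0.01,             # integral gain
    "beta": 0.001,             # differential gain
}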

The first desired trajectory designed for tracking control simulation is a waved signal. According to the initial conditions, the parametric equation of the waved trajectory is defined as:

y_d(t) = 0.985 - 0.015\sin\left(\frac{\pi t}{4} - \frac{\pi}{2}\right) (28)

The training processes of DDPG-ID and DDPG are both run on the same model with stochastically initialized micropositioner states. During the training evaluation, a larger episode reward indicates a more accurate, lower-error control policy. As shown in Figure 8, DDPG-ID reaches the maximum reward score within fewer episodes than DDPG, which reveals that the DDPG-ID algorithm converges faster than the DDPG algorithm. Comparing Figure 8a with Figure 8b, the average reward of the DDPG-ID training process is larger than that of DDPG in the steady state, which further indicates that the policy learned by the DDPG-ID algorithm has better performance. The trained algorithms are then employed for the tracking control simulation experiments of the micropositioner system.

Figure 8. The training rewards of two RL schemes. (a) The training rewards generated by DDPG-ID. (b) The training rewards generated by DDPG.

The tracking results of the waved trajectory are shown in Figure 9. The RMSE, MAX, and MEAN values of the tracking errors for the three control methods are provided in Table 6. In terms of tracking accuracy, the trained DDPG-ID controller performs better than DDPG and PID, with a smaller state error and a smoother tracking trajectory. The tracking error of the DDPG-ID algorithm ranges from −8 × 10^−4 to 9 × 10^−4 mm, which is almost half that of the DDPG policy. Meanwhile, the DDPG controller has a smaller tracking error than PID. A large oscillation is induced by the PID controller, which would affect the hardware to a certain extent in actual operation. This oscillating input signal is much larger than a normal control input signal, which typically ranges from 0 to 11 V. Based on the characteristics of reinforcement learning, it is hard for a well-trained policy to generate such a shock signal.

Figure 9. Tracking results comparison of the waved trajectory. (a) Tracking results comparison based on three control schemes. (b) Tracking error comparison based on three control schemes. (c) Control input comparison based on three control schemes.

Table 6.

Tracking errors comparison of different controllers in the waved trajectory.

Controller RMSE (mm) MAX (mm) MEAN (mm)
DDPG-ID 3.658 × 10^−4 4.758 × 10^−4 1.003 × 10^−4
DDPG 1.093 × 10^−3 2.615 × 10^−3 4.414 × 10^−4
PID 1.654 × 10^−3 3.144 × 10^−4 3.104 × 10^−4

The tracking results for a periodic trajectory are illustrated in Figure 10, and the tracking error comparison of the three control methods is given in Table 7. The parametric equation of the periodic trajectory is defined as

y_d(t) = 0.981 - 0.015\sin\left(\frac{\pi t}{4} - \frac{\pi}{2}\right) + 0.008\sin\left(\frac{\pi t}{2} - \frac{\pi}{16}\right). (29)

As can be seen in these figures, the tracking error of DDPG-ID for the periodic trajectory is still smaller than that of the other controllers, ranging from −1.6 × 10^−4 to 9 × 10^−4 mm. Similar to the previous waved trajectory, the control input based on DDPG shows better performance in terms of oscillations.

Figure 10. Tracking results comparison of the periodic trajectory. (a) Tracking results comparison based on three control schemes. (b) Tracking error comparison based on three control schemes. (c) Control input comparison based on three control schemes.

Table 7.

Tracking errors comparison of different controllers in the periodic trajectory.

Controller RMSE (mm) MAX (mm) MEAN (mm)
DDPG-ID 4.272 × 10^−4 8.471 × 10^−4 5.404 × 10^−5
DDPG 1.545 × 10^−3 3.102 × 10^−3 1.610 × 10^−4
PID 1.923 × 10^−3 3.376 × 10^−3 3.311 × 10^−4

To further demonstrate the universality of the DDPG-ID policy, a periodic step trajectory is also utilized for comparison. A step signal with a period of 8 s is designed as the desired trajectory, as shown in Figure 11a. The well-tuned PID controller was also tested on this step trajectory; since intense oscillations emerge and its performance is extremely poor, the PID results are not shown in this paper.

Figure 11. Tracking results comparison of the step trajectory. (a) Tracking results comparison based on two control schemes. (b) Tracking error comparison based on two control schemes. (c) Control input comparison based on two control schemes.

According to Figure 11, the tracking result of the DDPG-ID algorithm remains stable, with the tracking error bounded between −2 × 10^−4 and 9 × 10^−4 mm, which is still about half of that of DDPG. Due to the characteristics of the step signal, the state error becomes large during each step transition. The errors of DDPG-ID and DDPG are observed to drop quickly after each step transition. It can be seen from Table 8 that the errors of the DDPG-ID algorithm are substantially smaller than those of the DDPG algorithm. As for the control inputs, the value of DDPG still fluctuates considerably even after the state has converged.

Table 8.

Tracking errors comparison of different controllers in the step trajectory.

Controller RMSE (mm) MAX (mm) MEAN (mm)
DDPG-ID 4.612 × 10^−3 0.02953 6.938 × 10^−4
DDPG 5.279 × 10^−3 0.02986 1.437 × 10^−3

According to the above simulation results, it can be concluded that the control policy of DDPG-ID successfully deals with the collective effect caused by disturbances and the inaccurate value estimation of deep reinforcement learning, compared with DDPG. The comparison results also demonstrate the excellent control performance of the policy learned by the DDPG-ID algorithm.

4.2. Experimental Results

The speed, acceleration, and direction of the designed trajectories vary with time, which makes the experimental results more trustworthy. In each test, the EMA in the micropositioner is regulated to track the desired path of the working air gap.

As shown in Figure 12, a laser displacement sensor is utilized to detect the motion states. The DDPG-ID algorithm was then executed on a SimLab board through MATLAB/Simulink. The EMA controls the movement of the chain mechanism by executing the control signal from the analog output port of the SimLab board, while the analog input port of the SimLab board is connected to the signal output of the laser displacement sensor.

Figure 12. The schematic diagram of experiment system.

Figure 13 shows the tracking experiment results for the waved trajectory. The micropositioner first reaches the starting point along a straight track at a speed of 5.6 μm/s. At time 5 s, it begins to track the desired waved trajectory for three periods, where the waved trajectory is described as y_d(t) = 28 + 25\sin\left(\frac{\pi t}{10} + \frac{\pi}{2}\right). The tracking error fluctuates within ±1.5 μm, as demonstrated in Figure 13b. Except for several particular points in time, the tracking errors remain within ±1 μm.

Figure 13. Tracking results of the waved trajectory. (a) Tracking result of desired trajectory. (b) Tracking error of desired trajectory.

Another periodic trajectory tracking experiment was also executed. As shown in Figure 14, the desired periodic trajectory starts at time 5 s and is defined as y_d(t) = 35 - 25\sin\left(\frac{\pi t}{7.5} - \frac{2\pi}{3}\right) - 5\sin\left(\frac{\pi t}{15} + \frac{\pi}{6}\right). The tracking error of the periodic trajectory still remains within ±1.5 μm.

Figure 14. Tracking results of the periodic trajectory. (a) Tracking result of desired trajectory. (b) Tracking error of desired trajectory.

The experimental results show that the proposed DDPG-ID algorithm is able to closely track above two trajectories. Compared with the simulation results, the tracking error does not increase significantly, and it can be maintained between −1 μm and +1 μm.

5. Conclusions and Future Works

In this paper, a composite controller is developed based on an adaptive sliding mode disturbance observer and a deep reinforcement learning control scheme. A deep deterministic policy gradient is utilized to obtain the optimal control performance. To improve the tracking accuracy and transient response time, an integral differential compensator is applied during the learning process in the actor–critic framework. An adaptive sliding mode disturbance observer is developed to further reduce the influence of modeling uncertainty, external disturbances, and the effect of the inaccurate value function. In comparison with the existing DDPG and the most commonly used PID controller, the simulated trajectory tracking results indicate the satisfactory performance and precision of the control policy based on the DDPG-ID algorithm. The tracking errors are less than 1 μm, which shows the significant tracking efficiency of the proposed methods. The experimental results also indicate the high accuracy and strong anti-interference capability of the proposed deep reinforcement learning control scheme. In future work, to further improve the tracking performance and realize micro-manipulation tasks, specific operation experiments will be performed, such as cell manipulation and micro-assembly.

Abbreviations

PID Proportional–integral–derivative control
RBFNN Radial basis function neural network
RL Reinforcement learning
SARSA State-Action-Reward-State-Action
Q The Value of Action in reinforcement learning
DRL Deep reinforcement learning
DNN Deep neural networks
DQN Deep Q network
PG Policy gradient
DDPG Deep deterministic policy gradient
ID Integral differential compensator
Tm The magnetic force
y The working air gap in micropositioner
Ic The excitation current in micropositioner
EMA The electromagnetic actuator
Vi The input voltage of the electromagnetic actuator
R The resistance of the coil in micropositioner
H The coil inductance in micropositioner
u The control input
D The lumped system disturbance
ASMDO Adaptive Sliding Mode Disturbance Observer
st The state at time t in reinforcement learning
at The action at time t in reinforcement learning
rt The reward at time t in reinforcement learning
ReLU Rectified linear unit activation function
tanh Hyperbolic tangent activation function

Author Contributions

Writing—original draft preparation, S.L., R.X., X.X. and Z.Y.; writing—review and editing, S.L. and R.X.; data collection, S.L. and R.X.; visualization, S.L., R.X., X.X. and Z.Y.; supervision, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the Science and Technology Development Fund, Macau SAR (Grant No. 0018/2019/AKP and SKL-IOTSC(UM)-2021-2023), in part by the Ministry of Science and Technology of China (Grant No. 2019YFB1600700), in part by the Guangdong Science and Technology Department (Grant No. 2018B030324002 and 2020B1515130001), in part by the Zhuhai Science and Technology Innovation Bureau (Grant no. ZH22017002200001PWC), Jiangsu Science and Technology Department (Grant No. BZ2021061), and in part by the University of Macau (Grant No. MYRG2020-00253-FST).

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Català-Castro F., Martín-Badosa E. Positioning Accuracy in Holographic Optical Traps. Micromachines. 2021;12:559. doi: 10.3390/mi12050559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Bettahar H., Clévy C., Courjal N., Lutz P. Force-Position Photo-Robotic Approach for the High-Accurate Micro-Assembly of Photonic Devices. IEEE Robot. Autom. Lett. 2020;5:6396–6402. doi: 10.1109/LRA.2020.3014634. [DOI] [Google Scholar]
  • 3.Cox L.M., Martinez A.M., Blevins A.K., Sowan N., Ding Y., Bowman C.N. Nanoimprint lithography: Emergent materials and methods of actuation. Nano Today. 2020;31:100838. doi: 10.1016/j.nantod.2019.100838. [DOI] [Google Scholar]
  • 4.Dai C., Zhang Z., Lu Y., Shan G., Wang X., Zhao Q., Ru C., Sun Y. Robotic manipulation of deformable cells for orientation control. IEEE Trans. Robot. 2019;36:271–283. doi: 10.1109/TRO.2019.2946746. [DOI] [Google Scholar]
  • 5.Zhang P., Yang Z. A robust adaboost. rt based ensemble extreme learning machine. Math. Probl. Eng. 2015;2015:260970. doi: 10.1155/2015/260970. [DOI] [Google Scholar]
  • 6.Yang Z., Wong P., Vong C., Zong J., Liang J. Simultaneous-fault diagnosis of gas turbine generator systems using a pairwise-coupled probabilistic classifier. Math. Probl. Eng. 2013;2013:827128. doi: 10.1155/2013/827128. [DOI] [Google Scholar]
  • 7.Wang D., Zhou L., Yang Z., Cui Y., Wang L., Jiang J., Guo L. A new testing method for the dielectric response of oil-immersed transformer. IEEE Trans. Ind. Electron. 2019;67:10833–10843. doi: 10.1109/tie.2019.2959500. [DOI] [Google Scholar]
  • 8.Roshandel N., Soleymanzadeh D., Ghafarirad H., Koupaei A.S. A modified sensorless position estimation approach for piezoelectric bending actuators. Mech. Syst. Signal Process. 2021;149:107231. doi: 10.1016/j.ymssp.2020.107231. [DOI] [Google Scholar]
  • 9.Ding B., Yang Z.X., Xiao X., Zhang G. Design of reconfigurable planar micro-positioning stages based on function modules. IEEE Access. 2019;7:15102–15112. doi: 10.1109/ACCESS.2019.2894619. [DOI] [Google Scholar]
  • 10.García-Martínez J.R., Cruz-Miguel E.E., Carrillo-Serrano R.V., Mendoza-Mondragón F., Toledano-Ayala M., Rodríguez-Reséndiz J. A PID-type fuzzy logic controller-based approach for motion control applications. Sensors. 2020;20:5323. doi: 10.3390/s20185323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Salehi Kolahi M.R., Gharib M.R., Heydari A. Design of a non-singular fast terminal sliding mode control for second-order nonlinear systems with compound disturbance. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2021;235:7343–7352. doi: 10.1177/09544062211032990. [DOI] [Google Scholar]
  • 12.Nguyen M.H., Dao H.V., Ahn K.K. Adaptive Robust Position Control of Electro-Hydraulic Servo Systems with Large Uncertainties and Disturbances. Appl. Sci. 2022;12:794. doi: 10.3390/app12020794. [DOI] [Google Scholar]
  • 13.Cruz-Miguel E.E., García-Martínez J.R., Rodríguez-Reséndiz J., Carrillo-Serrano R.V. A new methodology for a retrofitted self-tuned controller with open-source fpga. Sensors. 2020;20:6155. doi: 10.3390/s20216155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Montalvo V., Estévez-Bén A.A., Rodríguez-Reséndiz J., Macias-Bobadilla G., Mendiola-Santíbañez J.D., Camarillo-Gómez K.A. FPGA-Based Architecture for Sensing Power Consumption on Parabolic and Trapezoidal Motion Profiles. Electronics. 2020;9:1301. doi: 10.3390/electronics9081301. [DOI] [Google Scholar]
  • 15.García-Martínez J.R., Rodríguez-Reséndiz J., Cruz-Miguel E.E. A new seven-segment profile algorithm for an open source architecture in a hybrid electronic platform. Electronics. 2019;8:652. doi: 10.3390/electronics8060652. [DOI] [Google Scholar]
  • 16.Fei J., Fang Y., Yuan Z. Adaptive Fuzzy Sliding Mode Control for a Micro Gyroscope with Backstepping Controller. Micromachines. 2020;11:968. doi: 10.3390/mi11110968. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Ruan W., Dong Q., Zhang X., Li Z. Friction Compensation Control of Electromechanical Actuator Based on Neural Network Adaptive Sliding Mode. Sensors. 2021;21:1508. doi: 10.3390/s21041508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Gharib M.R., Koochi A., Ghorbani M. Path tracking control of electromechanical micro-positioner by considering control effort of the system. Proc. Inst. Mech. Eng. Part I J. Syst. Control Eng. 2021;235:984–991. doi: 10.1177/0959651820953275. [DOI] [Google Scholar]
  • 19.Han M., Tian Y., Zhang L., Wang J., Pan W. Reinforcement learning control of constrained dynamic systems with uniformly ultimate boundedness stability guarantee. Automatica. 2021;129:109689. doi: 10.1016/j.automatica.2021.109689. [DOI] [Google Scholar]
  • 20.de Orio R.L., Ender J., Fiorentini S., Goes W., Selberherr S., Sverdlov V. Optimization of a spin-orbit torque switching scheme based on micromagnetic simulations and reinforcement learning. Micromachines. 2021;12:443. doi: 10.3390/mi12040443. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Adda C., Laurent G.J., Le Fort-Piat N. Learning to control a real micropositioning system in the STM-Q framework; Proceedings of the 2005 IEEE International Conference on Robotics and Automation; Barcelona, Spain. 18–22 April 2005; pp. 4569–4574. [Google Scholar]
  • 22.Li J., Li Z., Chen J. Reinforcement learning based precise positioning method for a millimeters-sized omnidirectional mobile microrobot; Proceedings of the International Conference on Intelligent Robotics and Applications; Wuhan, China. 15–17 October 2008; pp. 943–952. [Google Scholar]
  • 23.Shi H., Shi L., Sun G., Hwang K.S. Adaptive Image-Based Visual Servoing for Hovering Control of Quad-Rotor. IEEE Trans. Cogn. Dev. Syst. 2019;12:417–426. doi: 10.1109/TCDS.2019.2908923. [DOI] [Google Scholar]
  • 24.Zheng N., Ma Q., Jin M., Zhang S., Guan N., Yang Q., Dai J. Abdominal-waving control of tethered bumblebees based on sarsa with transformed reward. IEEE Trans. Cybern. 2018;49:3064–3073. doi: 10.1109/TCYB.2018.2838595. [DOI] [PubMed] [Google Scholar]
  • 25.Tang L., Yang Z.X., Jia K. Canonical correlation analysis regularization: An effective deep multiview learning baseline for RGB-D object recognition. IEEE Trans. Cogn. Dev. Syst. 2018;11:107–118. doi: 10.1109/TCDS.2018.2866587. [DOI] [Google Scholar]
  • 26.Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M. Playing atari with deep reinforcement learning. arXiv. 2013. arXiv:1312.5602. [Google Scholar]
  • 27.Sutton R.S., McAllester D.A., Singh S.P., Mansour Y. Policy gradient methods for reinforcement learning with function approximation; Proceedings of the Advances in Neural Information Processing Systems; Denver, CO, USA. 27–30 November 2000; pp. 1057–1063. [Google Scholar]
  • 28.Silver D., Lever G., Heess N., Degris T., Wierstra D., Riedmiller M. Deterministic policy gradient algorithms; Proceedings of the International Conference on Machine Learning (PMLR); Bejing, China. 22–24 June 2014; pp. 387–395. [Google Scholar]
  • 29.Lillicrap T.P., Hunt J.J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D. Continuous control with deep reinforcement learning. arXiv. 2015. arXiv:1509.02971. [Google Scholar]
  • 30.Latifi K., Kopitca A., Zhou Q. Model-free control for dynamic-field acoustic manipulation using reinforcement learning. IEEE Access. 2020;8:20597–20606. doi: 10.1109/ACCESS.2020.2969277. [DOI] [Google Scholar]
  • 31.Leinen P., Esders M., Schütt K.T., Wagner C., Müller K.R., Tautz F.S. Autonomous robotic nanofabrication with reinforcement learning. Sci. Adv. 2020;6:eabb6987. doi: 10.1126/sciadv.abb6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Mnih V., Kavukcuoglu K., Silver D., Rusu A.A., Veness J., Bellemare M.G., Graves A., Riedmiller M., Fidjeland A.K., Ostrovski G., et al. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236. [DOI] [PubMed] [Google Scholar]
  • 33.Zeng Y., Wang G., Xu B. A basal ganglia network centric reinforcement learning model and its application in unmanned aerial vehicle. IEEE Trans. Cogn. Dev. Syst. 2017;10:290–303. doi: 10.1109/TCDS.2017.2649564. [DOI] [Google Scholar]
  • 34.Guo X., Yan W., Cui R. Event-triggered reinforcement learning-based adaptive tracking control for completely unknown continuous-time nonlinear systems. IEEE Trans. Cybern. 2019;50:3231–3242. doi: 10.1109/TCYB.2019.2903108. [DOI] [PubMed] [Google Scholar]
  • 35.Zhang J., Shi P., Xia Y., Yang H., Wang S. Composite disturbance rejection control for Markovian Jump systems with external disturbances. Automatica. 2020;118:109019. doi: 10.1016/j.automatica.2020.109019. [DOI] [Google Scholar]
  • 36.Ahmed S., Wang H., Tian Y. Adaptive high-order terminal sliding mode control based on time delay estimation for the robotic manipulators with backlash hysteresis. IEEE Trans. Syst. Man Cybern. Syst. 2019;51:1128–1137. doi: 10.1109/TSMC.2019.2895588. [DOI] [Google Scholar]
  • 37.Chen M., Xiong S., Wu Q. Tracking flight control of quadrotor based on disturbance observer. IEEE Trans. Syst. Man Cybern. Syst. 2019;51:1414–1423. doi: 10.1109/TSMC.2019.2896891. [DOI] [Google Scholar]
  • 38.Zhao Z., He X., Ahn C.K. Boundary disturbance observer-based control of a vibrating single-link flexible manipulator. IEEE Trans. Syst. Man Cybern. Syst. 2019;51:2382–2390. doi: 10.1109/TSMC.2019.2912900. [DOI] [Google Scholar]
  • 39.Alibekov E., Kubalík J., Babuška R. Policy derivation methods for critic-only reinforcement learning in continuous spaces. Eng. Appl. Artif. Intell. 2018;69:178–187. doi: 10.1016/j.engappai.2017.12.004. [DOI] [Google Scholar]
  • 40.Hasselt H. Double Q-learning. Adv. Neural Inf. Process. Syst. 2010;23:2613–2621. [Google Scholar]
  • 41.Zhang S., Sun C., Feng Z., Hu G. Trajectory-Tracking Control of Robotic Systems via Deep Reinforcement Learning; Proceedings of the 2019 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM); Bangkok, Thailand. 18–20 November 2019; pp. 386–391. [Google Scholar]
  • 42.Kiumarsi B., Vamvoudakis K.G., Modares H., Lewis F.L. Optimal and autonomous control using reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2017;29:2042–2062. doi: 10.1109/TNNLS.2017.2773458. [DOI] [PubMed] [Google Scholar]
  • 43.Yang X., Zhang H., Wang Z. Policy Gradient Reinforcement Learning for Parameterized Continuous-Time Optimal Control; Proceedings of the 2021 33rd Chinese Control and Decision Conference (CCDC); Kunming, China. 22–24 May 2021; pp. 59–64. [Google Scholar]
  • 44.Xiao X., Xi R., Li Y., Tang Y., Ding B., Ren H., Meng M.Q.H. Design and control of a novel electromagnetic actuated 3-DoFs micropositioner. Microsyst. Technol. 2021;27:1–10. doi: 10.1007/s00542-020-05163-3. [DOI] [Google Scholar]
  • 45.Tommasino P., Caligiore D., Mirolli M., Baldassarre G. A reinforcement learning architecture that transfers knowledge between skills when solving multiple tasks. IEEE Trans. Cogn. Dev. Syst. 2016;11:292–317. [Google Scholar]
  • 46.Srikant R., Ying L. Finite-time error bounds for linear stochastic approximation andtd learning; Proceedings of the Conference on Learning Theory (PMLR); Phoenix, AZ, USA. 25–28 June 2019; pp. 2803–2830. [Google Scholar]
  • 47.Feng Z., Ming M., Ling J., Xiao X., Yang Z.X., Wan F. Fractional delay filter based repetitive control for precision tracking: Design and application to a piezoelectric nanopositioning stage. Mech. Syst. Signal Process. 2022;164:108249. doi: 10.1016/j.ymssp.2021.108249. [DOI] [Google Scholar]
