Micromachines. 2022 Mar 17;13(3):458. doi: 10.3390/mi13030458
Algorithm 1 DDPG-ID Algorithm.
 1:  Randomly initialize the online Q network with weights w_Q
 2:  Randomly initialize the online policy network with weights w_μ
 3:  Initialize the target Q network by w_Q′ ← w_Q
 4:  Initialize the target policy network by w_μ′ ← w_μ
 5:  Initialize the experience replay buffer Ψ
 6:  Load the simplified micropositioner dynamic model
 7:  for episode = 1, MaxEpisode do
 8:      Initialize a noise process N for exploration
 9:      Initialize the ASMDO and ID compensator
10:      Randomly initialize the micropositioner states
11:      Receive the initial observation state s_1
12:      for step = 1, T do
13:          Select action a_t = π_O(s_t) + D̂_t + N_t
14:          Use a_t to run the micropositioner system model
15:          Process errors with the integral differential compensator
16:          Receive reward r_t and new state s_{t+1}
17:          Store the transition (s_t, a_t, r_t, s_{t+1}) in replay buffer Ψ
18:          Randomly sample a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) from Ψ
19:          Set Q_T = r_j + γ Q_T(s_{j+1}, π_T(s_{j+1}, w_μ′), w_Q′)
20:          Minimize the loss L(w_Q) = (1/M) Σ_{j=1}^{M} (Q_T − Q_O(s_j, a_j, w_Q))² to update the online Q network
21:          Update the online policy network with the sampled policy gradient:
                 ∇_{w_μ} J = (1/M) Σ_j ∇_{a_j} Q_O(s_j, a_j, w_Q) ∇_{w_μ} π_O(s_j, w_μ)
22:          Update the target networks: w_Q′ ← τ w_Q + (1−τ) w_Q′, w_μ′ ← τ w_μ + (1−τ) w_μ′
23:      end for
24:  end for
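The inner-loop updates of the algorithm (steps 19–22: target bootstrapping, critic regression, policy gradient, and Polyak target update) can be sketched in a minimal form. The snippet below uses tiny linear stand-ins for the actor and critic; the dimensions, learning rates, and linear models are illustrative assumptions, not the paper's neural-network implementation, and the micropositioner-specific terms (ASMDO, the D̂_t disturbance estimate, the ID compensator) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear stand-ins for the networks (assumed for illustration only).
STATE_DIM, ACTION_DIM = 3, 1
w_Q = rng.normal(size=STATE_DIM + ACTION_DIM)    # online critic weights w_Q
w_mu = rng.normal(size=(ACTION_DIM, STATE_DIM))  # online actor weights w_mu
w_Q_t = w_Q.copy()                               # target critic weights w_Q'
w_mu_t = w_mu.copy()                             # target actor weights w_mu'

def Q(w, s, a):
    """Linear critic: Q(s, a) = w · [s, a]."""
    return np.concatenate([s, a], axis=-1) @ w

def pi(w, s):
    """Linear actor: a = w s."""
    return s @ w.T

GAMMA, TAU, LR = 0.99, 0.005, 1e-2  # discount, Polyak rate, step size (assumed)

def ddpg_update(batch):
    """One DDPG update from a minibatch (s, a, r, s_next), as in steps 19-22."""
    global w_Q, w_mu, w_Q_t, w_mu_t
    s, a, r, s_next = batch
    # Step 19: bootstrap the target with the TARGET actor and critic.
    a_next = pi(w_mu_t, s_next)
    q_target = r + GAMMA * Q(w_Q_t, s_next, a_next)
    # Step 20: one gradient step on the mean-squared TD error.
    feats = np.concatenate([s, a], axis=-1)
    td = q_target - feats @ w_Q
    w_Q = w_Q + LR * feats.T @ td / len(td)
    # Step 21: deterministic policy gradient — for a linear critic,
    # dQ/da is just the action part of w_Q, and d(pi)/d(w_mu) is s.
    dQ_da = w_Q[STATE_DIM:]
    w_mu = w_mu + LR * np.outer(dQ_da, s.mean(axis=0))
    # Step 22: Polyak (soft) update of both target networks.
    w_Q_t = TAU * w_Q + (1 - TAU) * w_Q_t
    w_mu_t = TAU * w_mu + (1 - TAU) * w_mu_t
    return float(np.mean(td ** 2))

# One synthetic minibatch of M = 32 transitions.
M = 32
batch = (rng.normal(size=(M, STATE_DIM)),
         rng.normal(size=(M, ACTION_DIM)),
         rng.normal(size=M),
         rng.normal(size=(M, STATE_DIM)))
loss_before = ddpg_update(batch)
loss_after = ddpg_update(batch)
```

Because the target networks move only at rate τ, the regression target in step 19 is nearly fixed between consecutive updates, so repeated gradient steps on the same minibatch shrink the TD error — the stabilizing design choice behind DDPG's twin target networks.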