Biomimetics. 2024 Jun 8;9(6):346. doi: 10.3390/biomimetics9060346
Algorithm 2 DDPG with HER and Adaptive Exploration
1:  Initialize online policy network $\mu_\theta$ with weights $\theta$
2:  Initialize target policy network $\mu_{\theta'}$ with weights $\theta' \leftarrow \theta$
3:  Initialize online Q network $Q_\phi$ with weights $\phi$
4:  Initialize target Q network $Q_{\phi'}$ with weights $\phi' \leftarrow \phi$
5:  Initialize experience replay pool $D$ to capacity $N$
6:  for episode = 1, M do
7:      Receive initial observation state $s_1$
8:      for t = 1, T do
9:          Obtain $\mu$ from the adaptive exploration adjustment unit
10:         Calculate the exploration noise
11:         Select action $a_t = \mu_\theta(s_t) + \text{exploration noise}$
12:         Execute action $a_t$ in the environment
13:         Observe reward $r_t$ and new state $s_{t+1}$
14:         Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
15:         Store modified transitions with alternative goals
16:         Sample a random mini-batch of $K$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $D$
17:         Set $y_i = r_i + \gamma\, Q_{\phi'}\big(s_{i+1}, \mu_{\theta'}(s_{i+1})\big)$
18:         Update $\phi$ by minimizing the loss: $L = \frac{1}{K}\sum_i \big(y_i - Q_\phi(s_i, a_i)\big)^2$
19:         Update $\theta$ using the sampled policy gradient:
            $\nabla_\theta J \approx \frac{1}{K}\sum_i \nabla_a Q_\phi(s, a)\big|_{s = s_i,\, a = \mu_\theta(s_i)}\, \nabla_\theta \mu_\theta(s)\big|_{s_i}$
20:         Update the target networks:
            $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$
            $\phi' \leftarrow \tau\phi + (1 - \tau)\phi'$
21:     end for
22: end for
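
Steps 9–11 and 15 are the two additions to vanilla DDPG: exploration noise whose magnitude comes from the adaptive exploration adjustment unit, and hindsight relabeling of stored transitions with alternative goals. The listing leaves both as one-line steps, so the Python sketch below only illustrates one common way to realize them; the "future" relabeling strategy, the Gaussian noise model, and the helper names (adaptive_exploration_noise, relabel_with_her, compute_reward) are assumptions for illustration, not the authors' implementation.

import numpy as np

rng = np.random.default_rng()

def adaptive_exploration_noise(action_dim, noise_scale):
    # Step 10 (assumed Gaussian model): the scale is whatever the adaptive
    # exploration adjustment unit returned in step 9.
    return rng.normal(0.0, noise_scale, size=action_dim)

def relabel_with_her(episode, compute_reward, k_future=4):
    # Step 15 (assumed "future" HER strategy): for each transition, reuse
    # states achieved later in the same episode as substitute goals and
    # recompute the reward against them, yielding extra transitions for D.
    relabeled = []
    for t, (s, a, r, s_next, goal) in enumerate(episode):
        future_steps = rng.integers(t, len(episode), size=k_future)
        for idx in future_steps:
            new_goal = episode[idx][3]              # achieved state used as goal
            new_reward = compute_reward(s_next, new_goal)
            relabeled.append((s, a, new_reward, s_next, new_goal))
    return relabeled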
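
Steps 16–20 are the standard DDPG update. The PyTorch sketch below shows one such update under stated assumptions: the actor, critic, and their targets are torch.nn.Module instances, the critic takes (state, action) pairs, and the mini-batch arrives as tensors; the hyperparameter values for gamma ($\gamma$) and tau ($\tau$) are placeholders, and terminal-state masking is omitted for brevity.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, actor_target, critic, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch   # each a tensor with leading batch dimension K

    # Step 17: TD target y_i = r_i + gamma * Q_phi'(s_{i+1}, mu_theta'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Step 18: update phi by minimizing L = (1/K) * sum_i (y_i - Q_phi(s_i, a_i))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 19: deterministic policy gradient -- maximizing Q_phi(s, mu_theta(s))
    # over the mini-batch follows the sampled gradient in the listing.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 20: Polyak (soft) update of the target networks with rate tau.
    with torch.no_grad():
        for p, p_targ in zip(actor.parameters(), actor_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
        for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)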