Algorithm 1 Discretized soft-actor-critic algorithm.
 1: Initialize the experience buffer D
 2: Initialize the temperature weight α to 1
 3: Initialize the actor network π_φ with random parameters φ
 4: Initialize the main critic networks Q_θi with random parameters θi for i ∈ {1, 2}
 5: Initialize the target critic networks Q_θ̂i with parameters θ̂i copied from the main critic networks Q_θi
 6: for each training episode do
 7:     Observe the initial state s_t
 8:     for each step t = 1, 2, ..., T do
 9:         Generate the action a_t = π_φ(s_t)
10:         Execute the action a_t
11:         Observe the next state s_{t+1} and the reward r_t
12:         Store the experience (s_t, a_t, r_t, s_{t+1}) in the experience buffer D
13:         Sample a mini-batch D̄ of experiences from the buffer D
14:         Calculate the target state value V_θ̂(s) based on Equation (9)
15:         Update the main critic networks Q_θi based on the gradient ∇_θi J_Q(θi) in Equation (10)
16:         Update the actor network π_φ based on the gradient ∇_φ J_π(φ) in Equation (12)
17:         Update the weight α based on the gradient ∇_α J(α) in Equation (13)
18:         Every B steps, soft-update the target critic networks based on Equation (14)
19:     end for
20: end for
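The sketch below is a minimal PyTorch rendering of one update step of Algorithm 1 (lines 13-18). It is not the paper's implementation: the network sizes, learning rates, hyperparameters (STATE_DIM, N_ACTIONS, GAMMA, TAU), the mini-batch interface, and the target-entropy heuristic are all illustrative assumptions, and the loss expressions are the standard discrete-SAC forms, which the references to Equations (9), (10), (12), (13), and (14) are assumed to match.

```python
# Minimal discrete-SAC update sketch; all sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA, TAU = 4, 6, 0.99, 0.005
# Common discrete-SAC heuristic for the entropy target (an assumption, not from the paper).
TARGET_ENTROPY = -0.98 * torch.log(torch.tensor(1.0 / N_ACTIONS))

def mlp(out_dim):
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor = mlp(N_ACTIONS)                          # pi_phi: state -> action logits (line 3)
critics = [mlp(N_ACTIONS) for _ in range(2)]    # Q_theta_i, i in {1, 2} (line 4)
targets = [mlp(N_ACTIONS) for _ in range(2)]    # Q_theta_hat_i, copied from main critics (line 5)
for t_net, m_net in zip(targets, critics):
    t_net.load_state_dict(m_net.state_dict())

log_alpha = torch.zeros(1, requires_grad=True)  # alpha = exp(0) = 1 (line 2)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opts = [torch.optim.Adam(c.parameters(), lr=3e-4) for c in critics]
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update(batch):
    # batch: (s [B, STATE_DIM], a [B] long, r [B], s2 [B, STATE_DIM], done [B] float),
    # sampled from the buffer D (lines 12-13).
    s, a, r, s2, done = batch
    alpha = log_alpha.exp().detach()

    # Target state value V_theta_hat(s') (cf. Equation (9)): expectation under the
    # discrete policy of the minimum target Q minus the entropy term.
    with torch.no_grad():
        probs2 = F.softmax(actor(s2), dim=-1)
        log_probs2 = torch.log(probs2 + 1e-8)
        q2 = torch.min(targets[0](s2), targets[1](s2))
        v2 = (probs2 * (q2 - alpha * log_probs2)).sum(dim=-1)
        target_q = r + GAMMA * (1.0 - done) * v2

    # Main critic update (cf. Equation (10)): regress Q toward the soft Bellman target.
    for critic, opt in zip(critics, critic_opts):
        q = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q, target_q)
        opt.zero_grad(); loss.backward(); opt.step()

    # Actor update (cf. Equation (12)): minimize E[alpha * log pi - Q] over the policy.
    probs = F.softmax(actor(s), dim=-1)
    log_probs = torch.log(probs + 1e-8)
    q_min = torch.min(critics[0](s), critics[1](s)).detach()
    actor_loss = (probs * (alpha * log_probs - q_min)).sum(dim=-1).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Temperature update (cf. Equation (13)): raise alpha when entropy is below target.
    entropy = -(probs * log_probs).sum(dim=-1).detach()
    alpha_loss = (log_alpha.exp() * (entropy - TARGET_ENTROPY)).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()

def soft_update():
    # Target critic update every B steps (cf. Equation (14)):
    # theta_hat <- TAU * theta + (1 - TAU) * theta_hat.
    with torch.no_grad():
        for t_net, m_net in zip(targets, critics):
            for tp, mp in zip(t_net.parameters(), m_net.parameters()):
                tp.mul_(1.0 - TAU).add_(TAU * mp)
```

In a training loop, `update` would be called once per environment step with a mini-batch drawn from the buffer, and `soft_update` every B steps, mirroring lines 8-18 of the algorithm.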