2024 Mar 14;24(6):1870. doi: 10.3390/s24061870
Algorithm 1 REINFORCE
  •  1:

    Input: learning rate α, discount factor γ

  •  2:

    Initialize environment E

  •  3:

    Initialize policy parameters θ

  •  4:

    for episode in 1 … N do

  •  5:

      Use $\pi(s \mid \theta)$ to collect $|E|$ trajectories: $S_0, A_0, R_0, \dots, R_T$

  •  6:

      G=0

  •  7:

      for t = T − 1 … 0 do

  •  8:

        $G = R_t + \gamma G$

  •  9:

        Compute entropy regularization $ER_t = \alpha\, \pi(A_t \mid S_t) \log \pi(A_t \mid S_t)$

  • 10:

        $\widehat{\nabla J(\theta_t)} = \gamma^t G\, \nabla \log \pi(A_t \mid S_t, \theta_t) - ER_t$

  • 11:

        $\theta_{t+1} = \theta_t + \alpha\, \widehat{\nabla J(\theta_t)}$

  • 12:

      end for

  • 13:

    end for
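
The inner loop of Algorithm 1 can be sketched in plain Python for one already-collected trajectory. This is a minimal illustration under stated assumptions, not the authors' implementation: the policy is assumed to be a tabular softmax over logits `theta[s][a]`, the entropy weight is a separate hypothetical coefficient `beta` (the listing reuses α), and the entropy term is applied as the gradient of the state's policy entropy rather than as the scalar $ER_t$ of the listing. The function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities of a softmax policy."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def discounted_returns(rewards, gamma):
    """Backward recursion G = R_t + gamma * G (steps 6 to 8)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def reinforce_episode(theta, states, actions, rewards, alpha, gamma, beta=0.0):
    """Per-step REINFORCE updates over one trajectory (steps 7 to 11).

    Assumption: theta[s][a] are softmax logits; beta is a hypothetical
    entropy coefficient that does not appear in the listing.
    """
    returns = discounted_returns(rewards, gamma)
    for t in reversed(range(len(rewards))):
        s, a = states[t], actions[t]
        logps = log_softmax(theta[s])
        probs = [math.exp(lp) for lp in logps]
        entropy = -sum(p * lp for p, lp in zip(probs, logps))
        for b, (p, lp) in enumerate(zip(probs, logps)):
            # d log pi(a|s) / d theta[s][b] for a softmax policy
            grad_logp = (1.0 if b == a else 0.0) - p
            # d H(pi(.|s)) / d theta[s][b], used here as the entropy bonus
            grad_ent = -p * (lp + entropy)
            theta[s][b] += alpha * (gamma ** t * returns[t] * grad_logp
                                    + beta * grad_ent)
    return theta

# One state, two actions; a single step where action 1 earned reward 1.
theta = [[0.0, 0.0]]
reinforce_episode(theta, states=[0], actions=[1], rewards=[1.0],
                  alpha=0.5, gamma=0.9)
print(theta)
```

Starting from a uniform policy, this single update moves the logits to about -0.25 and 0.25, so the rewarded action becomes more likely. The entropy gradient is zero at the uniform policy, so `beta` has no effect on this particular step.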