Entropy. 2022 Mar 25;24(4):455. doi: 10.3390/e24040455
Algorithm 1 Training procedure in the reward learning process
Input: number of full episodes K, timesteps T, fixed parameters θ_old, target firing rate f_0, regularization hyper-parameters µ_v, µ_e, µ_firing, bandwidth σ, predicted value function V_θ(t,k), and sum of future rewards R(t,k)
Output: total loss L_θ.
1.  Parameters setting: f_0, µ_v, µ_e, µ_firing and σ.
2.  for n in batch size N:
3.      Set e_n = R(t,k) − V_θ(t,k)
4.      if number of iterations is 0:
5.          (φ_0, φ_−1, φ_1) = (N, 0, 0)
6.      else:
            (φ_0, φ_−1, φ_1) = (#{e_n ∈ (−0.5, 0.5)}, #{e_n ∈ (−1, −0.5)}, #{e_n ∈ (0.5, 1)}),
            where #{·} indicates counting the samples that satisfy the condition
7.      (u_n, v_n, s_n) = (exp(−e_n²/(2σ²)), exp(−(e_n+1)²/(2σ²)), exp(−(e_n−1)²/(2σ²)))
8.      L_n^RMEE = φ_0 · u_n · e_n² + φ_−1 · v_n · (e_n+1)² + φ_1 · s_n · (e_n−1)²
9.  end for
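Steps 2–9 above can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the function name, the final mean reduction over the batch, and the strict-inequality handling of the counting intervals are assumptions not fixed by the listing.

```python
import numpy as np

def rmee_loss(returns, values, sigma=0.5, first_iter=False):
    """Sketch of the per-batch RMEE loss (steps 2-9 of Algorithm 1).

    returns, values: arrays holding R(t,k) and V_theta(t,k) for N samples.
    The weights phi count how many errors fall near each kernel centre
    (0, -1, +1); on the first iteration all weight goes to the centre 0.
    """
    e = returns - values                      # step 3: error e_n
    N = e.shape[0]
    if first_iter:                            # steps 4-5: no counts yet
        phi0, phi_m1, phi_p1 = N, 0, 0
    else:                                     # step 6: count errors per interval
        phi0 = np.sum((e > -0.5) & (e < 0.5))
        phi_m1 = np.sum((e > -1.0) & (e < -0.5))
        phi_p1 = np.sum((e > 0.5) & (e < 1.0))
    # step 7: Gaussian kernels centred at 0, -1 and +1
    u = np.exp(-e**2 / (2 * sigma**2))
    v = np.exp(-(e + 1)**2 / (2 * sigma**2))
    s = np.exp(-(e - 1)**2 / (2 * sigma**2))
    # step 8: count-weighted, kernel-scaled squared errors
    loss_n = phi0 * u * e**2 + phi_m1 * v * (e + 1)**2 + phi_p1 * s * (e - 1)**2
    return loss_n.mean()
```

The kernels u_n, v_n and s_n down-weight samples far from their centre, which is what gives the minimum-error-entropy criterion its robustness to outlier errors compared with a plain squared loss.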

10.  for k in K:
11.      for t in T:
             L(t,k)^PPO = O^PPO(θ_old, θ, t, k)
12.      end for
13.  end for
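The listing does not spell out the objective O^PPO(θ_old, θ, t, k); assuming it is the standard clipped PPO surrogate (a guess, since PPO admits other variants), one per-step evaluation could look like:

```python
import numpy as np

def ppo_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard clipped PPO surrogate, assumed here for O^PPO(theta_old, theta, t, k).

    logp_new, logp_old: log-probabilities of the taken actions under the
    current and the fixed (old) policy parameters.
    """
    ratio = np.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO maximises the surrogate; return its negative as a loss term
    return -np.minimum(ratio * advantage, clipped * advantage).mean()
```

With θ = θ_old the ratio is 1, so the clipping is inactive and the loss reduces to the negative mean advantage.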

14.  Calculate the total loss:
         L(e) = L_p(e) + J_k(e)
15.  return L(e)
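Steps 14–15 combine the two training signals into one scalar. The listing does not define L_p and J_k, so the sketch below simply treats them as the accumulated value-side (RMEE) term and policy-side (PPO) term; that mapping, and the plain unweighted sum, are assumptions.

```python
def total_loss(value_term, policy_terms):
    """Sketch of steps 14-15: L(e) = L_p(e) + J_k(e), read here as a
    value-fitting part plus the sum of per-step policy losses.
    The identification of L_p and J_k with these terms is an assumption."""
    return value_term + sum(policy_terms)
```

In practice this scalar would be backpropagated through θ each update while θ_old stays fixed until the next rollout.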