Algorithm 1 PPO-Based Training Algorithm
Input: discount factor γ; clipping ratio ε; number of update epochs L; number of training steps E; critic network v; actor network π_θ; behavior actor network π_{θ_old}, where θ_old = θ; entropy loss coefficient f_e; value function loss coefficient f_v; policy loss coefficient f_p.
1    Initialize π_θ, π_{θ_old}, and v;
2    for e = 1 to E
3        Pick N independent scheduling instances from distribution D;
4        for n = 1 to N
5              for t = 1 to …
6                  Sample a_{n,t} from π_{θ_old}(a_{n,t} | s_{n,t});
7                  Receive reward r_{n,t} and next state s_{n,t+1};
8                  Compute the advantage estimate Â_{n,t} and the probability ratio r_{n,t}(θ):
9                  Â_{n,t} = Σ_t γ^t r_{n,t} − V(s_{n,t});
10                    r_{n,t}(θ) = π_θ(a_{n,t} | s_{n,t}) / π_{θ_old}(a_{n,t} | s_{n,t});
11                    while s_{n,t} is terminal do
12                          break;
13                  end while
14            end for
15            Compute the policy loss L_n^{PPO}(θ), the value function loss L_n^{critic}, and the entropy loss L_n^S(θ):
16            L_n^{PPO}(θ) = Σ_t min( r_{n,t}(θ) Â_{n,t}, clip(r_{n,t}(θ), 1−ε, 1+ε) Â_{n,t} );
17            L_n^{critic} = Σ_t ( v(s_{n,t}) − Â_{n,t} )²;
18            L_n^S(θ) = Σ_t S(π_θ(a_{n,t} | s_{n,t})), where S(·) is the entropy;
19            Total loss: L_n(θ) = f_p L_n^{PPO}(θ) − f_v L_n^{critic} + f_e L_n^S(θ);
20        end for
21        for l = 1 to L
22            Update θ with the cumulative loss using the Adam optimizer:
23            θ ← Adam(Σ_{n=1}^{N} L_n(θ));
24        end for
25        θ_old ← θ;
26    end for
27    Output: trained parameter set θ.
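
The following PyTorch sketch illustrates one possible realization of the loss computation (lines 8-19) and the update step (lines 20-25) of Algorithm 1. It is a minimal sketch under stated assumptions, not the paper's implementation: the helper names ppo_losses and ppo_update are hypothetical, the policy is assumed to be a discrete Categorical distribution over scheduling actions, and the advantage of line 9 is read as a discounted return minus the critic's value estimate.

import torch

def ppo_losses(actor, critic, actor_old, states, actions, rewards, gamma, eps):
    """Losses for one scheduling instance n (Algorithm 1, lines 8-18)."""
    T = rewards.shape[0]
    # Discounted return used for the advantage estimate A_hat_{n,t} (line 9).
    returns = torch.zeros(T)
    running = torch.tensor(0.0)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    values = critic(states).squeeze(-1)            # v(s_{n,t}); critic output assumed shape (T, 1)
    advantages = (returns - values).detach()       # A_hat_{n,t}

    # Probability ratio r_{n,t}(theta) = pi_theta(a|s) / pi_theta_old(a|s) (line 10).
    dist = torch.distributions.Categorical(logits=actor(states))
    with torch.no_grad():
        dist_old = torch.distributions.Categorical(logits=actor_old(states))
    ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))

    # Clipped surrogate (line 16), critic loss as written in line 17, entropy bonus (line 18).
    l_ppo = torch.min(ratio * advantages,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).sum()
    l_critic = ((values - advantages) ** 2).sum()
    l_entropy = dist.entropy().sum()
    return l_ppo, l_critic, l_entropy

def ppo_update(actor, critic, actor_old, optimizer, instances,
               gamma, eps, f_p, f_v, f_e, L):
    """Cumulative loss over the N collected instances and L Adam epochs (lines 20-25)."""
    for _ in range(L):                                 # l = 1 .. L
        total_loss = torch.tensor(0.0)
        for states, actions, rewards in instances:     # n = 1 .. N
            l_ppo, l_critic, l_entropy = ppo_losses(
                actor, critic, actor_old, states, actions, rewards, gamma, eps)
            # Line 19 defines an objective to maximize; its negation is minimized here.
            total_loss = total_loss - (f_p * l_ppo - f_v * l_critic + f_e * l_entropy)
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
    # Line 25: synchronize the behavior policy pi_theta_old with pi_theta.
    actor_old.load_state_dict(actor.state_dict())

In this sketch, optimizer is assumed to be a torch.optim.Adam instance over the joint actor and critic parameters, and instances is the list of N trajectories (states, actions, rewards) collected with π_{θ_old} in lines 3-14; a GAE-style advantage estimator could replace the simple discounted-return reading of line 9 without changing the rest of the update.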