Algorithm 1 PPO-Based Training Algorithm

Input: discounting factor $\gamma$; clipping ratio $\epsilon$; update epoch $L$; number of training steps $E$; critic network $v_{\phi}$; actor network $\pi_{\theta}$; behavior actor network $\pi_{\theta_{old}}$, where $\theta_{old} = \theta$; entropy loss coefficient $c_e$; value function loss coefficient $c_v$; policy loss coefficient $c_p$.
1  Initialize $\theta$, $\theta_{old}$, and $\phi$;
2  for e = 1 to E do
3      Pick N independent scheduling instances from distribution D;
4      for n = 1 to N do
5          for t = 1 to … do
6              Sample action $a_t$ based on $\pi_{\theta_{old}}(a_t \mid s_t)$;
7              Receive reward $r_t$ and next state $s_{t+1}$;
8              Compute the advantage estimate $\hat{A}_t$ and the probability ratio $\rho_t(\theta)$:
9                  $\hat{A}_t = r_t + \gamma v_{\phi}(s_{t+1}) - v_{\phi}(s_t)$;
10                 $\rho_t(\theta) = \pi_{\theta}(a_t \mid s_t) \,/\, \pi_{\theta_{old}}(a_t \mid s_t)$;
11             if $s_{t+1}$ is terminal then
12                 break;
13             end if
14         end for
15         Compute the policy loss $L_p(\theta)$, the value function loss $L_v(\phi)$, and the entropy loss $L_e(\theta)$:
16             $L_p(\theta) = \sum_t \min\!\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)$;
17             $L_v(\phi) = \sum_t \big(v_{\phi}(s_t) - \hat{R}_t\big)^2$, where $\hat{R}_t = r_t + \gamma v_{\phi}(s_{t+1})$ is the bootstrapped return target;
18             $L_e(\theta) = \sum_t H\big(\pi_{\theta}(\cdot \mid s_t)\big)$, where $H$ is entropy;
19         Total loss: $L(\theta, \phi) = -c_p L_p(\theta) + c_v L_v(\phi) - c_e L_e(\theta)$;
20     end for
21     for l = 1 to L do
22         Update $\theta$ and $\phi$ with the cumulative loss by the Adam optimizer:
23             $\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta, \phi)$, $\phi \leftarrow \phi - \eta \nabla_{\phi} L(\theta, \phi)$, with learning rate $\eta$;
24     end for
25     $\theta_{old} \leftarrow \theta$;
26 end for
Output: trained parameter set $\{\theta, \phi\}$.
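To make the update concrete, below is a minimal PyTorch-style sketch of steps 15-25, assuming a discrete action space. The `actor` and `critic` modules, the rollout tensors, and all hyperparameter defaults are illustrative assumptions, not the paper's implementation; the advantage is the one-step estimate from step 9.

# Minimal sketch of the PPO update in Algorithm 1 (steps 15-25).
# Assumptions (not from the paper): discrete actions, PyTorch,
# hypothetical `actor`/`critic` modules and pre-collected rollout tensors.
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def ppo_update(actor, critic, optimizer, states, actions, rewards, next_states,
               old_log_probs, gamma=0.99, eps=0.2, c_p=1.0, c_v=0.5, c_e=0.01,
               update_epochs=4):
    """One round of PPO updates over a collected rollout."""
    for _ in range(update_epochs):                       # step 21: l = 1 to L
        values = critic(states).squeeze(-1)              # v_phi(s_t)
        next_values = critic(next_states).squeeze(-1).detach()

        # Step 9: one-step advantage A_t = r_t + gamma*v(s_{t+1}) - v(s_t)
        advantages = (rewards + gamma * next_values - values).detach()

        dist = Categorical(logits=actor(states))
        log_probs = dist.log_prob(actions)
        # Step 10: probability ratio rho_t = pi_theta / pi_theta_old
        ratios = torch.exp(log_probs - old_log_probs)

        # Step 16: clipped surrogate policy objective
        surr1 = ratios * advantages
        surr2 = torch.clamp(ratios, 1 - eps, 1 + eps) * advantages
        policy_obj = torch.min(surr1, surr2).sum()

        # Step 17: value loss against the bootstrapped return target
        returns = rewards + gamma * next_values
        value_loss = F.mse_loss(values, returns, reduction='sum')

        # Step 18: entropy bonus
        entropy = dist.entropy().sum()

        # Step 19: total loss; maximized terms enter with a minus sign
        loss = -c_p * policy_obj + c_v * value_loss - c_e * entropy

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                 # step 23: Adam step

After the L update epochs, step 25 corresponds to syncing the behavior policy ($\theta_{old} \leftarrow \theta$), i.e., recomputing `old_log_probs` with the updated actor before collecting the next rollout.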