Algorithm 1 PPO-Based Training Algorithm
Input: discount factor γ; clipping ratio ε; number of update epochs L; number of training steps E; critic network v; actor network π_θ; behavior actor network π_{θ_old}, where θ_old = θ; entropy loss coefficient f_e; value function loss coefficient f_v; policy loss coefficient f_p.
1    Initialize π_θ, π_{θ_old}, and v;
2    for e = 1 to E
3        Pick N independent scheduling instances from distribution D;
4        for n = 1 to N
5              for t = 1 to …
6                  Sample a_{n,t} from π_{θ_old}(a_{n,t} | s_{n,t});
7                  Receive reward r_{n,t} and next state s_{n,t+1};
8                  Compute the advantage estimate Â_{n,t} and the probability ratio r_{n,t}(θ):
9                  Â_{n,t} = Σ_t γ^t r_{n,t} − V(s_{n,t});
10                    r_{n,t}(θ) = π_θ(a_{n,t} | s_{n,t}) / π_{θ_old}(a_{n,t} | s_{n,t});
11                    while s_{n,t} is terminal do
12                          break;
13                  end while
14            end for
15            Compute the policy loss L_n^{PPO}(θ), the value function loss L_n^{critic}, and the entropy loss L_n^S(θ):
16            L_n^{PPO}(θ) = Σ_t min( r_{n,t}(θ) Â_{n,t}, clip(r_{n,t}(θ), 1−ε, 1+ε) Â_{n,t} );
17            L_n^{critic} = Σ_t ( v(s_{n,t}) − Â_{n,t} )²;
18            L_n^S(θ) = Σ_t S(π_θ(a_{n,t} | s_{n,t})), where S(·) is the entropy;
19            Total loss: L_n(θ) = f_p L_n^{PPO}(θ) − f_v L_n^{critic} + f_e L_n^S(θ);
20        end for
21        for l = 1 to L
22            Update θ with the cumulative loss using the Adam optimizer:
23            θ ← Adam(Σ_{n=1}^{N} L_n(θ));
24        end for
25        θ_old ← θ;
26    end for
27    Output: trained parameter set θ.
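
The following PyTorch sketch illustrates one possible realization of the loss computation (lines 8-19) and the update step (lines 20-25) of Algorithm 1. It is a minimal sketch under stated assumptions, not the paper's implementation: the helper names ppo_losses and ppo_update are hypothetical, the policy is assumed to be a discrete Categorical distribution over scheduling actions, and the advantage of line 9 is read as a discounted return minus the critic's value estimate.

import torch

def ppo_losses(actor, critic, actor_old, states, actions, rewards, gamma, eps):
    """Losses for one scheduling instance n (Algorithm 1, lines 8-18)."""
    T = rewards.shape[0]
    # Discounted return used for the advantage estimate A_hat_{n,t} (line 9).
    returns = torch.zeros(T)
    running = torch.tensor(0.0)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    values = critic(states).squeeze(-1)            # v(s_{n,t}); critic output assumed shape (T, 1)
    advantages = (returns - values).detach()       # A_hat_{n,t}

    # Probability ratio r_{n,t}(theta) = pi_theta(a|s) / pi_theta_old(a|s) (line 10).
    dist = torch.distributions.Categorical(logits=actor(states))
    with torch.no_grad():
        dist_old = torch.distributions.Categorical(logits=actor_old(states))
    ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))

    # Clipped surrogate (line 16), critic loss as written in line 17, entropy bonus (line 18).
    l_ppo = torch.min(ratio * advantages,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).sum()
    l_critic = ((values - advantages) ** 2).sum()
    l_entropy = dist.entropy().sum()
    return l_ppo, l_critic, l_entropy

def ppo_update(actor, critic, actor_old, optimizer, instances,
               gamma, eps, f_p, f_v, f_e, L):
    """Cumulative loss over the N collected instances and L Adam epochs (lines 20-25)."""
    for _ in range(L):                                 # l = 1 .. L
        total_loss = torch.tensor(0.0)
        for states, actions, rewards in instances:     # n = 1 .. N
            l_ppo, l_critic, l_entropy = ppo_losses(
                actor, critic, actor_old, states, actions, rewards, gamma, eps)
            # Line 19 defines an objective to maximize; its negation is minimized here.
            total_loss = total_loss - (f_p * l_ppo - f_v * l_critic + f_e * l_entropy)
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
    # Line 25: synchronize the behavior policy pi_theta_old with pi_theta.
    actor_old.load_state_dict(actor.state_dict())

In this sketch, optimizer is assumed to be a torch.optim.Adam instance over the joint actor and critic parameters, and instances is the list of N trajectories (states, actions, rewards) collected with π_{θ_old} in lines 3-14; a GAE-style advantage estimator could replace the simple discounted-return reading of line 9 without changing the rest of the update.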