Sensors. 2021 Feb 2;21(3):1019. doi: 10.3390/s21031019
Algorithm 1. The A2C-based training method.

1:  Initialize actor and critic networks π_θ, V_ϕ
2:  for epoch = 1 : EP do
3:      Perform new job arrival at time zero (current system time)
4:      Get current state s_t
5:      while step = 1 : n do  \\ n is the number of jobs for the selected instance
6:          Determine an action a_t based on probability π_θ(s_t) at state s_t
7:          Select a job j from BF using action a_t, process job j on all machines, and obtain the finish time of job j on each machine
8:          Push the system time forward only to the time when job j finishes on M1
9:          Perform new job arrival and update WIP at the current system time
10:         Get the next state s_{t+1} and reward r_t
11:         Store the transition {s_t, a_t, r_t, s_{t+1}} of this step
12:         s_t ← s_{t+1}
13:         if step % T == 0 then
14:             Calculate the discounted reward dr_t of the T steps in reverse order using the stored transitions: dr_t = r_t + γ·V_ϕ(s_{t+1}) for the T-th step; dr_t = r_t + γ·dr_{t+1} for the first T−1 steps
15:             Update the critic network V_ϕ using the gradient ∇_ϕ (dr_t − V_ϕ(s_t))²
16:             Update the actor network π_θ using the gradient ∇_θ log π_θ(a_t | s_t)[dr_t − V_ϕ(s_t)] + β·∇_θ H(π_θ(s_t; θ))
17:         end if
18:     end while
19: end for
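The core numerical step of the loop above is line 14: the n-step discounted returns dr_t are computed in reverse, bootstrapping the last step from the critic's value of s_{t+1}. A minimal NumPy sketch of that computation (the function name `discounted_returns` and the toy reward/value numbers are illustrative, not from the paper):

```python
import numpy as np

def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute dr_t for a block of T steps in reverse order (Algorithm 1, line 14).

    The T-th step bootstraps from the critic's estimate of V(s_{t+1});
    each earlier step uses dr_t = r_t + gamma * dr_{t+1}.
    """
    T = len(rewards)
    dr = np.zeros(T)
    dr[-1] = rewards[-1] + gamma * bootstrap_value  # T-th step
    for t in range(T - 2, -1, -1):                  # first T-1 steps, in reverse
        dr[t] = rewards[t] + gamma * dr[t + 1]
    return dr

# Toy usage: T = 3 stored rewards, critic value for the final next state = 0.5
rewards = [1.0, 0.0, 2.0]
dr = discounted_returns(rewards, bootstrap_value=0.5, gamma=0.9)

# The advantages dr_t - V_phi(s_t) then drive both updates:
# the critic minimizes the squared term (line 15), and the actor's
# policy gradient is weighted by it (line 16).
values = np.array([0.8, 0.6, 1.0])  # illustrative critic outputs V_phi(s_t)
advantages = dr - values
```

Computing the returns in reverse makes the per-block cost linear in T, since each dr_t reuses dr_{t+1} instead of re-summing the discounted tail.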