Algorithm 1. The A2C-based training method.

1:  Initialize the actor and critic networks
2:  for epoch = 1 : EP do
3:      Perform new job arrival at time zero (current system time)
4:      Get current state st
5:      while step = 1 : n do    \\ n is the number of jobs for the selected instance
6:          Determine an action at based on the probability distribution at state st
7:          Select a job j from BF using action at, process job j on all machines, and obtain the finishing time of job j on each machine
8:          Push the system time forward only to the time when job j is finished on M1
9:          Perform new job arrival and update WIP at the current system time
10:         Get current state st+1 and reward rt
11:         Store the transition of this step
12:         st ← st+1
13:         if step % T == 0 then
14:             Calculate the discounted reward drt of the T steps in reverse order using the data in the stored transitions
15:             Update the critic network using its gradient
16:             Update the actor network using its gradient
17:         end if
18:     end while
19: end for
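The reverse-order computation of the discounted reward in line 14 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `discounted_rewards`, the discount factor `gamma`, and the `bootstrap_value` argument (the critic's value estimate for the state after the last stored transition, used to seed the backward recursion) are all assumptions introduced here.

```python
def discounted_rewards(rewards, bootstrap_value, gamma=0.99):
    """Compute discounted returns dr_t for the last T transitions.

    Walks the stored rewards in reverse order (as in line 14 of
    Algorithm 1), accumulating dr_t = r_t + gamma * dr_{t+1}.
    `bootstrap_value` seeds the recursion after the final step.
    """
    drs = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        drs[t] = running
    return drs

# Example with T = 3 stored rewards and gamma = 0.5:
# dr_2 = 1.0, dr_1 = 1.0 + 0.5*1.0 = 1.5, dr_0 = 1.0 + 0.5*1.5 = 1.75
print(discounted_rewards([1.0, 1.0, 1.0], 0.0, gamma=0.5))
```

The resulting dr_t values serve as targets for the critic update (line 15), and their difference from the critic's value estimates (the advantage) weights the policy-gradient update of the actor (line 16).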