Sensors. 2021 Feb 2;21(3):1019. doi: 10.3390/s21031019
Algorithm 1. The A2C-based training method.

1:  Initialize actor and critic networks π_θ, V_ϕ
2:  for epoch = 1 : EP do
3:      Perform new job arrival at time zero (current system time)
4:      Get current state s_t
5:      while step = 1 : n do  \\ n is the number of jobs for the selected instance
6:          Determine an action a_t based on probability π_θ(s_t) at state s_t
7:          Select a job j from BF using action a_t, process job j on all machines, and obtain the finish time of job j on each machine
8:          Push the system time forward only to the time when job j finishes on M1
9:          Perform new job arrival and update WIP at the current system time
10:         Get the next state s_{t+1} and reward r_t
11:         Store the transition {s_t, a_t, r_t, s_{t+1}} of this step
12:         s_t ← s_{t+1}
13:         if step % T == 0 then
14:             Calculate the discounted reward dr_t of the T steps in reverse order using the stored transitions: dr_t = r_t + γ·V_ϕ(s_{t+1}) for the T-th step; dr_t = r_t + γ·dr_{t+1} for the first T−1 steps
15:             Update the critic network V_ϕ using the gradient ∇_ϕ (dr_t − V_ϕ(s_t))²
16:             Update the actor network π_θ using the gradient ∇_θ log π_θ(a_t | s_t)[dr_t − V_ϕ(s_t)] + β·∇_θ H(π_θ(s_t; θ))
17:         end if
18:     end while
19: end for
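The core numerical step of the loop above is line 14: the n-step discounted returns dr_t are computed in reverse, bootstrapping the last step from the critic's value of s_{t+1}. A minimal NumPy sketch of that computation (the function name `discounted_returns` and the toy reward/value numbers are illustrative, not from the paper):

```python
import numpy as np

def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute dr_t for a block of T steps in reverse order (Algorithm 1, line 14).

    The T-th step bootstraps from the critic's estimate of V(s_{t+1});
    each earlier step uses dr_t = r_t + gamma * dr_{t+1}.
    """
    T = len(rewards)
    dr = np.zeros(T)
    dr[-1] = rewards[-1] + gamma * bootstrap_value  # T-th step
    for t in range(T - 2, -1, -1):                  # first T-1 steps, in reverse
        dr[t] = rewards[t] + gamma * dr[t + 1]
    return dr

# Toy usage: T = 3 stored rewards, critic value for the final next state = 0.5
rewards = [1.0, 0.0, 2.0]
dr = discounted_returns(rewards, bootstrap_value=0.5, gamma=0.9)

# The advantages dr_t - V_phi(s_t) then drive both updates:
# the critic minimizes the squared term (line 15), and the actor's
# policy gradient is weighted by it (line 16).
values = np.array([0.8, 0.6, 1.0])  # illustrative critic outputs V_phi(s_t)
advantages = dr - values
```

Computing the returns in reverse makes the per-block cost linear in T, since each dr_t reuses dr_{t+1} instead of re-summing the discounted tail.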