2024 Nov 22;24(23):7461. doi: 10.3390/s24237461
Algorithm 1 REINFORCE with baseline (Rollout).
  • Require: 

    Initial network weights $\theta$, input instance $x$

  • Ensure: 

    <IsDone, cost, output sequence P>

  •     1:

    repeat

  •     2:

       Generate the action $a_t$ with the attention network, which takes traffic conditions into account

  •     3:

       if Decode strategy = greedy then

  •     4:

           Choose the node with the highest probability

  •     5:

           The rollout provides a stable reference value based on the deterministic strategy

  •     6:

       else if Decode strategy = sample then

  •     7:

           Select the node by sampling from the probability distribution

  •     8:

           Sampling enhances the model's exploration capability

  •     9:

       end if

  •   10:

       Execute action $a_t$ and observe the new observation $s_t$

  •   11:

       Append the node selected by action $a_t$ to the sequence $P$

  •   12:

       $\tau \leftarrow \tau + 1$

  •   13:

    until Service success or failure

  •   14:

    Obtain the cumulative return $G_t^{\text{greedy}}$ of the model under the Rollout baseline

  •   15:

    Obtain the cumulative return $G_t^{\text{sample}}$ of the model under the sampling strategy

  •   16:

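    $\theta_{t+1} \leftarrow \theta_t + \alpha \left(G_t^{\text{sample}} - G_t^{\text{greedy}}\right) \nabla_\theta \log P_\theta(\pi \mid x)$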

  •   17:

    return <IsDone, $G_t^{\text{greedy}}$, P>
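
The loop in Algorithm 1 can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the attention network that considers traffic conditions is replaced by a hypothetical tabular policy `theta[i][j]` (a score for moving from node `i` to node `j`), the return is assumed to be the negative tour length over a plain distance matrix, and the greedy rollout uses the current weights rather than a separately stored baseline model. The names `masked_softmax`, `rollout`, and `train` are illustrative only.

```python
import math
import random


def masked_softmax(logits, visited):
    """Softmax over unvisited nodes; visited nodes get probability 0."""
    m = max(l for j, l in enumerate(logits) if j not in visited)
    exps = [0.0 if j in visited else math.exp(l - m)
            for j, l in enumerate(logits)]
    z = sum(exps)
    return [e / z for e in exps]


def rollout(theta, dist, strategy, rng):
    """Decode one tour from node 0.

    Returns (sequence P, cumulative return G, gradient of log P w.r.t. theta).
    """
    n = len(dist)
    cur, visited, seq = 0, {0}, [0]
    grad = [[0.0] * n for _ in range(n)]
    ret = 0.0
    while len(visited) < n:                       # repeat ... until done
        probs = masked_softmax(theta[cur], visited)
        if strategy == "greedy":                  # highest-probability node
            a = max(range(n), key=lambda j: probs[j])
        else:                                     # sample from the distribution
            a = rng.choices(range(n), weights=probs)[0]
        for j in range(n):                        # d log p(a) / d theta[cur][j]
            if j not in visited:
                grad[cur][j] += (1.0 if j == a else 0.0) - probs[j]
        ret -= dist[cur][a]                       # assumed return: -travel cost
        visited.add(a)
        seq.append(a)                             # put the node into P
        cur = a
    ret -= dist[cur][0]                           # close the tour at node 0
    return seq, ret, grad


def train(dist, alpha=0.1, steps=500, seed=0):
    """REINFORCE with a greedy-rollout baseline on a toy tabular policy."""
    rng = random.Random(seed)
    n = len(dist)
    theta = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        _, g_sample, grad = rollout(theta, dist, "sample", rng)
        _, g_greedy, _ = rollout(theta, dist, "greedy", rng)
        advantage = g_sample - g_greedy           # G^sample - G^greedy
        for i in range(n):
            for j in range(n):
                theta[i][j] += alpha * advantage * grad[i][j]
    seq, g_greedy, _ = rollout(theta, dist, "greedy", rng)
    return True, g_greedy, seq                    # <IsDone, G^greedy, P>
```

As a usage example, building `dist` from four corner points of a unit square with `math.dist` and calling `train(dist)` returns a completion flag, the greedy return, and the decoded node sequence. Subtracting the greedy return acts as a baseline: the sampled tour's log-probability is pushed up only when it beats the deterministic decode, which lowers the variance of the gradient estimate without changing its expectation.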