Foods. 2025 Aug 27;14(17):3004. doi: 10.3390/foods14173004
Algorithm 5 SLA-Aware Delivery Scheduling via Cooperative Multi-Agent RL with Context-Aware Weights and Coordination
1:  Input: Delivery queue Q, route availability R, SLA terms S, demand forecast F
2:  Agents: A_1, A_2, …, A_n (e.g., vehicle or hub controllers)
3:  Initialize: Policy π_i(s_i) for each agent i, shared critic Q(s_1, …, s_n, a_1, …, a_n)
4:  Initialize: Replay buffer R, intention buffer B
5:  Initialize: Reward weighting coefficients α_1, α_2, α_3
6:  for each training episode do
7:      Generate demand and disruptions from F
8:      Initialize global state S_0 = {s_0^(1), …, s_0^(n)} from environment
9:      for each timestep t do
10:         for each agent i do
11:             Select action a_i = π_i(s_i)    ▹ e.g., assign vehicle or reschedule
12:             Append (s_i, a_i) to B
13:         end for
14:         if conflicting vehicle assignments or resource overuse in B then
15:             Apply penalty ρ or resolve using SLA priority or distance heuristics
16:         end if
17:         Execute actions a = [a_1, …, a_n], observe s′ = [s′_1, …, s′_n]
18:         for each agent i do
19:             Observe: delay δ_i, SLA violation flag v_i, fuel used f_i, emissions e_i
20:             Extract context vector: ctx^(i) = [δ_i, v_i, e_i]
21:             Compute dynamic weights: ω_j^(i) = α_j · ctx_j^(i) / Σ_k α_k · ctx_k^(i),  j = 1, 2, 3
22:             Compute reward: r_i = −(ω_1^(i) · δ_i + ω_2^(i) · v_i + ω_3^(i) · e_i) − ρ
23:             Store transition (s_i, a_i, r_i, s′_i) in R
24:         end for
25:         Sample mini-batch from R
26:         Update shared critic Q by minimizing the temporal-difference loss:
                L = ( r + γ · Q(s′_1, …, s′_n, π_1(s′_1), …, π_n(s′_n)) − Q(s_1, …, s_n, a_1, …, a_n) )²
27:         for each agent i do
28:             Update actor policy π_i to maximize expected reward:
                ∇_{θ_i} J ≈ E[ ∇_{a_i} Q(s, a) · ∇_{θ_i} π_i(s_i) ]
29:         end for
30:         Clear B
31:     end for
32: end for
33: Output: Trained delivery policies π_1*, …, π_n*
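Steps 11-16 coordinate the agents through the intention buffer B before actions are committed: each agent posts its intended (state, action) pair, and conflicting vehicle claims are either penalized with ρ or resolved by SLA priority and a distance heuristic. The Python sketch below illustrates one plausible resolution rule; the function name, the dictionary fields, and the tie-breaking order (priority first, then distance) are illustrative assumptions rather than details taken from the listing.

from collections import defaultdict

def resolve_intention_conflicts(intentions, rho=1.0):
    """Detect and resolve conflicting vehicle claims posted to the intention buffer B.

    intentions: list of dicts with keys 'agent', 'vehicle',
                'sla_priority' (higher = more urgent) and 'distance' (km to pickup).
    Returns (accepted, penalties) where penalties maps losing agents to rho.
    """
    claims_by_vehicle = defaultdict(list)
    for intent in intentions:
        claims_by_vehicle[intent["vehicle"]].append(intent)

    accepted, penalties = [], {}
    for vehicle, claims in claims_by_vehicle.items():
        # Highest SLA priority wins the vehicle; ties broken by the shorter distance
        claims.sort(key=lambda c: (-c["sla_priority"], c["distance"]))
        accepted.append(claims[0])
        for losing in claims[1:]:
            penalties[losing["agent"]] = rho  # losing agents absorb the penalty rho
    return accepted, penalties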
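Steps 19-22 turn each agent's observed context into a scalar reward: the fixed coefficients α_1, α_2, α_3 are rescaled by the current context values, so whichever cost dominates the step (delay, SLA violation, or emissions) receives proportionally more weight, and the reward is read here as the negative weighted cost minus the conflict penalty ρ. A minimal sketch, assuming the function name, an eps guard against a zero denominator, and a conflict_penalty argument that defaults to zero when no conflict was flagged:

import numpy as np

def context_aware_reward(delay, violation, emissions, alphas,
                         conflict_penalty=0.0, eps=1e-8):
    """Context-aware reward for one agent from its context ctx^(i) = [delta_i, v_i, e_i]."""
    ctx = np.array([delay, violation, emissions], dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Step 21: dynamic weights omega_j^(i) = alpha_j * ctx_j^(i) / sum_k alpha_k * ctx_k^(i)
    weights = alphas * ctx / (np.sum(alphas * ctx) + eps)
    # Step 22: negative weighted cost minus the coordination penalty rho
    return -float(np.dot(weights, ctx)) - conflict_penalty

# Example: a late, SLA-violating delivery with moderate emissions and no conflict
r = context_aware_reward(delay=12.0, violation=1.0, emissions=3.5,
                         alphas=[0.5, 0.3, 0.2], conflict_penalty=0.0)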
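Steps 25-29 follow the usual centralized-critic, decentralized-actor pattern (MADDPG-style): one shared critic is fit to the temporal-difference target, and each actor is updated by differentiating the critic with respect to its own action only. The sketch below assumes PyTorch, omits the target networks a production implementation would normally add, feeds the critic the mean of the per-agent rewards as the scalar return r, and uses Tanh-bounded continuous actions as a stand-in for the scheduling decisions; all class and function names are illustrative.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic per-agent policy pi_i(s_i) -> a_i."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class SharedCritic(nn.Module):
    """Centralized critic Q(s_1, ..., s_n, a_1, ..., a_n)."""
    def __init__(self, obs_dim, act_dim, n_agents, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents * (obs_dim + act_dim), hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_all, act_all):
        # obs_all: [B, n, obs_dim], act_all: [B, n, act_dim]
        x = torch.cat([obs_all.flatten(1), act_all.flatten(1)], dim=-1)
        return self.net(x)

def update_step(actors, critic, critic_opt, actor_opts, batch, gamma=0.95):
    """One mini-batch update of the shared critic and all actors (steps 25-29)."""
    obs, acts, rews, next_obs = batch  # [B,n,obs_dim], [B,n,act_dim], [B,n], [B,n,obs_dim]

    # Critic: TD target evaluates each agent's current policy at the next state.
    # Assumption: the mean per-agent reward stands in for the scalar r in the loss.
    with torch.no_grad():
        next_acts = torch.stack(
            [pi(next_obs[:, i]) for i, pi in enumerate(actors)], dim=1)
        target = rews.mean(dim=1, keepdim=True) + gamma * critic(next_obs, next_acts)
    td_loss = nn.functional.mse_loss(critic(obs, acts), target)
    critic_opt.zero_grad()
    td_loss.backward()
    critic_opt.step()

    # Actors: each agent ascends grad_{a_i} Q(s, a) through its own action only,
    # keeping the other agents' actions fixed to the sampled batch values.
    for i, (pi, opt) in enumerate(zip(actors, actor_opts)):
        joint_acts = torch.stack(
            [pi(obs[:, i]) if j == i else acts[:, j] for j in range(len(actors))], dim=1)
        actor_loss = -critic(obs, joint_acts).mean()
        opt.zero_grad()
        actor_loss.backward()
        opt.step()

In practice, critic_opt would be a single optimizer (e.g., torch.optim.Adam) over the critic parameters and actor_opts a list with one optimizer per actor, matching the per-agent policy updates in the listing.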