Algorithm 4 Sustainability-Aware Inventory Management via Actor–Critic RL with Context-Aware Weights and Coordination
 1: Input: Local inventory state s = [stock level, demand forecast, shelf life, carbon score]
 2: Initialize: Actor network μ_θ(s), critic network Q_ϕ(s, a), replay buffer R
 3: Initialize: Shared intention buffer B, coefficients α_1, α_2, α_3
 4: for each episode do
 5:     Observe global inventory state S = {s^(1), s^(2), …, s^(n)} and local state s_0
 6:     for each timestep t do
 7:         Select order quantity a_t = μ_θ(s_t)
 8:         Append (s_t, a_t) to shared buffer B
 9:         if conflict detected in B (e.g., stock over-allocation or supply contention) then
10:             Apply coordination penalty ρ or reassign a_t
11:         end if
12:         Optionally exchange supply information with peers (e.g., via blockchain or DIDs)
13:         Execute a_t, observe new state s_{t+1}
14:         Compute spoilage loss L_spoil from overstocked perishables
15:         Compute holding cost H_t and emissions E_t from delivery
16:         Define context vector: ctx = [L_spoil, H_t, E_t]
17:         Compute dynamic weights: ω_j = (α_j · ctx_j) / (Σ_k α_k · ctx_k),  j = 1, 2, 3
18:         Compute reward: r_t = −(ω_1 · L_spoil + ω_2 · H_t + ω_3 · E_t) − ρ
19:         Store transition (s_t, a_t, r_t, s_{t+1}) in R
20:         Sample mini-batch from R
21:         Update critic: L_critic = (r + γ Q_ϕ(s′, μ_θ(s′)) − Q_ϕ(s, a))^2
22:         Update actor via policy gradient: ∇_θ J ≈ E_{s∼R}[∇_a Q_ϕ(s, a) · ∇_θ μ_θ(s)]
23:         Clear B
24:     end for
25: end for
26: Output: Trained inventory ordering policy μ*(s)
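
The following minimal sketch illustrates the context-aware weighting and reward of steps 17–18 in Python/NumPy. It is a hypothetical rendering: the coefficient values for α_1–α_3 and the function names are illustrative assumptions, not values taken from Algorithm 4.

import numpy as np

# Illustrative mixing coefficients alpha_1..alpha_3; Algorithm 4 does not fix their values.
ALPHA = np.array([0.5, 0.3, 0.2])

def dynamic_weights(ctx, alpha=ALPHA, eps=1e-8):
    # Step 17: omega_j = (alpha_j * ctx_j) / sum_k(alpha_k * ctx_k)
    weighted = alpha * ctx
    return weighted / (weighted.sum() + eps)

def reward(spoilage_loss, holding_cost, emissions, coordination_penalty=0.0):
    # Step 18: r_t = -(omega_1 * L_spoil + omega_2 * H_t + omega_3 * E_t) - rho
    ctx = np.array([spoilage_loss, holding_cost, emissions])
    omega = dynamic_weights(ctx)
    return -float(omega @ ctx) - coordination_penalty

# Example: a timestep dominated by spoilage, with a small coordination penalty rho.
r_t = reward(spoilage_loss=12.0, holding_cost=3.5, emissions=1.2, coordination_penalty=0.5)

Because the weights are renormalized each timestep, whichever cost term currently dominates the context vector also dominates the penalty in the reward.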
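Steps 21–22 correspond to a deterministic actor–critic update of the DDPG type, since the actor outputs the order quantity directly as a_t = μ_θ(s_t). The sketch below is one possible PyTorch realisation under assumed network sizes and optimiser settings; the actual architectures and hyperparameters are not specified in Algorithm 4.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # mu_theta(s): maps the 4-dimensional local state to a non-negative order quantity.
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Softplus(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    # Q_phi(s, a): state-action value of an ordering decision.
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s_next = batch  # mini-batch sampled from replay buffer R (step 20)

    # Step 21: L_critic = (r + gamma * Q_phi(s', mu_theta(s')) - Q_phi(s, a))^2
    with torch.no_grad():
        target = r + gamma * critic(s_next, actor(s_next))
    critic_loss = ((target - critic(s, a)) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 22: ascend grad_a Q_phi(s, a) * grad_theta mu_theta(s),
    # implemented here as minimising -Q_phi(s, mu_theta(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()

# Usage with randomly generated transitions (shapes only; not real inventory data):
actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
batch = (torch.randn(32, 4), torch.rand(32, 1), torch.randn(32, 1), torch.randn(32, 4))
update(actor, critic, actor_opt, critic_opt, batch)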