Foods. 2025 Aug 27;14(17):3004. doi: 10.3390/foods14173004
Algorithm 1 Perishable-Aware Route Optimization via Q-Learning with Context-Aware Weights and Conflict Avoidance

1:  Input: Cold-chain graph G(V, E), perishability profile P, disruption model D, emission matrix C
2:  Initialize: Q-table Q(s, a) ← 0; learning rate α; discount factor γ; exploration rate ϵ
3:  Initialize: Static priority coefficients α1, α2, α3
4:  Initialize: Shared intention buffer B ← ∅                      ▹ For coordination
5:  for each episode do
6:      Initialize joint global state S_0 = {s_0^(1), …, s_0^(n)} using D
7:      while shipment not delivered do
8:          for each routing agent i do
9:              With probability ϵ, choose random action a^(i)
10:             Otherwise, choose a^(i) ← argmax_a Q(s^(i), a)
11:             Append (s^(i), a^(i)) to B                         ▹ Declare action intention
12:         end for
13:         Detect conflicts in B (e.g., duplicate vehicle or route allocation)
14:         if conflict detected then
15:             Apply coordination penalty ρ or reassign conflicting agent(s) via tie-breaking
16:         end if
17:         for each agent i do
18:             Execute a^(i); observe next state s′^(i), travel time t^(i), temperature deviation ΔT^(i), emissions e^(i)
19:             Compute spoilage risk: σ^(i) ← f(P, ΔT^(i))
20:             Extract context vector: ctx^(i) = [ΔT^(i), traffic, SLA priority]
21:             Compute dynamic weights: ω_j^(i) = α_j · ctx_j^(i) / Σ_{k=1}^{3} α_k · ctx_k^(i),  j = 1, 2, 3
22:             Compute context-aware reward: r^(i) = −(ω_1^(i) t^(i) + ω_2^(i) σ^(i) + ω_3^(i) e^(i)) − ρ
23:             Update Q-table: Q(s^(i), a^(i)) ← Q(s^(i), a^(i)) + α [ r^(i) + γ max_{a′} Q(s′^(i), a′) − Q(s^(i), a^(i)) ]
24:             Update state: s^(i) ← s′^(i)
25:         end for
26:         Clear intention buffer: B ← ∅
27:     end while
28: end for
29: Output: Learned policies π_i*(s) = argmax_a Q(s, a) for all agents i
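
To make steps 19–23 concrete, the following Python sketch implements the context-aware weighting, reward, and tabular Q-update for a single agent. The action set, coefficient values, and the linear form of the spoilage function f are illustrative assumptions; the paper does not prescribe a concrete implementation.

from collections import defaultdict

# Hypothetical constants (assumed values, not taken from the paper)
ACTIONS = ["route_A", "route_B", "route_C"]   # candidate route choices
ALPHA_PRIORS = (0.5, 0.3, 0.2)                # static priority coefficients α1, α2, α3
LR, GAMMA = 0.1, 0.9                          # learning rate α, discount factor γ

def spoilage_risk(perishability, temp_dev):
    # σ ← f(P, ΔT): assumed here to grow linearly with positive temperature deviation
    return perishability * max(temp_dev, 0.0)

def dynamic_weights(ctx):
    # ω_j = α_j · ctx_j / Σ_k α_k · ctx_k   (step 21)
    scores = [a * c for a, c in zip(ALPHA_PRIORS, ctx)]
    total = sum(scores) or 1.0
    return [s / total for s in scores]

def context_reward(travel_time, sigma, emissions, ctx, penalty=0.0):
    # r = -(ω1·t + ω2·σ + ω3·e) - ρ   (step 22; penalty ρ applied only on conflict)
    w1, w2, w3 = dynamic_weights(ctx)
    return -(w1 * travel_time + w2 * sigma + w3 * emissions) - penalty

def q_update(Q, state, action, reward, next_state):
    # Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]   (step 23)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += LR * (reward + GAMMA * best_next - Q[(state, action)])

# Example: one learning step for a single agent with made-up sensor readings
Q = defaultdict(float)
ctx = [2.5, 0.7, 1.0]                      # [ΔT, traffic, SLA priority]
sigma = spoilage_risk(perishability=0.8, temp_dev=2.5)
r = context_reward(travel_time=45.0, sigma=sigma, emissions=12.0, ctx=ctx)
q_update(Q, state="hub_1", action="route_B", reward=r, next_state="hub_2")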
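
The conflict-avoidance mechanism (steps 11–16 and 26) can be sketched in the same way: each agent declares a (state, action) intention in the shared buffer B, duplicate resource claims are detected, and the offending agents receive the coordination penalty ρ before their rewards are computed. The tie-breaking rule used here (the lowest-indexed claimant keeps the contested route, the others are penalized) is an assumption for illustration only.

import random
from collections import defaultdict

RHO = 1.0          # coordination penalty ρ (assumed value)
EPSILON = 0.2      # exploration rate ϵ (assumed value)
ACTIONS = ["route_A", "route_B", "route_C"]

def select_action(Q, state):
    # ϵ-greedy selection (steps 9-10)
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def detect_conflicts(intentions):
    # Steps 13-15: agents claiming the same route/vehicle are in conflict;
    # the lowest-indexed claimant keeps it, the rest are flagged (tie-breaking assumption).
    claimed, penalised = {}, set()
    for agent_id, (_, action) in intentions.items():
        if action in claimed:
            penalised.add(agent_id)
        else:
            claimed[action] = agent_id
    return penalised

# One coordination round for three hypothetical agents
Qs = {i: defaultdict(float) for i in range(3)}
states = {0: "hub_1", 1: "hub_1", 2: "hub_2"}

intentions = {i: (states[i], select_action(Qs[i], states[i])) for i in Qs}   # step 11
penalised = detect_conflicts(intentions)                                     # steps 13-15
penalties = {i: (RHO if i in penalised else 0.0) for i in Qs}
# Each agent would now execute its action and pass penalties[i] into its reward computation
intentions.clear()                                                           # step 26: B ← ∅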