Algorithm 4 Sustainability-Aware Inventory Management via Actor–Critic RL with Context-Aware Weights and Coordination
 1: Input: Local inventory state s = [stock level, demand forecast, shelf life, carbon score]
 2: Initialize: Actor network μ_θ(s), critic network Q_ϕ(s, a), replay buffer R
 3: Initialize: Shared intention buffer B, coefficients α_1, α_2, α_3
 4: for each episode do
 5:     Observe global inventory state S = {s^(1), s^(2), …, s^(n)} and local state s_0
 6:     for each timestep t do
 7:         Select order quantity a_t = μ_θ(s_t)
 8:         Append (s_t, a_t) to shared buffer B
 9:         if conflict detected in B (e.g., stock over-allocation or supply contention) then
10:             Apply coordination penalty ρ or reassign a_t
11:         end if
12:         Optionally exchange supply information with peers (e.g., via blockchain or DIDs)
13:         Execute a_t, observe new state s_{t+1}
14:         Compute spoilage loss L_spoil from overstocked perishables
15:         Compute holding cost H_t and emissions E_t from delivery
16:         Define context vector: ctx = [L_spoil, H_t, E_t]
17:         Compute dynamic weights: ω_j = (α_j · ctx_j) / (Σ_k α_k · ctx_k),  j = 1, 2, 3
18:         Compute reward: r_t = −(ω_1 · L_spoil + ω_2 · H_t + ω_3 · E_t) − ρ
19:         Store transition (s_t, a_t, r_t, s_{t+1}) in R
20:         Sample mini-batch from R
21:         Update critic: L_critic = (r + γ Q_ϕ(s′, μ_θ(s′)) − Q_ϕ(s, a))^2
22:         Update actor via policy gradient: ∇_θ J ≈ E_{s∼R}[∇_a Q_ϕ(s, a) · ∇_θ μ_θ(s)]
23:         Clear B
24:     end for
25: end for
26: Output: Trained inventory ordering policy μ*(s)
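
The following minimal sketch illustrates the context-aware weighting and reward of steps 17–18 in Python/NumPy. It is a hypothetical rendering: the coefficient values for α_1–α_3 and the function names are illustrative assumptions, not values taken from Algorithm 4.

import numpy as np

# Illustrative mixing coefficients alpha_1..alpha_3; Algorithm 4 does not fix their values.
ALPHA = np.array([0.5, 0.3, 0.2])

def dynamic_weights(ctx, alpha=ALPHA, eps=1e-8):
    # Step 17: omega_j = (alpha_j * ctx_j) / sum_k(alpha_k * ctx_k)
    weighted = alpha * ctx
    return weighted / (weighted.sum() + eps)

def reward(spoilage_loss, holding_cost, emissions, coordination_penalty=0.0):
    # Step 18: r_t = -(omega_1 * L_spoil + omega_2 * H_t + omega_3 * E_t) - rho
    ctx = np.array([spoilage_loss, holding_cost, emissions])
    omega = dynamic_weights(ctx)
    return -float(omega @ ctx) - coordination_penalty

# Example: a timestep dominated by spoilage, with a small coordination penalty rho.
r_t = reward(spoilage_loss=12.0, holding_cost=3.5, emissions=1.2, coordination_penalty=0.5)

Because the weights are renormalized each timestep, whichever cost term currently dominates the context vector also dominates the penalty in the reward.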
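Steps 21–22 correspond to a deterministic actor–critic update of the DDPG type, since the actor outputs the order quantity directly as a_t = μ_θ(s_t). The sketch below is one possible PyTorch realisation under assumed network sizes and optimiser settings; the actual architectures and hyperparameters are not specified in Algorithm 4.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # mu_theta(s): maps the 4-dimensional local state to a non-negative order quantity.
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Softplus(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    # Q_phi(s, a): state-action value of an ordering decision.
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s_next = batch  # mini-batch sampled from replay buffer R (step 20)

    # Step 21: L_critic = (r + gamma * Q_phi(s', mu_theta(s')) - Q_phi(s, a))^2
    with torch.no_grad():
        target = r + gamma * critic(s_next, actor(s_next))
    critic_loss = ((target - critic(s, a)) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 22: ascend grad_a Q_phi(s, a) * grad_theta mu_theta(s),
    # implemented here as minimising -Q_phi(s, mu_theta(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()

# Usage with randomly generated transitions (shapes only; not real inventory data):
actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
batch = (torch.randn(32, 4), torch.rand(32, 1), torch.randn(32, 1), torch.randn(32, 4))
update(actor, critic, actor_opt, critic_opt, batch)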