Carbon-Efficient Scheduling in Fresh Food Supply Chains with a Time-Window-Constrained Deep Reinforcement Learning Model

Sensors. 2024 Nov 22;24(23):7461. doi: 10.3390/s24237461

Algorithm 1 REINFORCE with baseline (Rollout).

Require: Initial network weights $θ$, input instance $x$
Ensure: $\langle \mathit{IsDone}, \sum \mathit{cost}, \mathit{OutputSequence}\ P \rangle$
1: repeat
2:   Generate the action $a_{t}$ with the attention network, which takes traffic conditions into account
3:   if decode strategy = greedy then
4:     Choose the node with the highest probability
5:     (The greedy rollout provides a stable reference value under a deterministic strategy)
6:   else if decode strategy = sample then
7:     Select the node by sampling from the probability distribution
8:     (Sampling enhances the model's exploration capability)
9:   end if
10:  Execute action $a_{t}$ and observe the new state $s_{t+1}$
11:  Append the node selected by action $a_{t}$ to the sequence $P$
12:  $t \leftarrow t + 1$
13: until service success or failure
14: Obtain the cumulative return $G_{t}^{\mathrm{greedy}}$ of the model under the rollout baseline
15: Obtain the cumulative return $G_{t}^{\mathrm{sample}}$ of the model under the sampling strategy
16: $θ_{t+1} \leftarrow θ_{t} + α \left( G_{t}^{\mathrm{sample}} - G_{t}^{\mathrm{greedy}} \right) \nabla_{θ} \log P_{θ}(π \mid x)$
17: return $\langle \mathit{IsDone}, G_{t}^{\mathrm{greedy}}, P \rangle$