Algorithm 1 REINFORCE with baseline (Rollout).
Require: Initial network weights, input instance x
Ensure: Sequence P
 1: repeat
 2:     Generate the corresponding action through attention networks that take traffic conditions into account
 3:     if Decode strategy = greedy then
 4:         Choose the node with the highest probability
 5:         The rollout provides a stable reference value under the deterministic strategy
 6:     else if Decode strategy = sample then
 7:         Select the node by sampling from the probability distribution
 8:         Sampling strengthens the model's exploration capability
 9:     end if
10:     Execute the action and observe the new observation
11:     Append the node selected by action a_t to the sequence P
12:
13: until Service success or failure
14: Obtain the cumulative return of the model under the rollout baseline
15: Obtain the cumulative return of the model under the sampling strategy
16:
17: return
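The control flow of Algorithm 1 can be illustrated with a small, self-contained sketch. It is not the paper's implementation. The attention network is replaced by a hypothetical one-parameter policy whose logits are `-theta * distance`, traffic conditions are ignored, the return is taken to be the negative route length, and the loop ends when every node has been visited instead of on service success or failure. The greedy rollout uses the current parameters as the baseline, which is also an assumption. The names `decode`, `reinforce_step`, `theta`, and `lr` are illustrative only.

```python
import math
import random


def decode(theta, coords, strategy, rng):
    """Build a visiting sequence P; return (P, route length, d log p(P) / d theta)."""
    current, unvisited = 0, list(range(1, len(coords)))
    P, cost, grad_logp = [0], 0.0, 0.0
    while unvisited:  # stand-in for "until service success or failure"
        dists = [math.dist(coords[current], coords[j]) for j in unvisited]
        # Toy policy (assumption): softmax over logits = -theta * distance.
        logits = [-theta * d for d in dists]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        probs = [w / total for w in weights]
        if strategy == "greedy":
            # Deterministic choice: the node with the highest probability.
            k = max(range(len(probs)), key=probs.__getitem__)
        elif strategy == "sample":
            # Stochastic choice: sample a node from the distribution.
            k = rng.choices(range(len(probs)), weights=probs)[0]
        else:
            raise ValueError(f"unknown decode strategy: {strategy}")
        # For this softmax, d/dtheta log p_k = -d_k + sum_j p_j * d_j.
        grad_logp += -dists[k] + sum(p * d for p, d in zip(probs, dists))
        cost += dists[k]
        current = unvisited.pop(k)
        P.append(current)  # put the selected node into the sequence P
    return P, cost, grad_logp


def reinforce_step(theta, coords, rng, lr=0.05):
    """One REINFORCE update using the greedy rollout as the baseline."""
    _, cost_sample, grad_logp = decode(theta, coords, "sample", rng)
    _, cost_greedy, _ = decode(theta, coords, "greedy", rng)
    # Advantage of the sampled route over the rollout baseline (lower cost is better).
    advantage = cost_sample - cost_greedy
    # Gradient descent on the expected cost: advantage * grad of log-probability.
    return theta - lr * advantage * grad_logp


rng = random.Random(0)
coords = [(rng.random(), rng.random()) for _ in range(8)]
theta = 1.0
for _ in range(200):
    theta = reinforce_step(theta, coords, rng)
P, cost, _ = decode(theta, coords, "greedy", rng)
```

Because the baseline is the cost of the greedy route, the update is zero whenever the sampled route matches the greedy one. The parameters move only when sampling finds a route that is better or worse than the deterministic reference.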