Algorithm 2: PPO-Based Deep Optimistic Linear Support

 1: Initialization:
 2:   Partial CSS, composed of the solutions obtained after PPO learning.
 3:   Corner weights, obtained from the partial CSS.
 4:   Priority queue of weights for the multi-objective problem, where each weight forms a tuple together with its importance.
 5: Pop a weight from the priority queue.
 6: for iteration = 1, 2, … do
 7:   for iteration = 1, 2, …, T do
 8:
 9:
10:     Reduce scaling of
11:
12:     Compute the advantage estimate from Equation (26)
13:
14:   end for
15:   Optimize surrogate L and wrt from , with K epochs
16:   Optimize and wrt from , with K epochs
17:
18: end for upon convergence
19:
20:
21: if then
22:   = remove obsolete due to new
23:   = new corner weight from
24:
25:   = remove obsolete due to new
26:   for iteration = 1, 2, … do
27:     if estimate improvement of then
28:
29:     end if
30:   end for
31: end if
32: if the priority queue is not empty then
33:   go back to line 1
34: end if
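The outer loop of the listing above (pop a weight, solve the scalarized problem, update the coverage set, queue new corner weights by estimated improvement) can be illustrated with a minimal two-objective sketch. This is not the paper's implementation. The PPO inner loop is replaced by a hypothetical exact solver, `solve_scalarized`, that picks the best vector from a fixed, made-up candidate set, and all names (`CANDIDATES`, `ols`, `corner_weights`, `envelope`) are assumptions introduced for illustration. The priority is an optimistic bound that relies on the convexity of the optimal scalarized value in the weight, which holds for an exact solver but only approximately for a learned policy.

```python
import heapq
import math

# Hypothetical value vectors; a stand-in for what PPO training would return.
CANDIDATES = [(10.0, 0.0), (8.0, 6.0), (5.0, 8.0), (0.0, 10.0), (4.0, 4.0)]

def scalarize(v, w1):
    """Scalarized value of vector v under weights (w1, 1 - w1)."""
    return w1 * v[0] + (1.0 - w1) * v[1]

def solve_scalarized(w1):
    """Stand-in for the inner PPO loop: exact best candidate for weight w1."""
    return max(CANDIDATES, key=lambda v: scalarize(v, w1))

def envelope(S, w1):
    """Best scalarized value achievable with the current set S."""
    return max(scalarize(v, w1) for v in S)

def corner_weights(S):
    """Interior weights where two vectors of S tie on the upper envelope."""
    corners = set()
    for i, a in enumerate(S):
        for b in S[i + 1:]:
            denom = (a[0] - a[1]) - (b[0] - b[1])
            if abs(denom) < 1e-12:
                continue  # parallel lines never intersect
            w1 = (b[1] - a[1]) / denom
            if 0.0 < w1 < 1.0 and abs(scalarize(a, w1) - envelope(S, w1)) < 1e-9:
                corners.add(round(w1, 9))
    return corners

def ols(solve, eps=1e-9):
    S = []        # partial coverage set of value vectors
    visited = {}  # weight -> scalarized value found by the solver there
    # heapq is a min-heap, so priorities are negated; extrema come first.
    Q = [(-math.inf, 0.0), (-math.inf, 1.0)]
    heapq.heapify(Q)
    while Q:  # repeat while the priority queue is not empty
        _, w1 = heapq.heappop(Q)
        if w1 in visited:
            continue
        v = solve(w1)
        visited[w1] = scalarize(v, w1)
        if S and visited[w1] <= envelope(S, w1) + eps:
            continue  # no improvement at this weight
        S.append(v)
        # Drop queued corner weights made obsolete by the new vector,
        # keeping only extrema that have not been visited yet.
        Q = [item for item in Q if math.isinf(item[0])]
        heapq.heapify(Q)
        for c in corner_weights(S):
            if c in visited:
                continue
            lo = max(w for w in visited if w < c)
            hi = min(w for w in visited if w > c)
            t = (c - lo) / (hi - lo)
            # Convexity bound: the optimum at c lies below the chord lo..hi.
            upper = (1.0 - t) * visited[lo] + t * visited[hi]
            gain = upper - envelope(S, c)
            if gain > eps:  # queue only if an improvement is still possible
                heapq.heappush(Q, (-gain, c))
    return S

if __name__ == "__main__":
    print(sorted(ols(solve_scalarized)))
```

With this toy candidate set the loop visits the two extreme weights first and then only the corner weights it generates, so the dominated vector `(4.0, 4.0)` is never added. Swapping `solve_scalarized` for a trained policy's value estimate would make the bound heuristic rather than exact.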