Algorithm 1 Policy and Model Optimization (PMO) |
- 1: Input: and Stage 01
- 2: Output:
- 3: Stage 02 as per Equation (1)
- 4: Initialize , ,
- 5: while not done do:
- 6: Episode = sample()
- 7: for do:
- 8:
- 9: Append:
- 10:
- 11:
- 12: Every and do:
- 13: Compute loss for DBM: MSE ()
- 14: DBM Optimization: Stage 02 on repeat
- 15: Compute Behavioral loss: () From Equations (3) and (4)
- 16: Compute policy loss: () From Equations (3) and (4)
- 17: Policy Training & Optimization: Stage 03
|
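The control flow of Algorithm 1 (sample an episode, append its transitions to a buffer, then periodically refit the model with an MSE objective and update the policy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional toy environment, the linear model `w`, the linear policy gain `k`, the function names `sample_episode` and `train`, and all hyperparameters. The model is fit by closed-form least squares, which minimizes the same MSE but stands in for the Stage 02 procedure. The policy update uses a simple surrogate loss (the squared model-predicted next state) rather than the behavioral and policy losses of Equations (3) and (4), which are not reproduced here.

```python
import numpy as np


def sample_episode(k, rng, horizon=20):
    """Roll out the linear policy a = k*s in a hypothetical toy environment."""
    s = rng.normal()
    episode = []
    for _ in range(horizon):
        a = k * s + 0.5 * rng.normal()                    # action plus exploration noise
        s_next = 0.9 * s + 0.5 * a + 0.01 * rng.normal()  # assumed true dynamics
        episode.append((s, a, s_next))
        s = s_next
    return episode


def train(num_episodes=200, update_every=5, lr_policy=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(2)   # model parameters: s_next is predicted as w[0]*s + w[1]*a
    k = 0.0           # policy gain
    buffer = []
    mse = float("nan")

    for episode_idx in range(1, num_episodes + 1):
        # Sample an episode and append every transition to the buffer.
        for transition in sample_episode(k, rng):
            buffer.append(transition)

        # Every `update_every` episodes: refit the model, then update the policy.
        if episode_idx % update_every == 0:
            s, a, s_next = np.array(buffer[-500:]).T
            X = np.stack([s, a], axis=1)

            # Model step: least squares minimizes the MSE between the
            # predicted and the observed next state.
            w, *_ = np.linalg.lstsq(X, s_next, rcond=None)
            mse = float(np.mean((X @ w - s_next) ** 2))

            # Policy step (surrogate, an assumption): one gradient step on
            # mean((w[0] + w[1]*k)^2 * s^2), the squared predicted next state
            # under the noise-free policy action a = k*s.
            grad_k = np.mean(2.0 * (w[0] + w[1] * k) * w[1] * s * s)
            k -= lr_policy * grad_k

    return w, k, mse


if __name__ == "__main__":
    w, k, mse = train()
    print("model parameters:", w)
    print("policy gain:", k)
    print("model MSE:", mse)
```

In this toy setting the fitted model parameters should approach the assumed dynamics coefficients, and the policy gain should move toward the value that cancels the state under the learned model. The alternation of model fitting and policy updates on a shared buffer is the only part meant to mirror the algorithm's structure.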