Algorithm 2 Student Policy OpTimization (SPOT)
1: Input: and Stage 01
2: Output:
3: Stage 02 as per Equation (1)
4: Initialize , ,
5: while not done do:
6:     Episode = sample()
7:     for do:
8:
9:         Append:
10:        Policy Buffer:
11:    Every and do:
12:        Compute Loss: From Equation (2)
13:        Policy Training: Stage 03
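The control flow of the listing above (sample an episode, append its steps to a policy buffer, and periodically compute a loss and train the policy) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the algorithm: the names `sample_episode`, `placeholder_loss`, and `run_training_loop`, the one-parameter "student", the squared-error loss (a stand-in, not Equation (2)), and the choice to trigger an update every `update_every` episodes.

```python
import random
from collections import deque


def sample_episode(rng, length=5):
    """Hypothetical stand-in for sample(): a list of (observation, target) pairs."""
    return [(rng.random(), rng.choice([0.0, 1.0])) for _ in range(length)]


def placeholder_loss(batch, weight):
    """Mean squared error of a one-parameter student. NOT Equation (2)."""
    return sum((weight * obs - tgt) ** 2 for obs, tgt in batch) / len(batch)


def run_training_loop(num_episodes=20, update_every=5, buffer_size=100,
                      lr=0.1, seed=0):
    rng = random.Random(seed)            # local RNG keeps the run reproducible
    buffer = deque(maxlen=buffer_size)   # policy buffer (bounded size is an assumption)
    weight = 0.0                         # toy student parameter
    losses = []

    for episode_idx in range(num_episodes):   # stands in for "while not done"
        episode = sample_episode(rng)
        for transition in episode:            # inner "for ... do" loop
            buffer.append(transition)         # "Append" to the policy buffer

        # Periodic update; the trigger condition is assumed, not from the source.
        if (episode_idx + 1) % update_every == 0:
            batch = list(buffer)
            losses.append(placeholder_loss(batch, weight))   # "Compute Loss"
            grad = sum(2 * (weight * o - t) * o for o, t in batch) / len(batch)
            weight -= lr * grad                              # "Policy Training"

    return weight, losses, len(buffer)
```

Calling `run_training_loop()` with the defaults runs 20 episodes of 5 steps each and performs one update after every fifth episode. The sketch only mirrors the loop structure; the actual inputs, initialization, and loss would have to be taken from Equations (1) and (2).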