| Algorithm 1 VIB-based meta-reinforcement learning training algorithm. |
| --- |
| 1: Input: training data set; learning rate; meta-training task set. |
| 2: Output: policy network and latent space encoder. |
| 3: Set the target network parameters and the initial sample set for each task |
| 4: for each epoch do |
| 5: for do |
| 6: Initialize each trajectory |
| 7: for do |
| 8: Latent space inference |
| 9: The current policy interacts with each task and obtains samples |
| 10: Update |
| 11: end for |
| 12: end for |
| 13: for each step do |
| 14: for do |
| 15: Sample from the training data set |
| 16: Latent space inference |
| 17: Compute the action-state value function |
| 18: Compute the state value function |
| 19: Compute the policy cost function |
| 20: Compute the latent space encoder cost function |
| 21: end for |
| 22: Update the action-state value function network |
| 23: Update the state value function network |
| 24: Update the policy network |
| 25: Update the latent space encoder network |
| 26: end for |
| 27: Update the target network |
| 28: end for |
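
As an illustration of three operations named in Algorithm 1 (latent space inference, the latent space encoder cost, and the target network update), the following Python sketch shows one common way each is realized in VIB-style methods. It is a minimal sketch under stated assumptions, not the implementation used here: it assumes a diagonal Gaussian posterior over the latent variable, a standard normal prior for the KL term of the encoder cost, and a soft (Polyak) target update with a hypothetical rate `tau`. The function names and the flat parameter lists are illustrative only.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Latent space inference (assumed form): draw z = mean + std * eps
    from a diagonal Gaussian using the reparameterization trick."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """Encoder regularizer (assumed form): KL(N(mean, diag(exp(log_var))) || N(0, I)),
    summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


def soft_update(target, online, tau):
    """Target network update (assumed form): Polyak averaging,
    target <- (1 - tau) * target + tau * online, element by element."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


# Usage with made-up numbers: a 2-dimensional latent and a 1-parameter network.
rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)
kl = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
new_target = soft_update([0.0], [1.0], tau=0.1)
```

In a full implementation the KL term would typically be weighted by a bottleneck coefficient and added to the encoder cost, and the parameter lists would be replaced by the tensors of the actual networks.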