Variational Information Bottleneck Regularized Deep Reinforcement Learning for Efficient Robotic Skill Adaptation

2023 Jan 9;23(2):762. doi: 10.3390/s23020762

Algorithm 1. VIB-based meta-reinforcement learning training algorithm.

1: Input: $D$: training data set; $η_{θ}, η_{φ}, η_{ψ}, η_{ω}$: learning rates; $\{T_{m}\}_{m = 1, \dots, M} \sim p(T)$: meta-training task set.
2: Output: policy network $π_{θ}$ and latent space encoder $E_{ω}$.
3: Initialize the target network parameters: $\bar{ψ} \leftarrow ψ$; initialize the sample set $D^{m}$ for each task.
4: for each epoch do
5:   for each task $T_{m}$ do
6:     Initialize the trajectory for the task: $e^{T_{m}} = \{\}$
7:     for $k = 1, \dots, N$ do
8:       Latent space inference: $z \sim E_{ω}(z | e^{T_{m}})$
9:       The current policy $π_{θ}(a | s, z)$ interacts with the task and adds the collected samples to $D^{m}$
10:       Update $e^{T_{m}} = \{(s_{j}, a_{j}, s_{j}^{'}, r_{j})\}_{j = 1, \dots, K} \sim D^{m}$
11:     end for
12:   end for
13:   for each training step do
14:     for each task $T_{m}$ do
15:       Sample from the training data set: $e^{T_{m}}, d^{m} \sim D^{m}$
16:       Latent space inference: $z \sim E_{ω}(z | e^{T_{m}})$
17:       Compute the action-state value cost function: $J^{m}(Q) = J_{E_{φ}}(d^{m}, z)$
18:       Compute the state value cost function: $J^{m}(V) = J_{E_{ψ}}(d^{m}, z)$
19:       Compute the policy cost function: $J^{m}(π) = J_{E_{θ}}(d^{m}, z)$
20:       Compute the latent space encoder cost function: $J^{m}(E) = J_{E_{ω}}(d^{m}, z)$
21:     end for
22:     Update the action-state value function network: $φ_{t + 1} \leftarrow φ_{t} - η_{φ} \hat{\nabla}_{φ} \sum_{m} J^{m}(Q)$
23:     Update the state value function network: $ψ_{t + 1} \leftarrow ψ_{t} - η_{ψ} \hat{\nabla}_{ψ} \sum_{m} J^{m}(V)$
24:     Update the policy network: $θ_{t + 1} \leftarrow θ_{t} - η_{θ} \hat{\nabla}_{θ} \sum_{m} J^{m}(π)$
25:     Update the latent space encoder network: $ω_{t + 1} \leftarrow ω_{t} - η_{ω} \hat{\nabla}_{ω} \sum_{m} J^{m}(E)$
26:   end for
27:   Update the target network: $\bar{ψ} \leftarrow ψ$
28: end for
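
The latent inference and target-network steps of Algorithm 1 can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear mean-pooling encoder, the diagonal Gaussian posterior, the standard normal prior in the KL term, the toy sizes, and the quadratic stand-in loss are all assumptions made for illustration, and every function and variable name below is hypothetical. The sketch also assumes the target parameters are copied from the current value network.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_context(context, W_mu, W_logvar):
    # Hypothetical linear encoder E_omega: mean-pool the K transitions,
    # then map the pooled vector to the mean and log-variance of q(z | e).
    pooled = context.mean(axis=0)
    return pooled @ W_mu, pooled @ W_logvar

def sample_latent(mu, logvar, rng):
    # Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I).
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), assumed here as the
    # information bottleneck penalty on the latent variable.
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# Toy sizes (assumptions): M tasks, K transitions per context,
# each transition flattened to `feat` numbers, latent dimension `dz`.
M, K, feat, dz = 3, 8, 6, 2
W_mu = 0.1 * rng.standard_normal((feat, dz))
W_logvar = 0.1 * rng.standard_normal((feat, dz))

psi = rng.standard_normal(4)  # stand-in for the state value network parameters
psi_bar = psi.copy()          # target network initialized as a copy (step 3)

kl_total = 0.0
latents = []
for m in range(M):                                 # loop over tasks (steps 14-21)
    context = rng.standard_normal((K, feat))       # stand-in for e^{T_m} ~ D^m
    mu, logvar = encode_context(context, W_mu, W_logvar)
    z = sample_latent(mu, logvar, rng)             # latent inference (step 16)
    latents.append(z)
    kl_total += kl_to_standard_normal(mu, logvar)  # accumulated over tasks

# Stand-in for step 23: one gradient step on the toy loss 0.5 * ||psi||^2,
# whose gradient with respect to psi is psi itself.
eta_psi = 0.1
psi = psi - eta_psi * psi

psi_bar = psi.copy()          # synchronize the target network (step 27)
```

In a full implementation the four cost functions would depend on the sampled transitions $d^{m}$ and on $z$, and their gradients would come from automatic differentiation rather than a closed-form toy loss.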