Inputs:
: Teacher’s parametric networks for the policy and Q functions, both conditioned on a combination of visual observations and internal states.
: Student’s parametric networks for the policy and Q functions, conditioned solely on visual observations.
: Image augmentation method inherited from DrQ-v2.
: Parametric network for the image encoder, numbers of training steps for transfer learning and reinforcement learning, mini-batch size, learning rate, and target update rate.
1: procedure Main
2:   Get the encoded state for the initial observations
3:   for t := 0 to the number of transfer-learning steps do ▹ Transfer learning part
4:     Sample an action from the teacher’s policy
5:     Get the consecutive observations
6:     Get the encoded state for the next timestep
7:     Store the transition in the replay buffer
8:     UpdateCriticTransfer() ▹ Update critics for teacher and student
9:     UpdateActorTransfer() ▹ Update policies for teacher and student
10:   end for ▹ End of Phase 1 (transfer learning phase)
11:   Get the encoded state for the initial observations
12:   for t := 0 to the number of reinforcement-learning steps do ▹ Reinforcement learning part
13:     Perform a traditional reinforcement-learning update
14:   end for ▹ End of the entire training process
15: end procedure
16: procedure UpdateCriticTransfer()
17:   Sample a mini-batch of B transitions
18:   Apply data augmentation and encode the observations
19:   Sample actions for the next timestep
20:   Compute the critic loss using Equation (1)
21:   Update the encoder
22:   Update the Q functions
23:   Perform a soft update of the target Q functions
24: end procedure
25: procedure UpdateActorTransfer()
26:   Sample a mini-batch of B observations and internal states
27:   Apply data augmentation and encode the observations
28:   Sample actions for the current timestep
29:   Compute the policy losses using Equations (3) and (4)
30:   Update the teacher’s and the student’s policies
31: end procedure
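
To make the control flow of procedure Main concrete, a minimal Python sketch of the two training phases is given below. It is an illustration only: the environment interface (reset/step returning both the visual observation and the internal state), the identifiers (teacher_act, replay_buffer, aug, num_transfer_steps, num_rl_steps, and so on), and the episode-reset handling are assumptions rather than details specified by the algorithm; the two update callables correspond to the procedures sketched afterwards.

```python
# Hypothetical two-phase training loop mirroring procedure Main.
# All identifiers are assumed names, not ones defined in the paper.

def train(env, encoder, teacher_act, replay_buffer, aug,
          update_critic_transfer, update_actor_transfer,
          num_transfer_steps, num_rl_steps):
    obs, internal = env.reset()              # assumed: env returns visual obs + internal state
    state = encoder(aug(obs))                # line 2: encode the initial observations

    # Phase 1: transfer learning (lines 3-10); actions come from the teacher's policy.
    for _ in range(num_transfer_steps):
        action = teacher_act(state, internal)                      # line 4
        next_obs, next_internal, reward, done = env.step(action)   # line 5
        next_state = encoder(aug(next_obs))                        # line 6
        replay_buffer.add(obs, internal, action, reward,
                          next_obs, next_internal, done)           # line 7
        update_critic_transfer()                                   # line 8
        update_actor_transfer()                                    # line 9
        obs, internal, state = next_obs, next_internal, next_state
        if done:                                                   # assumed episode handling
            obs, internal = env.reset()
            state = encoder(aug(obs))

    # Phase 2: conventional reinforcement learning (lines 11-14).
    obs, internal = env.reset()
    state = encoder(aug(obs))                                      # line 11
    for _ in range(num_rl_steps):
        pass  # standard actor-critic update step (line 13)
```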
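A PyTorch-style sketch of UpdateCriticTransfer follows. It assumes one-step TD targets (clipped double-Q is omitted for brevity), so the target computation is only a stand-in for the paper’s Equation (1); the actors are assumed to return action distributions, the network, optimizer, and batch names are likewise assumed, and the encoder is trained through the critic loss while the target-side encoding is detached, in the style of DrQ-v2.

```python
import torch
import torch.nn.functional as F

def soft_update(target_net, online_net, tau):
    # Polyak averaging: target <- (1 - tau) * target + tau * online (line 23).
    for p_tgt, p in zip(target_net.parameters(), online_net.parameters()):
        p_tgt.data.lerp_(p.data, tau)

def update_critic_transfer(encoder, teacher_actor, student_actor,
                           teacher_critic, student_critic,
                           teacher_critic_tgt, student_critic_tgt,
                           encoder_opt, teacher_critic_opt, student_critic_opt,
                           batch, aug, gamma, tau):
    obs, internal, action, reward, next_obs, next_internal, not_done = batch  # line 17

    # Data augmentation and encoding (line 18); gradients reach the encoder
    # only through the critic loss.
    state = encoder(aug(obs))
    with torch.no_grad():
        next_state = encoder(aug(next_obs))
        # Sample actions for the next timestep (line 19).
        next_action_t = teacher_actor(next_state, next_internal).sample()
        next_action_s = student_actor(next_state).sample()
        # Placeholder one-step TD targets standing in for Equation (1).
        target_t = reward + gamma * not_done * teacher_critic_tgt(
            next_state, next_internal, next_action_t)
        target_s = reward + gamma * not_done * student_critic_tgt(
            next_state, next_action_s)

    # Critic losses for the teacher and the student.
    critic_loss = (F.mse_loss(teacher_critic(state, internal, action), target_t)
                   + F.mse_loss(student_critic(state, action), target_s))

    # Update the encoder and both Q functions (lines 21-22).
    for opt in (encoder_opt, teacher_critic_opt, student_critic_opt):
        opt.zero_grad()
    critic_loss.backward()
    for opt in (encoder_opt, teacher_critic_opt, student_critic_opt):
        opt.step()

    # Soft update of the target Q functions (line 23).
    soft_update(teacher_critic_tgt, teacher_critic, tau)
    soft_update(student_critic_tgt, student_critic, tau)
```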
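UpdateActorTransfer can be sketched in the same spirit. The objectives shown here (the teacher maximizes its Q value; the student maximizes its Q value plus a term that pulls its action toward the teacher’s) are only placeholders for the paper’s Equations (3) and (4), all identifiers are assumed, and the encoded state is detached so the actor losses do not update the encoder.

```python
import torch
import torch.nn.functional as F

def update_actor_transfer(encoder, teacher_actor, student_actor,
                          teacher_critic, student_critic,
                          teacher_actor_opt, student_actor_opt,
                          batch, aug):
    obs, internal = batch                                   # line 26

    # Data augmentation and encoding (line 27); detached so the actor losses
    # do not propagate into the encoder.
    with torch.no_grad():
        state = encoder(aug(obs))

    # Sample (reparameterized) actions for the current timestep (line 28).
    action_t = teacher_actor(state, internal).rsample()
    action_s = student_actor(state).rsample()

    # Placeholder losses standing in for Equations (3) and (4):
    # the teacher maximizes its Q value; the student additionally imitates the teacher.
    teacher_loss = -teacher_critic(state, internal, action_t).mean()
    student_loss = (-student_critic(state, action_s).mean()
                    + F.mse_loss(action_s, action_t.detach()))

    # Update the teacher's and the student's policies (line 30).
    teacher_actor_opt.zero_grad()
    teacher_loss.backward()
    teacher_actor_opt.step()

    student_actor_opt.zero_grad()
    student_loss.backward()
    student_actor_opt.step()
```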