Inputs:
: Teacher’s parametric networks for the policy and Q functions, both conditioned on a combination of visual observations and internal states.
: Student’s parametric networks for the policy and Q functions, conditioned solely on visual observations.
: Image augmentation method inherited from DrQ-v2.
: Parametric network for the image encoder, numbers of training steps for transfer learning and reinforcement learning, mini-batch size, learning rate, and target update rate.
1: procedure Main
2:   Get the encoded state for the initial observations
3:   for t := 0 to the number of transfer-learning steps do ▹ Transfer learning part
4:     Sample an action from the teacher’s policy
5:     Get the consecutive observations
6:     Get the encoded state for the next timestep
7:     Store the transition in the replay buffer
8:     UpdateCriticTransfer() ▹ Update critics for teacher and student
9:     UpdateActorTransfer() ▹ Update policies for teacher and student
10:   end for ▹ End of Phase 1 (transfer learning phase)
11:   Get the encoded state for the initial observations
12:   for t := 0 to the number of reinforcement-learning steps do ▹ Reinforcement learning part
13:     Perform a traditional reinforcement-learning update
14:   end for ▹ End of the entire training process
15: end procedure
16: procedure UpdateCriticTransfer()
17:   Sample a mini-batch of B transitions
18:   Apply data augmentation and encode the observations
19:   Sample actions for the next timestep
20:   Compute the critic loss using Equation (1)
21:   Update the encoder
22:   Update the Q functions
23:   Perform a soft update of the target Q functions
24: end procedure
25: procedure UpdateActorTransfer()
26:   Sample a mini-batch of B observations and internal states
27:   Apply data augmentation and encode the observations
28:   Sample actions for the current timestep
29:   Compute the policy losses using Equations (3) and (4)
30:   Update the teacher’s and the student’s policies
31: end procedure
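
To make the control flow of procedure Main concrete, a minimal Python sketch of the two training phases is given below. It is an illustration only: the environment interface (reset/step returning both the visual observation and the internal state), the identifiers (teacher_act, replay_buffer, aug, num_transfer_steps, num_rl_steps, and so on), and the episode-reset handling are assumptions rather than details specified by the algorithm; the two update callables correspond to the procedures sketched afterwards.

```python
# Hypothetical two-phase training loop mirroring procedure Main.
# All identifiers are assumed names, not ones defined in the paper.

def train(env, encoder, teacher_act, replay_buffer, aug,
          update_critic_transfer, update_actor_transfer,
          num_transfer_steps, num_rl_steps):
    obs, internal = env.reset()              # assumed: env returns visual obs + internal state
    state = encoder(aug(obs))                # line 2: encode the initial observations

    # Phase 1: transfer learning (lines 3-10); actions come from the teacher's policy.
    for _ in range(num_transfer_steps):
        action = teacher_act(state, internal)                      # line 4
        next_obs, next_internal, reward, done = env.step(action)   # line 5
        next_state = encoder(aug(next_obs))                        # line 6
        replay_buffer.add(obs, internal, action, reward,
                          next_obs, next_internal, done)           # line 7
        update_critic_transfer()                                   # line 8
        update_actor_transfer()                                    # line 9
        obs, internal, state = next_obs, next_internal, next_state
        if done:                                                   # assumed episode handling
            obs, internal = env.reset()
            state = encoder(aug(obs))

    # Phase 2: conventional reinforcement learning (lines 11-14).
    obs, internal = env.reset()
    state = encoder(aug(obs))                                      # line 11
    for _ in range(num_rl_steps):
        pass  # standard actor-critic update step (line 13)
```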
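A PyTorch-style sketch of UpdateCriticTransfer follows. It assumes one-step TD targets (clipped double-Q is omitted for brevity), so the target computation is only a stand-in for the paper’s Equation (1); the actors are assumed to return action distributions, the network, optimizer, and batch names are likewise assumed, and the encoder is trained through the critic loss while the target-side encoding is detached, in the style of DrQ-v2.

```python
import torch
import torch.nn.functional as F

def soft_update(target_net, online_net, tau):
    # Polyak averaging: target <- (1 - tau) * target + tau * online (line 23).
    for p_tgt, p in zip(target_net.parameters(), online_net.parameters()):
        p_tgt.data.lerp_(p.data, tau)

def update_critic_transfer(encoder, teacher_actor, student_actor,
                           teacher_critic, student_critic,
                           teacher_critic_tgt, student_critic_tgt,
                           encoder_opt, teacher_critic_opt, student_critic_opt,
                           batch, aug, gamma, tau):
    obs, internal, action, reward, next_obs, next_internal, not_done = batch  # line 17

    # Data augmentation and encoding (line 18); gradients reach the encoder
    # only through the critic loss.
    state = encoder(aug(obs))
    with torch.no_grad():
        next_state = encoder(aug(next_obs))
        # Sample actions for the next timestep (line 19).
        next_action_t = teacher_actor(next_state, next_internal).sample()
        next_action_s = student_actor(next_state).sample()
        # Placeholder one-step TD targets standing in for Equation (1).
        target_t = reward + gamma * not_done * teacher_critic_tgt(
            next_state, next_internal, next_action_t)
        target_s = reward + gamma * not_done * student_critic_tgt(
            next_state, next_action_s)

    # Critic losses for the teacher and the student.
    critic_loss = (F.mse_loss(teacher_critic(state, internal, action), target_t)
                   + F.mse_loss(student_critic(state, action), target_s))

    # Update the encoder and both Q functions (lines 21-22).
    for opt in (encoder_opt, teacher_critic_opt, student_critic_opt):
        opt.zero_grad()
    critic_loss.backward()
    for opt in (encoder_opt, teacher_critic_opt, student_critic_opt):
        opt.step()

    # Soft update of the target Q functions (line 23).
    soft_update(teacher_critic_tgt, teacher_critic, tau)
    soft_update(student_critic_tgt, student_critic, tau)
```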
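UpdateActorTransfer can be sketched in the same spirit. The objectives shown here (the teacher maximizes its Q value; the student maximizes its Q value plus a term that pulls its action toward the teacher’s) are only placeholders for the paper’s Equations (3) and (4), all identifiers are assumed, and the encoded state is detached so the actor losses do not update the encoder.

```python
import torch
import torch.nn.functional as F

def update_actor_transfer(encoder, teacher_actor, student_actor,
                          teacher_critic, student_critic,
                          teacher_actor_opt, student_actor_opt,
                          batch, aug):
    obs, internal = batch                                   # line 26

    # Data augmentation and encoding (line 27); detached so the actor losses
    # do not propagate into the encoder.
    with torch.no_grad():
        state = encoder(aug(obs))

    # Sample (reparameterized) actions for the current timestep (line 28).
    action_t = teacher_actor(state, internal).rsample()
    action_s = student_actor(state).rsample()

    # Placeholder losses standing in for Equations (3) and (4):
    # the teacher maximizes its Q value; the student additionally imitates the teacher.
    teacher_loss = -teacher_critic(state, internal, action_t).mean()
    student_loss = (-student_critic(state, action_s).mean()
                    + F.mse_loss(action_s, action_t.detach()))

    # Update the teacher's and the student's policies (line 30).
    teacher_actor_opt.zero_grad()
    teacher_loss.backward()
    teacher_actor_opt.step()

    student_actor_opt.zero_grad()
    student_loss.backward()
    student_actor_opt.step()
```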