2024 Jul 12;24(14):4513. doi: 10.3390/s24144513
Algorithm 1: Full training process for the proposed ISSA

Inputs:

- $\pi_{\phi^T}, Q_{\theta_1^T}, Q_{\theta_2^T}, Q_{\bar{\theta}_1^T}, Q_{\bar{\theta}_2^T}$: the teacher's parametric networks for the policy and Q functions, all conditioned on the combination of the visual observation and the internal states.
- $\pi_{\phi^S}, Q_{\theta_1^S}, Q_{\theta_2^S}, Q_{\bar{\theta}_1^S}, Q_{\bar{\theta}_2^S}$: the student's parametric networks for the policy and Q functions, all conditioned solely on the visual observation.
- $\mathrm{aug}$: image augmentation method inherited from DrQ-v2.
- $f_\xi, T_s, T_r, B, \alpha, \tau$: the parametric network for the image encoder, the numbers of training steps for transfer learning and reinforcement learning, the mini-batch size, the learning rate, and the target update rate.
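The $\mathrm{aug}$ input refers to DrQ-v2's random-shift image augmentation. A minimal NumPy sketch is given below; the pad size of 4 pixels and the edge-replication padding are assumptions for illustration, not details taken from this paper.

```python
import numpy as np

def random_shift(imgs: np.ndarray, pad: int = 4) -> np.ndarray:
    """Random-shift augmentation in the style of DrQ-v2: pad each image by
    `pad` pixels on every side (edge replication), then crop back to the
    original size at a random offset per image."""
    n, c, h, w = imgs.shape
    padded = np.pad(imgs, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(imgs)
    for i in range(n):
        top = np.random.randint(0, 2 * pad + 1)    # vertical shift in [0, 2*pad]
        left = np.random.randint(0, 2 * pad + 1)   # horizontal shift in [0, 2*pad]
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out
```

The same augmented crop is applied before encoding in both UpdateCriticTransfer and UpdateActorTransfer.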

1:  procedure Main
2:      $s_0 \leftarrow f_\xi(x_0)$                 ▹ Get encoded state for the initial observation
3:      for $t := 0$ to $T_s$ do                 ▹ Transfer learning part
4:          $a_t \sim \pi_{\phi^T}(\cdot \mid s_t, i_t)$                 ▹ Sample action from the teacher's policy
5:          $x_{t+1} \sim p(\cdot \mid i_t, a_t)$                 ▹ Get the consecutive observation
6:          $s_{t+1} \leftarrow f_\xi(x_{t+1})$                 ▹ Get encoded state for timestep $t+1$
7:          $\mathcal{D} \leftarrow \mathcal{D} \cup \{(x_t, i_t, a_t, r(i_t, a_t), x_{t+1}, i_{t+1})\}$                 ▹ Store transition
8:          UpdateCriticTransfer($\mathcal{D}$)                 ▹ Update critics for teacher and student
9:          UpdateActorTransfer($\mathcal{D}$)                 ▹ Update policies for teacher and student
10:     end for                 ▹ End of Phase 1 (transfer learning phase)
11:     $s_0 \leftarrow f_\xi(x_0)$                 ▹ Get encoded state for the initial observation
12:     for $t := 0$ to $T_r$ do                 ▹ Reinforcement learning part
13:         Traditional reinforcement learning based on $\pi_{\phi^S}, Q_{\theta_1^S}, Q_{\theta_2^S}, Q_{\bar{\theta}_1^S}, Q_{\bar{\theta}_2^S}$
14:     end for                 ▹ End of the entire training process
15: end procedure
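The control flow of Main can be sketched in Python as follows. `ToyEnv`, the constant action, and the plain-list replay buffer are stand-ins for illustration only; they are not interfaces from the paper, and the two update procedures are elided to comments.

```python
import random

class ToyEnv:
    """Toy stand-in environment (not from the paper): reset/step return a
    visual observation x, an internal state i, and (for step) a reward r."""
    def reset(self):
        return 0.0, 0.0                                   # x_0, i_0
    def step(self, a):
        return random.random(), random.random(), 1.0      # x_{t+1}, i_{t+1}, r

def train(env, T_s, T_r, encoder=lambda x: x):
    replay = []
    # Phase 1: transfer learning, actions come from the teacher's policy.
    x, i = env.reset()
    s = encoder(x)                                        # s_0 <- f_xi(x_0)
    for _ in range(T_s):
        a = 0.0                                           # a_t ~ pi_phi_T(.|s_t, i_t) (stubbed)
        x2, i2, r = env.step(a)
        replay.append((x, i, a, r, x2, i2))               # D <- D U {(x_t, i_t, a_t, r, x_{t+1}, i_{t+1})}
        # UpdateCriticTransfer(replay); UpdateActorTransfer(replay) go here.
        x, i, s = x2, i2, encoder(x2)
    # Phase 2: conventional RL using only the student's vision-based networks.
    x, _ = env.reset()
    for _ in range(T_r):
        pass                                              # standard actor-critic updates on the student
    return replay

buf = train(ToyEnv(), T_s=5, T_r=3)
```

The key structural point the sketch preserves is that the replay buffer is filled only by teacher-driven interaction in Phase 1, while Phase 2 reuses nothing but the student's networks and the shared encoder.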

16: procedure UpdateCriticTransfer($\mathcal{D}$)
17:     $\{(x_t, i_t, a_t, r_{t:t+n-1}, x_{t+n}, i_{t+n})\} \sim \mathcal{D}$                 ▹ Sample a mini-batch of $B$ transitions
18:     $s_t, s_{t+n} \leftarrow f_\xi(\mathrm{aug}(x_t)), f_\xi(\mathrm{aug}(x_{t+n}))$                 ▹ Data augmentation and encoding
19:     $a_{t+n} \sim \pi_{\phi^T}(\cdot \mid s_{t+n}, i_{t+n})$                 ▹ Sample action for timestep $t+n$
20:     Compute $\mathcal{L}_{\theta_1^T,\xi}, \mathcal{L}_{\theta_2^T,\xi}, \mathcal{L}_{\theta_1^S,\xi}, \mathcal{L}_{\theta_2^S,\xi}$ using Equation (1)
21:     $\xi \leftarrow \xi - \alpha \nabla_\xi \big( \mathcal{L}_{\theta_1^T,\xi} + \mathcal{L}_{\theta_2^T,\xi} + \mathcal{L}_{\theta_1^S,\xi} + \mathcal{L}_{\theta_2^S,\xi} \big)$                 ▹ Update encoder
22:     $\theta_k^i \leftarrow \theta_k^i - \alpha \nabla_{\theta_k^i} \mathcal{L}_{\theta_k^i,\xi}, \ \forall k \in \{1,2\},\ i \in \{T,S\}$                 ▹ Update Q functions
23:     $\bar{\theta}_k^i \leftarrow (1-\tau)\bar{\theta}_k^i + \tau \theta_k^i, \ \forall k \in \{1,2\},\ i \in \{T,S\}$                 ▹ Soft update of target Q functions
24: end procedure
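The soft target update in step 23 is standard Polyak averaging, applied to every target Q network ($k \in \{1,2\}$, $i \in \{T,S\}$). A NumPy sketch, with arrays standing in for the network parameters:

```python
import numpy as np

def soft_update(theta_bar: np.ndarray, theta: np.ndarray, tau: float) -> np.ndarray:
    """Polyak averaging: theta_bar <- (1 - tau) * theta_bar + tau * theta.
    Small tau keeps the target close to its previous value, which stabilizes
    the bootstrapped Q targets."""
    return (1.0 - tau) * theta_bar + tau * theta

# Example: a small tau moves the target only slightly toward the online params.
target = np.zeros(4)
online = np.ones(4)
mixed = soft_update(target, online, tau=0.05)
```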

25: procedure UpdateActorTransfer($\mathcal{D}$)
26:     $\{(x_t, i_t)\} \sim \mathcal{D}$                 ▹ Sample a mini-batch of $B$ observations and internal states
27:     $s_t \leftarrow f_\xi(\mathrm{aug}(x_t))$                 ▹ Data augmentation and encoding
28:     $a_t \sim \pi_{\phi^T}(\cdot \mid s_t, i_t)$                 ▹ Sample action for timestep $t$
29:     Compute $\mathcal{L}_{\phi^T}, \mathcal{L}_{\phi^S}$ using Equations (3) and (4)
30:     $\phi^i \leftarrow \phi^i - \alpha \nabla_{\phi^i} \mathcal{L}_{\phi^i}, \ \forall i \in \{T,S\}$                 ▹ Update teacher's and student's policies
31: end procedure
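Step 30 is a plain gradient step on both policies with the shared learning rate $\alpha$. A plain-Python sketch, with lists of floats standing in for the parameter vectors and the gradients assumed to be already computed from the losses in step 29:

```python
def update_policies(params: dict, grads: dict, alpha: float) -> dict:
    """phi^i <- phi^i - alpha * grad L_phi^i, for i in {T, S}.
    `params` and `grads` map the network index ("T" or "S") to a list of
    parameter values and their corresponding loss gradients."""
    return {i: [p - alpha * g for p, g in zip(params[i], grads[i])]
            for i in ("T", "S")}

# Example: one step with alpha = 1.0 on toy parameter vectors.
new = update_policies({"T": [1.0, 2.0], "S": [0.5]},
                      {"T": [0.1, 0.2], "S": [0.5]},
                      alpha=1.0)
```

In practice each policy would use a proper optimizer, but the key point of step 30 is that the teacher and student policies are updated simultaneously, each from its own loss.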