2023 Jan 9;23(2):762. doi: 10.3390/s23020762
Algorithm 1. VIB-based meta-reinforcement learning training algorithm.
  • 1:

    Input: $D$: training data set; $\eta_\theta, \eta_\varphi, \eta_\psi, \eta_\omega$: learning rates; $\{T_m\}_{m=1,\dots,M} \sim p(T)$: meta-training task set.

  • 2:

    Output: policy network $\pi_\theta$ and latent space encoder $E_\omega$.

  • 3:

    Setting the parameters of the target network: $\bar{\psi} \leftarrow \psi$; initializing the sample set $D_m$ for each task

  • 4:

    for each epoch do

  • 5:

         for each task $T_m$ do

  • 6:

              Initializing each trajectory: $e^{T_m} = \{\}$

  • 7:

              for $k = 1, \dots, N$ do

  • 8:

                 Latent space inference: $z \sim E_\omega(z \mid e^{T_m})$

  • 9:

                 The current policy $\pi_\theta(a \mid s, z)$ interacts with the task and obtains samples $D_m$

  • 10:

                Updating $e^{T_m} = \{(s_j, a_j, s'_j, r_j)\}_{j:1\dots K} \sim D_m$

  • 11:

             end for

  • 12:

        end for

  • 13:

        for each training step do

  • 14:

             for each task $T_m$ do

  • 15:

               Sampling from the training data set: $e^{T_m}, d_m \sim D_m$

  • 16:

               Latent space inference: $z \sim E_\omega(z \mid e^{T_m})$

  • 17:

               Computing the action-state value function cost: $J_m^{Q} = J_E^{\varphi}(d_m, z)$

  • 18:

               Computing the state value function cost: $J_m^{V} = J_E^{\psi}(d_m, z)$

  • 19:

               Computing the policy cost function: $J_m^{\pi} = J_E^{\theta}(d_m, z)$

  • 20:

               Computing the latent space encoder cost function: $J_m^{E} = J_E^{\omega}(d_m, z)$

  • 21:

             end for

  • 22:

             Updating the action-state value function network: $\varphi_{t+1} \leftarrow \varphi_t - \eta_\varphi \hat{\nabla}_\varphi \sum_m J_m^{Q}$

  • 23:

             Updating the state value function network: $\psi_{t+1} \leftarrow \psi_t - \eta_\psi \hat{\nabla}_\psi \sum_m J_m^{V}$

  • 24:

             Updating the policy network: $\theta_{t+1} \leftarrow \theta_t - \eta_\theta \hat{\nabla}_\theta \sum_m J_m^{\pi}$

  • 25:

             Updating the latent space encoder network: $\omega_{t+1} \leftarrow \omega_t - \eta_\omega \hat{\nabla}_\omega \sum_m J_m^{E}$

  • 26:

        end for

  • 27:

        Updating the target network: $\bar{\psi} \leftarrow \psi$

  • 28:

    end for
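
The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative skeleton under strong assumptions, not the authors' implementation: the four networks are reduced to scalar parameters, and `infer_z`, `rollout`, `toy_cost_grad`, and `train` are hypothetical names. The cost gradient is a placeholder quadratic that ignores the sampled batch and latent variable; a real implementation would compute the actual value, policy, and encoder costs from $d_m$ and $z$. What the sketch does show is the ordering of the steps: per-task data collection with context re-sampling, per-task cost evaluation, one gradient step on the costs summed over tasks, and the target-network copy at the end of each epoch.

```python
import random

def infer_z(omega, context, rng):
    """Toy stand-in for z ~ E_omega(z | e): noisy, omega-scaled mean context reward."""
    mean_r = sum(r for (_, _, _, r) in context) / len(context) if context else 0.0
    return omega * mean_r + 0.01 * rng.gauss(0.0, 1.0)

def rollout(goal, theta, z, rng, K):
    """Toy stand-in for step 9: K transitions (s, a, s_next, r) on a 1-D task."""
    s, transitions = 0.0, []
    for _ in range(K):
        a = theta * z + 0.1 * rng.gauss(0.0, 1.0)  # toy policy pi_theta(a | s, z)
        s_next = s + a
        transitions.append((s, a, s_next, -abs(goal - s_next)))
        s = s_next
    return transitions

def toy_cost_grad(param, batch, z, goal):
    """Placeholder (assumption): gradient of 0.5 * (param - goal)^2.
    The batch and z are accepted but ignored; real costs would depend on them."""
    return param - goal

def train(goals, epochs=3, steps=20, N=4, K=5, lr=0.1, seed=0):
    rng = random.Random(seed)
    p = {"theta": 0.5, "phi": 0.0, "psi": 0.0, "omega": 0.5}
    psi_bar = p["psi"]                       # step 3: target parameters <- psi
    buffers = [[] for _ in goals]            # step 3: one sample set D_m per task
    for _ in range(epochs):                  # step 4
        for m, goal in enumerate(goals):     # steps 5-12: data collection
            context = []                     # step 6: empty context e
            for _ in range(N):               # step 7
                z = infer_z(p["omega"], context, rng)                    # step 8
                buffers[m].extend(rollout(goal, p["theta"], z, rng, K))  # step 9
                context = rng.sample(buffers[m], K)                      # step 10
        for _ in range(steps):               # steps 13-26: training
            grads = {name: 0.0 for name in p}
            for m, goal in enumerate(goals): # steps 14-21
                context = rng.sample(buffers[m], K)    # step 15: e ~ D_m
                batch = rng.sample(buffers[m], K)      # step 15: d_m ~ D_m
                z = infer_z(p["omega"], context, rng)  # step 16
                for name in p:                         # steps 17-20
                    grads[name] += toy_cost_grad(p[name], batch, z, goal)
            for name in p:                   # steps 22-25: step on the summed gradient
                p[name] -= lr * grads[name]
        psi_bar = p["psi"]                   # step 27: refresh the target parameters
    return p, psi_bar, buffers

params, psi_bar, buffers = train([0.5, 1.5])
```

With the placeholder costs, every parameter simply settles at the mean of the two task goals, which makes the summed-over-tasks update easy to check by hand. The step size matters even in this toy: the update multiplies the error by $1 - \eta M$ per step, so it only contracts when $\eta M < 2$.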