[Preprint]. 2023 Mar 15:arXiv:2302.02477v3. [Version 3]

Algorithm 1.

Train DLSM.

Input: Model weights ψ, ϕ, experience replay buffer μ, and learning rate α.
Begin:
1: Initialize ψ, ϕ
2: for iter in 1 : max_iter do
3:  Sample a trajectory $(s_0, a_0, r_0, s_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T) \sim \mu$
4:  $z_0^{\phi} \sim q_{\phi}(z_0 \mid s_0)$
5:  $z_0^{\psi} \sim p_{\psi}(z_0)$
6:  Run the forward pass of DLSM following (13) and (15) for t = 1:T, and collect all variables needed to evaluate every term inside the expectation of the ELBO; the resulting estimate is denoted $\widetilde{\mathrm{ELBO}}$.
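7:  $\psi \leftarrow \psi + \alpha \nabla_{\psi} \widetilde{\mathrm{ELBO}}$
8:  $\phi \leftarrow \phi + \alpha \nabla_{\phi} \widetilde{\mathrm{ELBO}}$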
9: end for
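
The structure of Algorithm 1 (sample from the buffer, draw a latent, form a sampled ELBO estimate, take a gradient-ascent step on both parameter sets) can be illustrated on a toy problem. The sketch below is not DLSM: it assumes a one-step model with scalar parameters, a prior p_ψ(z0) = N(ψ, 1), a posterior q_ϕ(z0 | s0) = N(ϕ·s0, 1), and a likelihood p(s0 | z0) = N(z0, 1), none of which come from the paper. It omits the recurrent pass over t = 1:T and the prior sample of step 5. The function names `elbo_estimate`, `elbo_grads`, and `train` are hypothetical.

```python
import numpy as np

def elbo_estimate(psi, phi, s0, eps):
    """Single-sample ELBO estimate for the assumed toy model (additive constants dropped)."""
    z = phi * s0 + eps                 # reparameterized draw z0 ~ q_phi(z0 | s0) = N(phi*s0, 1)
    log_lik = -0.5 * (s0 - z) ** 2     # log p(s0 | z0), unit variance (assumption)
    log_prior = -0.5 * (z - psi) ** 2  # log p_psi(z0) = N(psi, 1) (assumption)
    log_q = -0.5 * eps ** 2            # log q_phi(z0 | s0) evaluated at the draw
    return log_lik + log_prior - log_q

def elbo_grads(psi, phi, s0, eps):
    """Analytic gradients of elbo_estimate with respect to psi and phi."""
    z = phi * s0 + eps
    grad_psi = z - psi                       # d/dpsi of the log-prior term
    grad_phi = ((s0 - z) - (z - psi)) * s0   # chain rule through z = phi*s0 + eps
    return grad_psi, grad_phi

def train(buffer, alpha=0.05, max_iter=2000, seed=0):
    """Gradient ascent on the sampled ELBO, mirroring steps 1-3, 4, and 7-8 of Algorithm 1."""
    rng = np.random.default_rng(seed)
    psi, phi = 0.0, 0.0                          # step 1: initialize parameters
    for _ in range(max_iter):                    # step 2: main loop
        s0 = buffer[rng.integers(len(buffer))]   # step 3: sample from the replay buffer
        eps = rng.standard_normal()              # step 4: noise for the posterior draw
        g_psi, g_phi = elbo_grads(psi, phi, s0, eps)
        psi += alpha * g_psi                     # step 7: ascent step on psi
        phi += alpha * g_phi                     # step 8: ascent step on phi
    return psi, phi

if __name__ == "__main__":
    toy_buffer = np.array([1.0, 2.0, 3.0])       # hypothetical stand-in for the buffer
    print(train(toy_buffer))
```

Both updates add the gradient rather than subtract it because the ELBO is maximized. In a full implementation the gradients would normally come from automatic differentiation of the sampled ELBO, not from hand-derived formulas.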