[Preprint]. 2023 Mar 15:arXiv:2302.02477v3. [Version 3]

Algorithm 1. Train DLSM.

Input: Model weights $ψ$, $ϕ$, experience replay buffer $ℰ^{μ}$, and learning rate $α$.
Begin:
1:	Initialize $ψ$, $ϕ$
2:	for iter in 1 : max_iter do
3:	Sample a trajectory $[(s_{0}, a_{0}, r_{0}, s_{1}), \dots, (s_{T - 1}, a_{T - 1}, r_{T - 1}, s_{T})] \sim ℰ^{μ}$
4:	$z_{0}^{ϕ} \sim q_{ϕ} (z_{0} ∣ s_{0})$
5:	$z_{0}^{ψ} \sim p_{ψ} (z_{0})$
6:	Run the forward pass of DLSM following (13) and (15) for $t = 1 : T$, and collect all variables needed to evaluate the terms inside the expectation in $ℒ_{\mathrm{ELBO}}$; denote the resulting quantity by ${\tilde{ℒ}}_{\mathrm{ELBO}}$.
7:	$ψ \leftarrow ψ + α \nabla_{ψ} {\tilde{ℒ}}_{\mathrm{ELBO}}$
8:	$ϕ \leftarrow ϕ + α \nabla_{ϕ} {\tilde{ℒ}}_{\mathrm{ELBO}}$
9:	end for
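
The loop structure of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the DLSM implementation: the forward pass of (13) and (15) over $t = 1 : T$ is replaced by a hypothetical scalar toy model that uses only the initial state, with an assumed posterior $q_{ϕ}(z_{0} ∣ s_{0}) = \mathcal{N}(ϕ s_{0}, 1)$, an assumed prior $p_{ψ}(z_{0}) = \mathcal{N}(ψ, 1)$, and an assumed likelihood $\mathcal{N}(s_{0}; z_{0}, 1)$. The names `train` and `toy_elbo_grads` are illustrative. Because the toy uses a closed-form KL term, the prior sample of step 5 is not drawn.

```python
import random


def train(buffer, elbo_grads, psi, phi, alpha, max_iter, rng):
    """Skeleton of the training loop: sample a trajectory, estimate the
    ELBO gradients, and take a gradient-ascent step on both parameters."""
    for _ in range(max_iter):
        trajectory = rng.choice(buffer)                        # step 3
        g_psi, g_phi = elbo_grads(trajectory, psi, phi, rng)   # steps 4-6
        psi = psi + alpha * g_psi                              # step 7
        phi = phi + alpha * g_phi                              # step 8
    return psi, phi


def toy_elbo_grads(trajectory, psi, phi, rng):
    """Hypothetical stand-in for steps 4-6 (NOT the DLSM forward pass).

    Single-sample ELBO estimate, up to additive constants:
        -0.5 * (s0 - z0)**2 - 0.5 * (phi * s0 - psi)**2
    where the first term is the assumed log-likelihood and the second is
    the closed-form KL between the two unit-variance Gaussians.
    """
    s0 = trajectory[0][0]            # initial state of the first transition
    eps = rng.gauss(0.0, 1.0)
    z0 = phi * s0 + eps              # reparameterized sample from q_phi(z0 | s0)
    resid = s0 - z0
    gap = phi * s0 - psi
    g_phi = resid * s0 - gap * s0    # d(estimate)/d(phi), holding eps fixed
    g_psi = gap                      # d(estimate)/d(psi)
    return g_psi, g_phi


# Toy replay buffer: one-step trajectories (s, a, r, s_next), all with s0 = 2.0.
rng = random.Random(0)
buffer = [[(2.0, 0, 0.0, 2.0)] for _ in range(4)]
psi, phi = train(buffer, toy_elbo_grads, psi=0.0, phi=0.0,
                 alpha=0.01, max_iter=5000, rng=rng)
```

In this toy setting the expected ELBO is maximized at $ϕ = 1$ and $ψ = s_{0}$, so the stochastic updates should fluctuate around those values. Passing the gradient routine in as a callable keeps the loop itself identical to steps 3, 7, and 8, whatever model is used for steps 4 to 6.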