Algorithm 1.
Input: Model weights , , experience replay buffer , and learning rate .
Begin:
1: | Initialize , |
2: | for iter in 1 : max_iter do |
3: | Sample a trajectory |
4: | |
5: | |
6: | Run the forward pass of DLSM following (13) and (15) for , and collect all variables needed to evaluate all terms within the expectation in , which is denoted as . |
7: | |
8: | |
9: | end for |
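
The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the loop structure (initialize, sample a trajectory from the replay buffer, run a forward pass and collect the quantities the objective needs, then update the weights), not the DLSM method itself. The expressions for steps 4, 5, 7, and 8 are not reproduced here, so the sketch swaps in stand-ins: a scalar linear model for the forward pass, mean squared error for the objective, and plain gradient descent for the update. It also uses a single weight rather than the two sets listed in the input. All function names, the toy buffer, and the hyperparameter values are hypothetical.

```python
import random


def forward(weights, trajectory):
    """Stand-in forward pass (NOT the DLSM equations): y_hat = w * x.

    Returns, per time step, the quantities the stand-in objective needs:
    (input, target, prediction).
    """
    w = weights["w"]
    return [(x, y, w * x) for x, y in trajectory]


def loss_and_grad(collected):
    """Stand-in objective: mean squared error and its gradient w.r.t. w."""
    n = len(collected)
    loss = sum((pred - y) ** 2 for _, y, pred in collected) / n
    grad = sum(2.0 * (pred - y) * x for x, y, pred in collected) / n
    return loss, grad


def train(replay_buffer, learning_rate, max_iter, seed=0):
    rng = random.Random(seed)
    weights = {"w": 0.0}                          # step 1: initialize
    for _ in range(max_iter):                     # step 2: main loop
        trajectory = rng.choice(replay_buffer)    # step 3: sample a trajectory
        collected = forward(weights, trajectory)  # step 6: forward pass, collect variables
        _, grad = loss_and_grad(collected)        # assumed: evaluate objective gradient
        weights["w"] -= learning_rate * grad      # assumed: gradient descent update
    return weights


# Hypothetical toy replay buffer: two trajectories of (x, y) pairs with y = 3x.
buffer = [
    [(float(x), 3.0 * x) for x in range(1, 4)],
    [(float(x), 3.0 * x) for x in range(2, 5)],
]
weights = train(buffer, learning_rate=0.01, max_iter=500)
```

On this toy buffer the learned weight approaches the slope of 3 used to generate the data, which is a quick check that the sample, forward, update cycle is wired correctly before substituting the actual model and objective.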