| Algorithm 1: V-learning. |
|---|
| 1 Initialize a class of policies, $\Pi = \{\pi_\beta : \beta \in \mathcal{B}\}$, and a model, $V(\pi, s; \theta^{\pi})$; |
| 2 Set $k = 1$ and initialize $\beta_1$ to a starting value in $\mathcal{B}$; |
| 3 while not converged do |
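| 4 Estimate $\hat{\theta}^{\pi_{\beta_k}}$; |
| 5 Evaluate $\widehat{V}(\pi_{\beta_k})$; |
| 6 Set $\beta_{k+1} = \beta_k + \alpha_k \nabla_{\beta} \widehat{V}(\pi_{\beta_k})$ for some step size, $\alpha_k$, where $\nabla_{\beta} \widehat{V}(\pi_{\beta_k})$ is the gradient of $\widehat{V}(\pi_{\beta_k})$; |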
| 7 end |
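
The loop in Algorithm 1 can be sketched in Python as plain gradient ascent over the policy parameter. This is a minimal sketch under stated assumptions, not the authors' implementation: `estimate_theta`, `evaluate_value`, `numerical_gradient`, and `v_learning` are hypothetical names, the gradient is approximated by central finite differences (the algorithm does not say how it is computed), convergence is declared when successive iterates are closer than a tolerance, and the toy objective at the end is invented purely so the code runs.

```python
import numpy as np


def numerical_gradient(f, beta, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at beta (an assumption)."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (f(beta + step) - f(beta - step)) / (2.0 * eps)
    return grad


def v_learning(estimate_theta, evaluate_value, beta_init,
               step_size=lambda k: 0.1, tol=1e-6, max_iter=1000):
    """Gradient ascent on the estimated value of the policy indexed by beta.

    estimate_theta(beta) -> theta_hat         (step 4, hypothetical callable)
    evaluate_value(beta, theta_hat) -> float  (step 5, hypothetical callable)
    """
    beta = np.atleast_1d(np.asarray(beta_init, dtype=float))

    def value_of(b):
        theta_hat = estimate_theta(b)          # step 4: fit the value model for this policy
        return evaluate_value(b, theta_hat)    # step 5: estimated value of this policy

    for k in range(1, max_iter + 1):
        grad = numerical_gradient(value_of, beta)
        beta_next = beta + step_size(k) * grad  # step 6: ascent update
        if np.linalg.norm(beta_next - beta) < tol:
            return beta_next, k
        beta = beta_next
    return beta, max_iter


# Toy stand-ins (not from the source): the "value" is a concave quadratic
# maximized at `target`, and the "estimate" step simply echoes beta.
target = np.array([1.0, -2.0])
beta_hat, n_iter = v_learning(
    estimate_theta=lambda b: b.copy(),
    evaluate_value=lambda b, theta: -float(np.sum((theta - target) ** 2)),
    beta_init=[0.0, 0.0],
)
```

In a real application, `estimate_theta` would solve the estimating equation for the value-model parameter under the current policy, and `evaluate_value` would compute the estimated value of that policy from the fitted model; both are left abstract here because the algorithm box does not spell them out.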