Algorithm 1: V-learning. |
---|
1 Initialize a class of policies, , and a model, V (π, s; θπ); |
2 Set k = 1 and initialize β1 to a starting value in ; |
3 while Not converged do |
4 Estimate ; |
5 Evaluate ; |
6 Set for some step size, αk, where is the gradient of ; |
7 end |