Author manuscript; available in PMC: 2021 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2019 Apr 17;115(530):692–706. doi: 10.1080/01621459.2018.1537919
Algorithm 1: V-learning.
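1 Initialize a class of policies, $\Pi = \{\pi_\beta : \beta \in B\}$, and a model, $V(\pi, s; \theta^{\pi})$;
2 Set $k = 1$ and initialize $\beta_1$ to a starting value in $B$;
3 while Not converged do
4  Estimate $\hat{\theta}_n^{\pi_{\beta_k}} = \arg\min_{\theta^{\pi_{\beta_k}} \in \Theta} \left\{ \Lambda_n(\pi_{\beta_k}, \theta^{\pi_{\beta_k}})^{\top} \, \Omega \, \Lambda_n(\pi_{\beta_k}, \theta^{\pi_{\beta_k}}) + \lambda_n P(\theta^{\pi_{\beta_k}}) \right\}$;
5  Evaluate $\hat{V}_{n,R}(\pi_{\beta_k}) = \int V(\pi_{\beta_k}, s; \hat{\theta}_n^{\pi_{\beta_k}}) \, dR(s)$;
6  Set $\beta_{k+1} = \beta_k + \alpha_k \nabla_{\beta_k} \hat{V}_{n,R}(\pi_{\beta_k})$ for some step size, $\alpha_k$, where $\nabla_{\beta_k} \hat{V}_{n,R}(\pi_{\beta_k})$ is the gradient of $\hat{V}_{n,R}(\pi_{\beta_k})$;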
7 end
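
The loop in Algorithm 1 can be sketched in a few lines of NumPy. What follows is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a linear model $V(\pi, s; \theta^{\pi}) = \Phi(s)^{\top}\theta^{\pi}$ with a hypothetical basis $\Phi(s) = (1, s, s^2)$, $\Omega = I$, $P(\theta) = \|\theta\|^2$, a binary action with a logistic policy class, known treatment probabilities `mu`, a discount factor `gamma`, and $R$ taken to be the empirical distribution of a supplied set of reference states. It also assumes that $\Lambda_n$ is an importance-weighted temporal-difference estimating function, which makes step 4 a ridge-type linear solve. The gradient in step 6 is approximated by central finite differences, and the step size $\alpha_k$ is chosen by halving until the estimated value does not decrease. All function names, and the toy data generator, are hypothetical.

```python
import numpy as np

def features(s):
    # Hypothetical basis Phi(s) = (1, s, s^2) for a scalar state.
    s = np.asarray(s, dtype=float)
    return np.stack([np.ones_like(s), s, s ** 2], axis=-1)

def policy_prob(beta, s, a):
    # pi_beta(a; s) for a binary action, with P(A = 1 | s) = expit(b0 + b1 * s).
    p1 = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * s)))
    return np.where(a == 1, p1, 1.0 - p1)

def fit_theta(beta, S, A, U, S_next, mu, gamma, lam):
    # Step 4 (assumed form): Lambda_n(theta) = b - M @ theta for linear V,
    # so with Omega = I and a squared-norm penalty the minimizer is
    # theta = (M'M + lam I)^{-1} M'b.
    w = policy_prob(beta, S, A) / mu          # importance weights pi / mu
    Phi, Phi_next = features(S), features(S_next)
    n = len(S)
    b = (w * U) @ Phi / n
    M = (Phi * w[:, None]).T @ (Phi - gamma * Phi_next) / n
    return np.linalg.solve(M.T @ M + lam * np.eye(M.shape[1]), M.T @ b)

def value_estimate(beta, data, S_ref, gamma, lam):
    # Step 5: average the fitted V over the reference states (empirical R).
    theta = fit_theta(beta, *data, gamma, lam)
    return float(np.mean(features(S_ref) @ theta))

def numerical_gradient(f, beta, eps=1e-5):
    # Central finite differences, standing in for an analytic gradient.
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        e = np.zeros_like(beta)
        e[j] = eps
        grad[j] = (f(beta + e) - f(beta - e)) / (2.0 * eps)
    return grad

def v_learning(data, S_ref, beta0, gamma=0.9, lam=1e-3,
               alpha=1.0, tol=1e-6, max_iter=50):
    f = lambda b: value_estimate(b, data, S_ref, gamma, lam)
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        current = f(beta)
        grad = numerical_gradient(f, beta)
        step = alpha
        # Halve the step until the estimated value does not decrease.
        while step > 1e-8 and f(beta + step * grad) < current:
            step /= 2.0
        if step <= 1e-8:
            break                              # no improving step found
        new_beta = beta + step * grad          # step 6: gradient ascent update
        converged = np.linalg.norm(new_beta - beta) < tol
        beta = new_beta
        if converged:
            break
    return beta

def simulate_toy_data(n=2000, seed=0):
    # Toy pooled transitions (S, A, U, S_next, mu); purely illustrative.
    rng = np.random.default_rng(seed)
    S = rng.normal(size=n)
    A = rng.integers(0, 2, size=n)             # randomized with probability 0.5
    S_next = 0.5 * S + 0.5 * (2 * A - 1) + 0.1 * rng.normal(size=n)
    U = -S_next ** 2                           # utility rewards states near 0
    return S, A, U, S_next, np.full(n, 0.5)
```

Calling `v_learning(simulate_toy_data(), S_ref, beta0=[0.0, 0.0])`, with `S_ref` set to the observed states, returns the final $\beta_k$. Because a step is accepted only when the estimated value does not decrease, the returned policy's estimated value is never below that of the starting value. Real applications would replace the toy generator with observed trajectories and may prefer an analytic gradient and a data-driven choice of $\lambda_n$.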