Estimating Dynamic Treatment Regimes in Mobile Health Using V-learning

Author manuscript; available in PMC: 2021 Jan 1.

Published in final edited form as: J Am Stat Assoc. 2019 Apr 17;115(530):692–706. doi: 10.1080/01621459.2018.1537919

Algorithm 1: V-learning.
1 Initialize a class of policies, $Π = \{π_{β} : β \in B\}$, and a model, $V(π, s; θ^{π})$;
2 Set $k = 1$ and initialize $β^{1}$ to a starting value in $B$;
3 while not converged do
4   Estimate $\hat{θ}_{n}^{π_{β^{k}}} = \arg\min_{θ^{π_{β^{k}}} \in Θ} \{ Λ_{n}(π_{β^{k}}, θ^{π_{β^{k}}})^{⊺} Ω Λ_{n}(π_{β^{k}}, θ^{π_{β^{k}}}) + λ_{n} P(θ^{π_{β^{k}}}) \}$;
5   Evaluate $\hat{V}_{n, R}(π_{β^{k}}) = \int V(π_{β^{k}}, s; \hat{θ}_{n}^{π_{β^{k}}}) \, dR(s)$;
6   Set $β^{k + 1} = β^{k} + α^{k} \nabla_{β_{k}} \hat{V}_{n, R}(π_{β^{k}})$ for some step size $α^{k}$, where $\nabla_{β_{k}} \hat{V}_{n, R}(π_{β^{k}})$ is the gradient of $\hat{V}_{n, R}(π_{β^{k}})$;
7 end
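
The loop in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the algorithm box leaves several ingredients unspecified, so the following are assumptions made only for illustration: Λ_n is taken to be an importance-weighted temporal-difference estimating function, V is linear in a feature vector φ(s), Ω is the identity, P(θ) is the squared Euclidean norm, the policy is logistic over two actions and uses the same features as V, R is the empirical distribution of a set of reference states, the step size is constant, and the gradient is approximated by central finite differences. All function and argument names (`policy_prob`, `fit_theta`, `value`, `v_learning`, `data`, `Phi_ref`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_prob(beta, Phi, A):
    """pi_beta(A | s) for binary actions under a logistic policy (assumed form)."""
    p1 = sigmoid(Phi @ beta)
    return np.where(A == 1, p1, 1.0 - p1)

def fit_theta(beta, data, gamma, lam):
    """Step 4 under the stated assumptions: linear V, Omega = I, ridge penalty."""
    Phi, Phi_next, A, U, mu = data
    n = len(U)
    w = policy_prob(beta, Phi, A) / mu                  # importance weights pi / mu
    c = Phi.T @ (w * U) / n                             # part of Lambda_n free of theta
    M = (Phi * w[:, None]).T @ (gamma * Phi_next - Phi) / n  # part linear in theta
    # Lambda_n(theta) = c + M theta, so the penalised quadratic form
    # ||c + M theta||^2 + lam * ||theta||^2 has a closed-form minimiser.
    p = M.shape[1]
    return -np.linalg.solve(M.T @ M + lam * np.eye(p), M.T @ c)

def value(beta, data, Phi_ref, gamma, lam):
    """Step 5: average the fitted V over states drawn from the reference distribution."""
    theta = fit_theta(beta, data, gamma, lam)
    return float(np.mean(Phi_ref @ theta))

def v_learning(data, Phi_ref, beta0, gamma=0.9, lam=1e-3, alpha=0.2,
               tol=1e-6, max_iter=100, eps=1e-5):
    """Steps 2 to 7: gradient ascent on the estimated value (constant step size)."""
    beta = np.asarray(beta0, dtype=float).copy()
    f = lambda b: value(b, data, Phi_ref, gamma, lam)
    for _ in range(max_iter):
        grad = np.zeros_like(beta)
        for j in range(beta.size):                      # central finite differences
            e = np.zeros_like(beta)
            e[j] = eps
            grad[j] = (f(beta + e) - f(beta - e)) / (2.0 * eps)
        step = alpha * grad
        beta = beta + step                              # Step 6
        if np.linalg.norm(step) < tol:                  # simple convergence check
            break
    return beta
```

Here `data` is a tuple `(Phi, Phi_next, A, U, mu)` holding, for each observed transition, the current and next feature vectors, the action taken, the observed utility, and the known probability of the action under the data-generating policy. Because V is linear and Ω is the identity in this sketch, Step 4 reduces to a ridge-type linear solve; a nonlinear V or a different Ω would require a numerical optimizer in its place.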