Author manuscript; available in PMC: 2023 Apr 4.
Published in final edited form as: Ann Stat. 2022 Dec 21;50(6):3364–3387. doi: 10.1214/22-AOS2231

Algorithm 1:

Tuning parameter selection via cross-validation

1 Input: Data {Z_h}_{h=1}^N, a set of M policies {π_1, · · ·, π_M} ⊂ Π, a set of J candidate tuning parameters {(μ_j, λ_j)}_{j=1}^J for the value function estimation, and a set of J candidate tuning parameters {(μ′_j, λ′_j)}_{j=1}^J for the ratio function estimation.
2 Randomly split the data into K disjoint subsets: {Z_h}_{h=1}^N = ∪_{k=1}^K D_k.
3 Let e^(1)(m, j) and e^(2)(m, j) denote the total validation error for the m-th policy and the j-th pair of tuning parameters in the value and ratio function estimation, respectively, for m = 1, · · ·, M and j = 1, · · ·, J. Initialize all of them to 0.
4 Repeat for m = 1, · · ·, M,
5  Repeat for k = 1, · · ·, K,
6   Repeat for j = 1, · · ·, J
7    Use {Z_h}_{h=1}^N \ D_k to compute η̂_n^{π_m}, α̂^{π_m} and ν̂(π_m) by (6.2)–(6.3) and (6.4)–(6.5), using tuning parameters (μ_j, λ_j) and (μ′_j, λ′_j) respectively;
8    Compute δ^{π_m}(·; η̂^{π_m}, Q̂_n^{π_m}) and ε^{π_m}(·; Ĥ_n^{π_m}) and their corresponding squared Bellman errors mse^(1) and mse^(2) on the dataset D_k by Gaussian kernel regression;
9    Update e^(1)(m, j) ← e^(1)(m, j) + mse^(1) and e^(2)(m, j) ← e^(2)(m, j) + mse^(2);
10 Compute j^(1)* ∈ argmin_j max_m e^(1)(m, j) and j^(2)* ∈ argmin_j max_m e^(2)(m, j).
11 Output: (μ_{j^(1)*}, λ_{j^(1)*}) and (μ′_{j^(2)*}, λ′_{j^(2)*}).
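The control flow of Algorithm 1 can be sketched as follows. This is a minimal illustration only: the functions `fit_value_fn`, `fit_ratio_fn`, and `bellman_mse` are hypothetical stand-ins for the paper's estimators (6.2)–(6.5) and the kernel-smoothed squared Bellman errors, stubbed here so the K-fold loop and the min-max selection rule are runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_value_fn(train, policy, mu, lam):
    # placeholder for solving (6.2)-(6.3) with tuning parameters (mu, lam)
    return {"policy": policy, "mu": mu, "lam": lam}

def fit_ratio_fn(train, policy, mu_p, lam_p):
    # placeholder for solving (6.4)-(6.5) with tuning parameters (mu_p, lam_p)
    return {"policy": policy, "mu": mu_p, "lam": lam_p}

def bellman_mse(model, val_fold):
    # placeholder for the squared Bellman error on the held-out fold D_k,
    # which the paper computes via Gaussian kernel regression
    return float(np.mean((val_fold - model["mu"]) ** 2))

def select_tuning(data, policies, params_value, params_ratio, K=5):
    # Step 2: random K-fold split of the data
    folds = np.array_split(rng.permutation(data), K)
    M, J = len(policies), len(params_value)
    e1 = np.zeros((M, J))  # validation error, value-function estimation
    e2 = np.zeros((M, J))  # validation error, ratio-function estimation
    # Steps 4-9: accumulate validation errors over policies, folds, parameters
    for m, pi in enumerate(policies):
        for k in range(K):
            train = np.concatenate([f for i, f in enumerate(folds) if i != k])
            for j in range(J):
                v = fit_value_fn(train, pi, *params_value[j])
                r = fit_ratio_fn(train, pi, *params_ratio[j])
                e1[m, j] += bellman_mse(v, folds[k])
                e2[m, j] += bellman_mse(r, folds[k])
    # Step 10: min-max selection, guarding against the worst-case policy
    j1 = int(np.argmin(e1.max(axis=0)))
    j2 = int(np.argmin(e2.max(axis=0)))
    return params_value[j1], params_ratio[j2]
```

The min-max criterion in step 10 picks one pair of tuning parameters per task (value and ratio estimation) that performs acceptably for every candidate policy, rather than tuning separately per policy.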