Author manuscript; available in PMC: 2023 Apr 4.

Published in final edited form as: Ann Stat. 2022 Dec 21;50(6):3364–3387. doi: 10.1214/22-aos2231

Algorithm 1:

Tuning parameter selection via cross-validation

1	Input: Data ${\{Z_{h}\}}_{h = 1}^{N}$, a set of M policies {π₁, · · ·, π_M} ⊂ Π, a set of J candidate tuning parameters ${\{(μ_{j}, λ_{j})\}}_{j = 1}^{J}$ for the value function estimation, and a set of J candidate tuning parameters ${\{(μ_{j}^{'}, λ_{j}^{'})\}}_{j = 1}^{J}$ for the ratio function estimation.
2	Randomly split Data into K subsets: ${\{Z_{h}\}}_{h = 1}^{N} = {\{D_{k}\}}_{k = 1}^{K}$
3	Denote by e⁽¹⁾(m, j) and e⁽²⁾(m, j) the total validation errors for the m-th policy and the j-th pair of tuning parameters in the value and ratio function estimation, respectively, for m = 1, · · ·, M and j = 1, · · ·, J. Set their initial values to 0.
4	Repeat for m = 1, · · ·, M:
5	Repeat for k = 1, · · ·, K:
6	Repeat for j = 1, · · ·, J:
7	Use ${\{Z_{h}\}}_{h = 1}^{N} \setminus D_{k}$ to compute $({\hat{η}}_{n}^{π_{m}}, \hat{α} (π_{m}))$ and $\hat{ν} (π_{m})$ by (6.2)–(6.3) and (6.4)–(6.5), using tuning parameters (μ_j, λ_j) and $(μ_{j}^{'}, λ_{j}^{'})$ respectively;
8	Compute $δ^{π_{m}} (\cdot; {\hat{η}}_{n}^{π_{m}}, {\hat{Q}}_{n}^{π_{m}})$ and $ε^{π_{m}} (\cdot; {\hat{H}}_{n}^{π_{m}})$ and their corresponding squared Bellman errors mse⁽¹⁾ and mse⁽²⁾ on the dataset D_k by Gaussian kernel regression;
9	Assign e⁽¹⁾ (m, j) = e⁽¹⁾ (m, j) + mse⁽¹⁾ and e⁽²⁾ (m, j) = e⁽²⁾(m, j) + mse⁽²⁾;
10	Compute j⁽¹⁾* ∈ argmin_j max_m e⁽¹⁾ (m, j) and j⁽²⁾* ∈ argmin_j max_m e⁽²⁾ (m, j)
11	Output: $(μ_{j^{(1)*}}, λ_{j^{(1)*}})$ and $(μ_{j^{(2)*}}^{'}, λ_{j^{(2)*}}^{'})$.
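
The loop structure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_tuning_parameters` and the four callables `fit_value`, `fit_ratio`, `value_error` and `ratio_error` are hypothetical placeholders. The first two stand in for the estimators in (6.2)–(6.3) and (6.4)–(6.5); the last two stand in for the Gaussian kernel regression step that yields mse⁽¹⁾ and mse⁽²⁾ on the held-out fold. Only the fold splitting, the error accumulation and the min-max selection rule follow the steps above.

```python
import numpy as np


def select_tuning_parameters(data, policies, value_grid, ratio_grid,
                             fit_value, fit_ratio, value_error, ratio_error,
                             n_folds=5, seed=0):
    """Min-max K-fold cross-validation over two grids of (mu, lambda) pairs.

    The four callables are hypothetical user-supplied stand-ins:
      fit_value(train, policy, mu, lam)   -> fitted value-function object
      fit_ratio(train, policy, mu, lam)   -> fitted ratio-function object
      value_error(valid, policy, fit)     -> squared Bellman error (float)
      ratio_error(valid, policy, fit)     -> squared Bellman error (float)
    """
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    # Step 2: randomly split the indices into K folds.
    folds = np.array_split(rng.permutation(len(data)), n_folds)

    n_policies, n_params = len(policies), len(value_grid)
    # Step 3: total validation errors e1[m, j] and e2[m, j], initialised to 0.
    e1 = np.zeros((n_policies, n_params))
    e2 = np.zeros((n_policies, n_params))

    for m, policy in enumerate(policies):            # step 4
        for held_out in folds:                       # step 5
            train_mask = np.ones(len(data), dtype=bool)
            train_mask[held_out] = False
            train, valid = data[train_mask], data[held_out]
            for j in range(n_params):                # step 6
                # Step 7: fit on all data except the held-out fold.
                value_fit = fit_value(train, policy, *value_grid[j])
                ratio_fit = fit_ratio(train, policy, *ratio_grid[j])
                # Steps 8-9: accumulate held-out squared Bellman errors.
                e1[m, j] += value_error(valid, policy, value_fit)
                e2[m, j] += ratio_error(valid, policy, ratio_fit)

    # Step 10: minimise over j the worst-case (max over m) total error.
    j1 = int(np.argmin(e1.max(axis=0)))
    j2 = int(np.argmin(e2.max(axis=0)))
    # Step 11: return the selected pairs (and the error tables for inspection).
    return value_grid[j1], ratio_grid[j2], e1, e2


# Toy usage with made-up stand-ins, only to exercise the control flow.
toy_data = np.arange(20.0)
toy_policies = [0.0, 1.0]
toy_grid = [(0.1, 0.1), (1.0, 1.0), (10.0, 10.0)]
best_value, best_ratio, e1, e2 = select_tuning_parameters(
    toy_data, toy_policies, toy_grid, toy_grid,
    fit_value=lambda train, pol, mu, lam: mu,
    fit_ratio=lambda train, pol, mu, lam: lam,
    value_error=lambda valid, pol, fit: (fit - 1.0) ** 2 + pol,
    ratio_error=lambda valid, pol, fit: (fit - 10.0) ** 2,
)
```

Taking the maximum over policies before minimising over j means that a single pair of tuning parameters is chosen to control the worst validation error across the whole policy set, rather than being tuned separately for each policy.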