Algorithm 1:
Tuning parameters selection via cross-validation
| 1 | Input: Data, a set of M policies {π1, · · ·, πM} ⊂ Π, a set of J candidate tuning parameters (μj, λj), j = 1, · · ·, J, for the value function estimation, and a set of J candidate tuning parameters for the ratio function estimation. |
| 2 | Randomly split Data into K subsets: D1, · · ·, DK. |
| 3 | Let e(1)(m, j) and e(2)(m, j) denote the total validation errors for the m-th policy and the j-th pair of tuning parameters in the value and ratio function estimation, respectively, for m = 1, · · ·, M and j = 1, · · ·, J. Initialize both to 0. |
| 4 | Repeat for m = 1, · · ·, M, |
| 5 | Repeat for k = 1, · · ·, K, |
| 6 | Repeat for j = 1, · · ·, J |
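| 7 | Use Data \ Dk to compute the value function and ratio function estimates for policy πm by (6.2)–(6.3) and (6.4)–(6.5), using tuning parameters (μj, λj) and the j-th candidate for the ratio function, respectively; |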
| 8 | On the held-out dataset Dk, compute the squared Bellman errors mse(1) and mse(2) of the value function and ratio function estimates, respectively, by Gaussian kernel regression; |
| 9 | Assign e(1) (m, j) = e(1) (m, j) + mse(1) and e(2) (m, j) = e(2)(m, j) + mse(2); |
| 10 | Compute j(1)* ∈ argmin_j max_m e(1)(m, j) and j(2)* ∈ argmin_j max_m e(2)(m, j). |
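| 11 | Output: the value function tuning parameters indexed by j(1)* and the ratio function tuning parameters indexed by j(2)*. |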
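
The loop structure and the min-max selection rule of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `select_tuning_parameters` and the four callables `fit_value`, `fit_ratio`, `value_error`, and `ratio_error` are hypothetical placeholders. The estimators from (6.2)–(6.5) and the Gaussian kernel regression used to obtain the squared Bellman errors are not implemented here; the caller is assumed to supply them. The sketch also assumes K ≥ 2 and that the data can be addressed by integer indices 0, …, n − 1.

```python
import numpy as np

def select_tuning_parameters(n, policies, value_params, ratio_params,
                             fit_value, fit_ratio, value_error, ratio_error,
                             K=5, seed=0):
    """Min-max K-fold selection of tuning-parameter indices (sketch).

    Hypothetical callables supplied by the caller:
      fit_value(train_idx, policy, params)  -> fitted value-function object
      fit_ratio(train_idx, policy, params)  -> fitted ratio-function object
      value_error(fit, valid_idx, policy)   -> float, plays the role of mse(1)
      ratio_error(fit, valid_idx, policy)   -> float, plays the role of mse(2)
    """
    M, J = len(policies), len(value_params)
    if len(ratio_params) != J:
        raise ValueError("both candidate sets must contain J elements")
    if K < 2:
        raise ValueError("K must be at least 2")

    # Step 2: random split of the indices 0..n-1 into K subsets.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)

    # Step 3: accumulated validation errors e(1)(m, j) and e(2)(m, j).
    e1 = np.zeros((M, J))
    e2 = np.zeros((M, J))

    for m, policy in enumerate(policies):            # step 4
        for k in range(K):                           # step 5
            valid_idx = folds[k]
            train_idx = np.concatenate(
                [folds[i] for i in range(K) if i != k])
            for j in range(J):                       # step 6
                v = fit_value(train_idx, policy, value_params[j])   # step 7
                r = fit_ratio(train_idx, policy, ratio_params[j])
                e1[m, j] += value_error(v, valid_idx, policy)       # steps 8-9
                e2[m, j] += ratio_error(r, valid_idx, policy)

    # Step 10: minimize over j the worst-case (max over m) total error.
    j1 = int(np.argmin(e1.max(axis=0)))
    j2 = int(np.argmin(e2.max(axis=0)))
    return j1, j2, e1, e2


# Toy usage with made-up stand-ins: "policies" and "parameters" are plain
# numbers, "fitting" returns the parameter, and the error is a distance.
policies = [0.0, 1.0]
value_params = [0.0, 0.5, 2.0]
ratio_params = [3.0, 1.0, 0.4]

j1, j2, e1, e2 = select_tuning_parameters(
    n=20, policies=policies,
    value_params=value_params, ratio_params=ratio_params,
    fit_value=lambda idx, pi, p: p,
    fit_ratio=lambda idx, pi, p: p,
    value_error=lambda fit, idx, pi: (fit - pi) ** 2,
    ratio_error=lambda fit, idx, pi: abs(fit - pi),
    K=5,
)
print(j1, j2)
```

Taking the maximum over policies before minimizing over j selects a single pair of tuning parameters that performs acceptably for every policy in the candidate set, rather than one tuned to a single policy.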