
Table 1.

An approximate policy iteration algorithm for identifying dynamic policies

Step 0. Initialization
  • Step 0a: For each action A ∈ 2𝒜, choose a regression model Q(f(·), A; θA) and initialize the parameter estimates θ̃A (e.g. θ̃A = 0).

  • Step 0b: Choose a feature-extraction function f(·) (see §3.2.2).

  • Step 0c: Choose an exploration rule {En}, n ∈ {1, 2, …} such that En ∈ [0, 1] and En → 0 (see §3.2.1).

  • Step 0d: Choose a learning rule {λi}, i ∈ {1, 2, …} such that λi ∈ (0, 1] and λi → 1 (see §3.2.1).

  • Step 0e: Choose the number of optimization iterations N ≥ 100; set n ← 1 and i ← 1.
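
To make Step 0 concrete, here is a minimal Python sketch of the initialization. None of the concrete choices below come from the paper: the two interventions, the three-element feature vector, the linear form Q(f(h), A; θA) = θA · f(h), and the schedules En = 1/√n and λi = i/(i + 1) are illustrative assumptions that merely satisfy the stated constraints (En → 0; λi ∈ (0, 1] with λi → 1).

```python
from itertools import combinations

import numpy as np

# Hypothetical set of available interventions (the paper's action set 𝒜 is
# problem-specific; these names are placeholders).
INTERVENTIONS = ("school_closure", "vaccination")
N_FEATURES = 3  # assumed length of the feature vector f(h)

# The action space is the power set 2^𝒜: every combination of interventions.
ACTIONS = [frozenset(c)
           for r in range(len(INTERVENTIONS) + 1)
           for c in combinations(INTERVENTIONS, r)]

# Step 0a: one parameter vector per action for a linear model
# Q(f(h), A; θ_A) = θ_A · f(h), initialized at zero.
theta = {A: np.zeros(N_FEATURES) for A in ACTIONS}

def Q(phi, A, theta):
    """Predicted total discounted future loss for features phi and action A."""
    return float(theta[A] @ phi)

# Step 0b: feature extraction f(·); here just an intercept, the latest
# observation, and the number of elapsed decision periods (illustrative only).
def f(history):
    last_obs = history[-1][1] if history else 0.0
    return np.array([1.0, last_obs, float(len(history))])

# Step 0c: exploration rule with E_n → 0.
def exploration_rate(n):
    return 1.0 / np.sqrt(n)

# Step 0d: learning rule with λ_i ∈ (0, 1] and λ_i → 1.
def learning_rate(i):
    return i / (i + 1.0)

# Step 0e: number of optimization iterations and counters.
N = 100
n, i = 1, 1
```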

Step 1. While n ≤ N:
  • Step 1a. Simulate one trajectory of the epidemic:

    • Set the initial sampled history ĥ1 (e.g. ĥ1 ← {}).

    • For each decision point k ∈ {1, 2, 3, …} during this epidemic trajectory:

      • Check the termination condition: if the simulation has reached the simulation horizon or the disease has been eradicated, stop the simulation, store the index of the last decision point Kn ← k, and go to Step 1b.

      • Make a decision: find the action Âk to be implemented during the period [k, k + 1]:

        • Use the exploration rule En to determine if a greedy or explorative decision should be made.

        • If a greedy decision should be made, use the observed history ĥk to find the greedy decision according to Eq. (8).

        • If an explorative decision should be made, choose a random action for Âk.

      • Simulate to the next decision point: store the loss ℓ̂k and the observation ŷk sampled during the decision period [k, k + 1], update the history ĥk+1 ← {ĥk, Âk, ŷk}, and set k ← k + 1.
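
Continuing the sketch above, the greedy-or-explorative decision in Step 1a could be implemented as follows; theta, f, Q, ACTIONS, and exploration_rate are the hypothetical objects defined after Step 0, and the generic "minimize the predicted loss-to-go" rule stands in for Eq. (8), which is not reproduced in this table.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def choose_action(history, theta, n):
    """Return Â_k for the decision period [k, k+1] (Step 1a)."""
    phi = f(history)
    if rng.random() < exploration_rate(n):
        # Explorative decision: pick an action at random.
        return rng.choice(ACTIONS)
    # Greedy decision: the action with the smallest predicted loss-to-go
    # (stand-in for Eq. (8)).
    return min(ACTIONS, key=lambda A: Q(phi, A, theta))

# Example: the very first decision of an iteration, with an empty history ĥ_1.
h_hat = []
A_hat = choose_action(h_hat, theta, n=1)
```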

  • Step 1b. Back-propagation

    • If the simulation has stopped because of disease eradication at time Kn:

      • q̂Kn ← 0;

    • Else (the simulation reached the end of the simulation horizon):

      • q̂Kn ← minA∈2𝒜 Q(f(ĥKn), A; θ̃A);

    • θ̃ÂKn ← 𝒰(q̂Kn, f(ĥKn), ÂKn; λi, θ̃ÂKn); i ← i + 1.

    • For k = Kn − 1 down to 0 (step −1):

      • q̂k ← ℓ̂k + γ q̂k+1;

      • θ̃Âk ← 𝒰(q̂k, f(ĥk), Âk; λi, θ̃Âk); i ← i + 1.
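
The back-propagation of Step 1b can be sketched as below, again reusing the hypothetical objects from the Step 0 sketch. The update operator 𝒰 is only referenced, not defined, in this table, so the gradient step with vanishing step size (1 − λi) is a stand-in; the discount factor γ = 0.97 is likewise an arbitrary illustration, and for simplicity the sketch omits the separate update at k = Kn and performs only the within-trajectory updates.

```python
import numpy as np

GAMMA = 0.97  # assumed discount factor γ

def update_U(q_target, phi, A, lam, theta):
    """Stand-in for 𝒰: move θ_A so that θ_A · phi shifts toward q_target,
    using a step size (1 − λ_i) that vanishes as λ_i → 1."""
    error = Q(phi, A, theta) - q_target
    theta[A] = theta[A] - (1.0 - lam) * error * phi
    return theta

def backpropagate(feats, acts, losses, final_phi, eradicated, theta, i):
    """Step 1b: walk the stored trajectory backwards.

    feats[k], acts[k], losses[k] hold f(ĥ_k), Â_k and ℓ̂_k for the decision
    points that were simulated; final_phi is f(ĥ_{K_n}).
    """
    if eradicated:
        q_next = 0.0  # eradication: no further losses after K_n
    else:
        # Horizon reached: bootstrap the remaining losses from the model.
        q_next = min(Q(final_phi, A, theta) for A in theta)
    for phi, A, loss in zip(reversed(feats), reversed(acts), reversed(losses)):
        q = loss + GAMMA * q_next          # q̂_k = ℓ̂_k + γ q̂_{k+1}
        theta = update_U(q, phi, A, learning_rate(i), theta)
        i += 1
        q_next = q
    return theta, i
```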

  • Step 1c. Set nn + 1.

Step 2. Return Q(f(·), A; θ̃A) for each action A ∈ 2𝒜.
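
Once the fitted models are returned in Step 2, the induced dynamic policy selects, at each new decision point, the action that minimizes the predicted loss-to-go. A short usage sketch with the same hypothetical objects:

```python
def dynamic_policy(history, theta):
    """Greedy dynamic policy induced by the returned regression models."""
    phi = f(history)
    return min(ACTIONS, key=lambda A: Q(phi, A, theta))

# Illustrative history: (action taken, observation) pairs for two past periods.
observed = [(frozenset(), 120.0), (frozenset({"vaccination"}), 95.0)]
print(dynamic_policy(observed, theta))
```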