
Table 1.

An approximate policy iteration algorithm for identifying dynamic policies

Step 0. Initialization
  • Step 0a: For each action A ∈ 2𝒜, choose a regression model Q(f(·), A; θA) and initialize the parameter estimates θ̃A (e.g. θ̃A = 0).

  • Step 0b: Choose a feature-extraction function f(·) (see §3.2.2).

  • Step 0c: Choose an exploration rule {En}, n ∈ {1, 2, …} such that En ∈ [0, 1] and En → 0 (see §3.2.1).

  • Step 0d: Choose a learning rule {λi}, i ∈ {1, 2, …} such that λi ∈ (0, 1] and λi → 1 (see §3.2.1).

  • Step 0e: Choose the number of optimization iterations N ≥ 100; set n ← 1 and i ← 1.
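
To make Step 0 concrete, here is a minimal Python sketch of the initialization. None of the concrete choices below come from the paper: the two interventions, the three-element feature vector, the linear form Q(f(h), A; θA) = θA · f(h), and the schedules En = 1/√n and λi = i/(i + 1) are illustrative assumptions that merely satisfy the stated constraints (En → 0; λi ∈ (0, 1] with λi → 1).

```python
from itertools import combinations

import numpy as np

# Hypothetical set of available interventions (the paper's action set 𝒜 is
# problem-specific; these names are placeholders).
INTERVENTIONS = ("school_closure", "vaccination")
N_FEATURES = 3  # assumed length of the feature vector f(h)

# The action space is the power set 2^𝒜: every combination of interventions.
ACTIONS = [frozenset(c)
           for r in range(len(INTERVENTIONS) + 1)
           for c in combinations(INTERVENTIONS, r)]

# Step 0a: one parameter vector per action for a linear model
# Q(f(h), A; θ_A) = θ_A · f(h), initialized at zero.
theta = {A: np.zeros(N_FEATURES) for A in ACTIONS}

def Q(phi, A, theta):
    """Predicted total discounted future loss for features phi and action A."""
    return float(theta[A] @ phi)

# Step 0b: feature extraction f(·); here just an intercept, the latest
# observation, and the number of elapsed decision periods (illustrative only).
def f(history):
    last_obs = history[-1][1] if history else 0.0
    return np.array([1.0, last_obs, float(len(history))])

# Step 0c: exploration rule with E_n → 0.
def exploration_rate(n):
    return 1.0 / np.sqrt(n)

# Step 0d: learning rule with λ_i ∈ (0, 1] and λ_i → 1.
def learning_rate(i):
    return i / (i + 1.0)

# Step 0e: number of optimization iterations and counters.
N = 100
n, i = 1, 1
```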

Step 1. While n ≤ N:
  • Step 1a. Simulate one trajectory of the epidemic:

    • Set the initial sampled history ĥ1 (e.g. ĥ1 ← {}).

    • For each decision point k ∈ {1, 2, 3, …} during this epidemic trajectory:

      • Check the termination condition: if the simulation has reached the simulation horizon or the disease has been eradicated, stop the simulation, store the index of the last decision point Kn ← k, and go to Step 1b.

      • Make a decision: find the action Âk to be implemented during the period [k, k + 1]:

        • Use the exploration rule En to determine if a greedy or explorative decision should be made.

        • If a greedy decision should be made, use the observed history ĥk to find the greedy decision according to Eq. (8).

        • If an explorative decision should be made, choose a random action for Âk.

      • Simulate to the next decision point: store the loss ℓ̂k and the observation ŷk sampled during the decision period [k, k + 1], update the history ĥk+1 ← {ĥk, Âk, ŷk}, and set k ← k + 1.
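
Continuing the sketch above, the greedy-or-explorative decision in Step 1a could be implemented as follows; theta, f, Q, ACTIONS, and exploration_rate are the hypothetical objects defined after Step 0, and the generic "minimize the predicted loss-to-go" rule stands in for Eq. (8), which is not reproduced in this table.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def choose_action(history, theta, n):
    """Return Â_k for the decision period [k, k+1] (Step 1a)."""
    phi = f(history)
    if rng.random() < exploration_rate(n):
        # Explorative decision: pick an action at random.
        return rng.choice(ACTIONS)
    # Greedy decision: the action with the smallest predicted loss-to-go
    # (stand-in for Eq. (8)).
    return min(ACTIONS, key=lambda A: Q(phi, A, theta))

# Example: the very first decision of an iteration, with an empty history ĥ_1.
h_hat = []
A_hat = choose_action(h_hat, theta, n=1)
```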

  • Step 1b. Back-propagation

    • If the simulation has stopped because of disease eradication at time Kn:

      • q̂Kn ← 0;

    • Else (the simulation reached the end of the simulation horizon):

      • q̂Kn ← minA∈2𝒜 Q(f(ĥKn), A; θ̃A);

    • θ̃ÂKn ← 𝒰(q̂Kn, f(ĥKn), ÂKn; λi, θ̃ÂKn); i ← i + 1.

    • For k = Kn − 1 down to 0 (step −1):

      • q̂k ← ℓ̂k + γ q̂k+1;

      • θ̃Âk ← 𝒰(q̂k, f(ĥk), Âk; λi, θ̃Âk); i ← i + 1.
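
The back-propagation of Step 1b can be sketched as below, again reusing the hypothetical objects from the Step 0 sketch. The update operator 𝒰 is only referenced, not defined, in this table, so the gradient step with vanishing step size (1 − λi) is a stand-in; the discount factor γ = 0.97 is likewise an arbitrary illustration, and for simplicity the sketch omits the separate update at k = Kn and performs only the within-trajectory updates.

```python
import numpy as np

GAMMA = 0.97  # assumed discount factor γ

def update_U(q_target, phi, A, lam, theta):
    """Stand-in for 𝒰: move θ_A so that θ_A · phi shifts toward q_target,
    using a step size (1 − λ_i) that vanishes as λ_i → 1."""
    error = Q(phi, A, theta) - q_target
    theta[A] = theta[A] - (1.0 - lam) * error * phi
    return theta

def backpropagate(feats, acts, losses, final_phi, eradicated, theta, i):
    """Step 1b: walk the stored trajectory backwards.

    feats[k], acts[k], losses[k] hold f(ĥ_k), Â_k and ℓ̂_k for the decision
    points that were simulated; final_phi is f(ĥ_{K_n}).
    """
    if eradicated:
        q_next = 0.0  # eradication: no further losses after K_n
    else:
        # Horizon reached: bootstrap the remaining losses from the model.
        q_next = min(Q(final_phi, A, theta) for A in theta)
    for phi, A, loss in zip(reversed(feats), reversed(acts), reversed(losses)):
        q = loss + GAMMA * q_next          # q̂_k = ℓ̂_k + γ q̂_{k+1}
        theta = update_U(q, phi, A, learning_rate(i), theta)
        i += 1
        q_next = q
    return theta, i
```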

  • Step 1c. Set nn + 1.

Step 2. Return Q(f(·), A; θ̃A) for each action A ∈ 2𝒜.
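
Once the fitted models are returned in Step 2, the induced dynamic policy selects, at each new decision point, the action that minimizes the predicted loss-to-go. A short usage sketch with the same hypothetical objects:

```python
def dynamic_policy(history, theta):
    """Greedy dynamic policy induced by the returned regression models."""
    phi = f(history)
    return min(ACTIONS, key=lambda A: Q(phi, A, theta))

# Illustrative history: (action taken, observation) pairs for two past periods.
observed = [(frozenset(), 120.0), (frozenset({"vaccination"}), 95.0)]
print(dynamic_policy(observed, theta))
```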