Algorithm 1.
Standard DRL algorithm to train VTPN.
1. Initialize network coefficients θ;
for episode = 1, 2, …, N_episode do
    for k = 1, 2, …, N_patient do
        2. Initialize λ, λ_bla, λ_rec, τ_bla, τ_rec;
           Solve optimization problem (1) with {λ, λ_bla, λ_rec, τ_bla, τ_rec} for s_1;
        for l = 1, 2, …, N_train do
            3. Select an action a_l with ϵ-greedy:
               Case 1: with probability ϵ, select a_l randomly;
               Case 2: otherwise, a_l = arg max_a Q(s_l, a; θ);
            4. Update TPPs using a_l;
            5. Solve optimization problem (1) with the updated TPPs for s_{l+1};
            6. Compute reward r_l = Φ(s_{l+1}) − Φ(s_l);
            7. Store transition {s_l, a_l, r_l, s_{l+1}} in the training data pool;
            8. Train θ with experience replay:
               Randomly select N_batch transitions from the training data pool;
               Update θ using a gradient descent algorithm to solve (3);
        end for
    end for
end for
Output θ |
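
The listing below is a minimal PyTorch sketch of Algorithm 1, provided for illustration only. The plan solver solve_plan (standing in for optimization problem (1)), the plan-quality score plan_score (standing in for Φ), the TPP-update rule in step 4, and the standard DQN loss used in place of problem (3) are all placeholder assumptions rather than the paper's implementation.

# Minimal sketch of Algorithm 1; solve_plan, plan_score, and the TPP
# update rule are hypothetical placeholders, not the paper's method.
import random
from collections import deque

import torch
import torch.nn as nn


class VTPN(nn.Module):
    """Q-network mapping a plan state to Q-values over TPP-adjustment actions."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s):
        return self.net(s)


def solve_plan(tpps):                     # placeholder for optimization problem (1)
    return torch.randn(STATE_DIM)         # returns the plan state s for the current TPPs


def plan_score(state):                    # placeholder for the plan-quality metric Phi
    return float(state.sum())


STATE_DIM, N_ACTIONS = 8, 10
N_EPISODE, N_PATIENT, N_TRAIN, N_BATCH = 5, 4, 20, 32
GAMMA, EPSILON = 0.99, 0.1

q_net = VTPN(STATE_DIM, N_ACTIONS)                      # step 1: initialize theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                           # training data pool

for episode in range(N_EPISODE):
    for k in range(N_PATIENT):
        tpps = {"lam": 1.0, "lam_bla": 1.0, "lam_rec": 1.0,
                "tau_bla": 1.0, "tau_rec": 1.0}         # step 2: initialize TPPs
        s = solve_plan(tpps)                            # solve (1) for s_1
        for l in range(N_TRAIN):
            # step 3: epsilon-greedy action selection
            if random.random() < EPSILON:
                a = random.randrange(N_ACTIONS)
            else:
                with torch.no_grad():
                    a = int(q_net(s).argmax())
            # step 4: adjust one TPP according to the action (placeholder rule)
            key = list(tpps)[a % len(tpps)]
            tpps[key] *= 1.1 if a % 2 == 0 else 0.9
            # step 5: re-solve (1) with the updated TPPs
            s_next = solve_plan(tpps)
            # step 6: reward is the improvement in plan quality
            r = plan_score(s_next) - plan_score(s)
            # step 7: store the transition in the pool
            replay.append((s, a, r, s_next))
            # step 8: experience replay; a generic DQN target stands in for (3)
            if len(replay) >= N_BATCH:
                batch = random.sample(replay, N_BATCH)
                sb = torch.stack([t[0] for t in batch])
                ab = torch.tensor([t[1] for t in batch])
                rb = torch.tensor([t[2] for t in batch])
                sb_next = torch.stack([t[3] for t in batch])
                q = q_net(sb).gather(1, ab.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = rb + GAMMA * q_net(sb_next).max(1).values
                loss = nn.functional.mse_loss(q, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            s = s_next

The sketch omits details such as a separate target network and the paper's specific state representation and action set; those should follow the definitions accompanying problems (1) and (3).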