Author manuscript; available in PMC: 2022 Apr 1.
Published in final edited form as: Med Phys. 2021 Jan 11. doi: 10.1002/mp.14712

Algorithm 1.

Standard DRL algorithm to train VTPN.

1. Initialize network coefficients θ;
for episode = 1, 2, …, N_episode do
 for k = 1, 2, …, N_patient do
  2. Initialize TPPs {λ, λ_bla, λ_rec, τ_bla, τ_rec};
   Solve optimization problem (1) with {λ, λ_bla, λ_rec, τ_bla, τ_rec} for s_1;
  for l = 1, 2, …, N_train do
   3. Select an action a_l with ϵ-greedy:
    Case 1: with probability ϵ, select a_l randomly;
    Case 2: otherwise, a_l = arg max_a Q(s_l, a; θ);
   4. Update TPPs using a_l;
   5. Solve optimization problem (1) with updated TPPs for s_{l+1};
   6. Compute reward r_l = Φ(s_{l+1}) − Φ(s_l);
   7. Store state–action pair {s_l, a_l, r_l, s_{l+1}} in the training data pool;
   8. Train θ with experience replay:
    Randomly select N_batch training data from the training data pool;
    Update θ using a gradient descent algorithm to solve (3);
  end for
 end for
end for
Output θ;
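For illustration, the training loop of Algorithm 1 can be sketched in plain Python. The sketch below is not the paper's implementation: `solve_plan` is a hypothetical stand-in for solving optimization problem (1), `plan_score` stands in for the plan-quality metric Φ, and a simple linear Q-function replaces the VTPN; the action set, learning rate, discount factor, and loop sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for solving optimization problem (1):
# maps the TPP vector to a synthetic "plan state" feature vector.
def solve_plan(tpps):
    return np.tanh(tpps)

# Hypothetical stand-in for the plan-quality metric Phi(s).
def plan_score(state):
    return -np.sum((state - 0.5) ** 2)

N_TPP = 5  # {lambda, lambda_bla, lambda_rec, tau_bla, tau_rec}
# Each action scales one TPP up or down (an assumed action set).
ACTIONS = [(i, f) for i in range(N_TPP) for f in (0.9, 1.1)]

def q_values(theta, s):
    return theta @ s  # linear Q(s, a; theta): one row of theta per action

def train_vtpn(n_episode=3, n_patient=2, n_train=20,
               eps=0.3, lr=1e-2, gamma=0.9, n_batch=8):
    theta = rng.normal(scale=0.1, size=(len(ACTIONS), N_TPP))  # step 1
    pool = []                                                  # replay pool
    for _ in range(n_episode):
        for _ in range(n_patient):
            tpps = rng.uniform(0.5, 1.5, N_TPP)                # step 2
            s = solve_plan(tpps)
            for _ in range(n_train):
                if rng.random() < eps:                         # step 3, case 1
                    a = int(rng.integers(len(ACTIONS)))
                else:                                          # step 3, case 2
                    a = int(np.argmax(q_values(theta, s)))
                idx, factor = ACTIONS[a]
                tpps[idx] *= factor                            # step 4
                s_next = solve_plan(tpps)                      # step 5
                r = plan_score(s_next) - plan_score(s)         # step 6
                pool.append((s, a, r, s_next))                 # step 7
                # Step 8: experience replay on a random minibatch.
                picks = rng.choice(len(pool), min(n_batch, len(pool)),
                                   replace=False)
                for sb, ab, rb, sb1 in (pool[i] for i in picks):
                    target = rb + gamma * np.max(q_values(theta, sb1))
                    td_err = q_values(theta, sb)[ab] - target
                    theta[ab] -= lr * td_err * sb  # gradient step on TD loss
                s = s_next
    return theta

theta = train_vtpn()
print(theta.shape)  # one weight row per action: (10, 5)
```

The Q-update follows the standard DQN-style temporal-difference loss, which is one common reading of "(3)"; substituting the actual VTPN architecture and plan optimizer would preserve the same loop structure.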