Algorithm 1.
Standard DRL algorithm to train VTPN.
1. Initialize network coefficients θ;
for episode = 1, 2, …, N_episode do
    for k = 1, 2, …, N_patient do
        2. Initialize λ, λ_bla, λ_rec, τ_bla, τ_rec;
           Solve optimization problem (1) with {λ, λ_bla, λ_rec, τ_bla, τ_rec} for s_1;
        for l = 1, 2, …, N_train do
            3. Select an action a_l with ϵ-greedy:
               Case 1: with probability ϵ, select a_l randomly;
               Case 2: otherwise, a_l = arg max_a Q(s_l, a; θ);
            4. Update TPPs using a_l;
            5. Solve optimization problem (1) with the updated TPPs for s_{l+1};
            6. Compute reward r_l = Φ(s_{l+1}) − Φ(s_l);
            7. Store transition {s_l, a_l, r_l, s_{l+1}} in the training data pool;
            8. Train θ with experience replay:
               Randomly select N_batch transitions from the training data pool;
               Update θ using a gradient descent algorithm to solve (3);
        end for
    end for
end for
Output θ |
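
The listing below is a minimal PyTorch sketch of Algorithm 1, provided for illustration only. The plan solver solve_plan (standing in for optimization problem (1)), the plan-quality score plan_score (standing in for Φ), the TPP-update rule in step 4, and the standard DQN loss used in place of problem (3) are all placeholder assumptions rather than the paper's implementation.

# Minimal sketch of Algorithm 1; solve_plan, plan_score, and the TPP
# update rule are hypothetical placeholders, not the paper's method.
import random
from collections import deque

import torch
import torch.nn as nn


class VTPN(nn.Module):
    """Q-network mapping a plan state to Q-values over TPP-adjustment actions."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s):
        return self.net(s)


def solve_plan(tpps):                     # placeholder for optimization problem (1)
    return torch.randn(STATE_DIM)         # returns the plan state s for the current TPPs


def plan_score(state):                    # placeholder for the plan-quality metric Phi
    return float(state.sum())


STATE_DIM, N_ACTIONS = 8, 10
N_EPISODE, N_PATIENT, N_TRAIN, N_BATCH = 5, 4, 20, 32
GAMMA, EPSILON = 0.99, 0.1

q_net = VTPN(STATE_DIM, N_ACTIONS)                      # step 1: initialize theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                           # training data pool

for episode in range(N_EPISODE):
    for k in range(N_PATIENT):
        tpps = {"lam": 1.0, "lam_bla": 1.0, "lam_rec": 1.0,
                "tau_bla": 1.0, "tau_rec": 1.0}         # step 2: initialize TPPs
        s = solve_plan(tpps)                            # solve (1) for s_1
        for l in range(N_TRAIN):
            # step 3: epsilon-greedy action selection
            if random.random() < EPSILON:
                a = random.randrange(N_ACTIONS)
            else:
                with torch.no_grad():
                    a = int(q_net(s).argmax())
            # step 4: adjust one TPP according to the action (placeholder rule)
            key = list(tpps)[a % len(tpps)]
            tpps[key] *= 1.1 if a % 2 == 0 else 0.9
            # step 5: re-solve (1) with the updated TPPs
            s_next = solve_plan(tpps)
            # step 6: reward is the improvement in plan quality
            r = plan_score(s_next) - plan_score(s)
            # step 7: store the transition in the pool
            replay.append((s, a, r, s_next))
            # step 8: experience replay; a generic DQN target stands in for (3)
            if len(replay) >= N_BATCH:
                batch = random.sample(replay, N_BATCH)
                sb = torch.stack([t[0] for t in batch])
                ab = torch.tensor([t[1] for t in batch])
                rb = torch.tensor([t[2] for t in batch])
                sb_next = torch.stack([t[3] for t in batch])
                q = q_net(sb).gather(1, ab.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = rb + GAMMA * q_net(sb_next).max(1).values
                loss = nn.functional.mse_loss(q, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            s = s_next

The sketch omits details such as a separate target network and the paper's specific state representation and action set; those should follow the definitions accompanying problems (1) and (3).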