Abstract
Although commercial treatment planning systems (TPSs) can automatically solve the optimization problem for treatment planning, human planners need to define and adjust the planning objectives/constraints to obtain clinically acceptable plans. This process is labor-intensive and time-consuming. In this work, we present an end-to-end study to train a deep reinforcement learning (DRL)-based virtual treatment planner (VTP) that can behave like a human planner to operate a dose-volume constrained treatment plan optimization engine, built following the parameters used in the Eclipse TPS, for high-quality treatment planning. We considered prostate cancer IMRT treatment planning as the testbed. The VTP took the dose-volume histogram (DVH) of a plan as input and predicted the optimal strategy for constraint adjustment to improve the plan quality. The training of the VTP followed the state-of-the-art Q-learning framework. Experience replay was implemented with epsilon-greedy search to explore the impact of taking different actions on a large number of automatically generated plans, from which an optimal policy could be learned. Since a major computational cost in training was solving the plan optimization problem repeatedly, we implemented a graphics processing unit (GPU)-based technique that improved the efficiency 2-fold. Upon completion of training, the established VTP was deployed to plan for an independent set of 50 testing patient cases. Connecting the established VTP to the Eclipse workstation via the application programming interface, we also tested the performance of the VTP in operating the Eclipse TPS for automatic treatment planning on another two independent patient cases. Like a human planner, the VTP kept adjusting the planning objectives/constraints to improve plan quality until the plan was acceptable or the maximum number of adjustment steps was reached under both scenarios. The generated plans were evaluated using the ProKnow scoring system.
The mean plan score (± standard deviation) of the 50 testing cases was improved from 6.18 ± 1.75 to 8.14 ± 1.27 by the VTP, with 9 being the maximal score. As for the two cases planned under Eclipse dose optimization, the plan scores were improved from 8 to 8.4 and 8.7, respectively, by the VTP. These results indicate that the proposed DRL-based VTP was able to operate both the in-house dose-volume constrained TPS and the Eclipse TPS to automatically generate high-quality treatment plans for prostate cancer IMRT.
1. INTRODUCTION
Intensity modulated radiation therapy (IMRT) has been widely used in modern clinics for cancer treatment (Bortfeld, 2006). IMRT holds the potential to deliver a high therapeutic dose to the tumor volume while sparing the nearby organs at risk (OARs). Consequently, it offers the possibility of improving local tumor control while preserving the patients’ quality of life (Cho, 2018).
One critical component affecting the effectiveness of IMRT is the quality of the IMRT treatment plan, which defines the beam characteristics needed to achieve the desired radiation dose distribution. Planning is often formulated as an optimization problem with a multi-objective function and solved automatically by modern treatment planning systems (TPSs) (Intensity Modulated Radiation Therapy Collaborative Working Group, 2001). However, depending on the specific objectives and constraints applied to define the plan optimization problem, the generated plan may not be satisfactory. To obtain a clinically acceptable plan, a human planner typically needs to repeatedly observe intermediate optimization results and adjust the objectives/constraints to improve the plan quality. Such a planning process can be labor-intensive and time-consuming. Hence, the final plan quality can depend strongly on the planner's experience and the planning time available (Atun et al., 2015). A fully automatic treatment planning system that can adjust the plan objectives/constraints on its own for high-quality IMRT treatment planning is therefore critical to advancing the clinical use of IMRT for cancer treatment.
To date, multiple techniques have been developed to automate the treatment planning process (Hussein et al., 2018). These include the knowledge-based planning (KBP) method (Chanyavanich et al., 2011; Fogliata et al., 2014; Hussein et al., 2016; Chang et al., 2016; Wang et al., 2017; Kubo et al., 2017), the multicriteria optimization (MCO) method (Craft et al., 2012; Chen et al., 2012; Thieke et al., 2007), and the protocol-based automatic iterative optimization (PB-AIO) approach (Yan et al., 2003; Zhang et al., 2011; Wang et al., 2012; Xhaferllari et al., 2013), among others. Key to the KBP method is utilizing historically achieved, high-quality treatment plans to predict an achievable dose in a new patient from a similar population, or to generate a better starting point for a human planner (Chanyavanich et al., 2011; Fogliata et al., 2014; Hussein et al., 2016; Chang et al., 2016; Wang et al., 2017; Kubo et al., 2017). In this method, the quality of the newly generated plan can depend heavily on the historical plans and the anatomical similarity between the two sets of patients. Moreover, the predicted dose is not guaranteed to be achievable, and further adjustments of the plan configuration by a human planner might be required (Hussein et al., 2018). Central to the MCO method is the concept of the ‘Pareto optimal solution’, which denotes a plan that cannot be further improved for a given objective without degrading one or more other objectives (Craft et al., 2012; Chen et al., 2012; Thieke et al., 2007). Yet a ‘Pareto optimal solution’ may not be clinically desirable, and many plans may need to be generated before a clinically acceptable plan can be selected or interpolated, which can demand substantial computational resources or manual interaction.
As for the PB-AIO approach, script-based or fuzzy-logic-based automatic adjustments of the optimization objectives and constraints gradually improve the plan quality to a clinically acceptable level (Yan et al., 2003; Zhang et al., 2011; Wang et al., 2012; Xhaferllari et al., 2013). A concern with this approach is that the parameter adjustment process itself is not easy to optimize, so the planning efficiency may not be assured (Hussein et al., 2018).
Most recently, along with the rapid development of deep learning (Krizhevsky et al., 2012) and reinforcement learning (Sutton and Barto, 2018), a new architecture named the “intelligent automatic treatment planning (IATP) framework” has been put forward (Shen et al., 2019; Shen et al., 2020; Shen et al., 2021a; Shen et al., 2021b). In the IATP framework, an intelligent virtual treatment planner (VTP) is constructed to operate an in-house TPS like a human planner to generate high-quality treatment plans. Specifically, Shen et al. introduced deep-neural-network-based reinforcement learning (Mnih et al., 2015) to automate weighting parameter tuning in inverse treatment planning with a proof-of-principle study in high dose-rate brachytherapy for cervical cancer (Shen et al., 2019), and then extended the principle to external beam radiotherapy by developing a VTP for prostate cancer IMRT planning (Shen et al., 2020). VTP-based treatment planning was shown to generate high-quality treatment plans with relatively high efficiency.
Yet in these studies, the in-house developed TPS was relatively simple in its objective functions and adjustable parameters compared with commercial TPSs. This raised concerns that the concept of VTP-based treatment planning might not work for a complex TPS, such as a commercial TPS, where dose-volume constraints are typically applied. Recently, reinforcement-learning-based Eclipse treatment planning was shown to be effective in generating treatment plans for pancreas stereotactic body radiation therapy (Zhang et al., 2021), yet much more effort is still needed to investigate the effectiveness of IATP-based automatic treatment planning for broad clinical applications. Hence, it is desirable to implement a complex in-house TPS, such as a dose-volume constrained TPS, to comprehensively investigate the effectiveness of VTP-based treatment planning, with the goal that, once validated, the VTP architecture could be easily adapted to operate a commercial TPS.
In this work, we implemented a dose-volume constrained TPS following the parameters used in the Eclipse TPS for prostate cancer IMRT. We specifically designed an end-to-end VTP neural network to operate the developed TPS. We trained and tested the VTP on two different sets of patient cases. We then connected the established VTP to the Eclipse workstation via the application programming interface (API) and tested the performance of the VTP-based Eclipse automatic treatment planning with another two independent patient cases. We found that the established IATP framework could operate both the in-house dose-volume constrained TPS and the Eclipse TPS for successful treatment planning in prostate cancer IMRT. We report the methods, results and discussion in the following sections.
2. METHODS AND MATERIALS
2.1. The overall architecture of the IATP framework
We illustrate the overall architecture of the proposed IATP framework in Figure 1. As is shown, the TPS started the inverse treatment planning with a trivial set of treatment planning parameters (TPPs). The quantification system then quantified the quality of the produced treatment plan as a numerical score S. If S was lower than the predefined maximum plan score, the VTP would observe the DVH of the current treatment plan and decide how to adjust the TPPs. After that, the TPS would perform the inverse treatment planning again under the updated TPPs. This process was repeated until a satisfactory treatment plan was obtained or the VTP reached its maximum number of TPP-tuning iterations. Compared with conventional human-planner-based treatment planning, the IATP framework features an automatic decision-making process for TPP adjustment via the VTP system.
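The closed loop described above can be sketched in a few lines of Python. This is a minimal illustration only: `tps_optimize`, `score_plan`, and `vtp_adjust` are hypothetical stand-ins for the TPS optimization engine, the plan quantification system, and the VTP network, respectively.

```python
def iatp_planning(initial_tpps, tps_optimize, score_plan, vtp_adjust,
                  max_score=9.0, max_steps=30):
    """Run the IATP loop: optimize, score, and let the VTP adjust TPPs.

    tps_optimize(tpps)    -> DVH of the resulting plan (hypothetical TPS call)
    score_plan(dvh)       -> numerical plan score S (hypothetical scorer)
    vtp_adjust(dvh, tpps) -> updated TPPs (hypothetical VTP policy)
    """
    tpps = dict(initial_tpps)
    dvh = tps_optimize(tpps)           # start from a trivial TPP setting
    score = score_plan(dvh)
    for _ in range(max_steps):
        if score >= max_score:         # plan already satisfies all criteria
            break
        tpps = vtp_adjust(dvh, tpps)   # VTP observes the DVH, updates TPPs
        dvh = tps_optimize(tpps)       # re-run inverse planning
        score = score_plan(dvh)
    return tpps, score
```

The loop terminates either on a satisfactory score or on the step budget, mirroring the two stopping conditions in Figure 1.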
Figure 1.

Flowchart of the intelligent automatic treatment planning (IATP) framework. VTP: virtual treatment planner. TPS: treatment planning system.
To establish an IATP framework suitable for operating a commercial TPS, we needed a special design of the in-house TPS system and the VTP network. The details of the IATP framework are discussed in the following subsections 2.2–2.6.
2.2. The inverse treatment planning optimization algorithm
We developed an in-house dose-volume constrained TPS following the detailed documentation of the plan optimization method for the Eclipse TPS (Varian, 2014). We specifically considered the following features of IMRT treatment planning in the Eclipse TPS: 1) upper and lower constraints (each containing a volume, a threshold dose and a priority) to optimize the dose distribution inside the planning target volume (PTV), 2) upper constraints for the OARs, 3) dose-volume-histogram (DVH)-based optimization, and 4) the dose deposition coefficient matrix. Considering points 1)-3), we formed the objective function as follows:
$$\min_{x \ge 0} \; \big\| Mx - d_p \big\|_-^2 \;+\; \lambda \, \big\| (Mx - t\,d_p)\big|_{V_{PTV}} \big\|_+^2 \;+\; \sum_i \lambda_i \, \big\| (M_i x - t_i\,d_p)\big|_{V_i} \big\|_+^2, \qquad \text{s.t. } D_{95\%}(Mx) = d_p \tag{1}$$
Eq. (1) contains three terms: in the first term, ∥ · ∥− is the standard l2 norm computed over only the negative elements. It requires the dose deposited to the PTV (i.e., Mx) to be no lower than the prescription dose dp. Meanwhile, we imposed D95%(Mx) = dp as a hard lower constraint, requiring that 95% of the PTV volume receive a dose no lower than the prescription dose. ∥ · ∥+ in the second and third terms is the standard l2 norm computed over only the positive elements. VPTV, t·dp and λ in the second term are the percent volume of the PTV, the upper threshold dose and the priority factor, which together serve as the upper constraint for the PTV. Similarly, Vi, ti·dp and λi in the third term form the upper constraint for the ith OAR. In addition, M and Mi are the dose deposition coefficient matrices for the PTV and the ith OAR, respectively, which specify the dose delivered to each voxel inside the patient body from each beamlet under unit output. In this work, they were computed by Eclipse in a beamlet-by-beamlet fashion, with parallel computing used to improve the computational efficiency. After that, they were extracted in a sparse matrix format and implemented in the in-house developed dose optimization engine. x ≥ 0 is the beam fluence map to be optimized.
It is worth mentioning that in the iterative optimization process to solve Eq. (1), VPTV and Vi always refer to the voxels receiving higher doses than the non-selected voxels, following the idea of DVH-based treatment optimization. In summary, we have the lower constraint for the PTV as a hard constraint in our objective function, and the upper constraints for the PTV and OARs (λ, λi, t, ti, Vptv and Vi) as free treatment planning parameters (TPPs) to be tuned by the VTP.
We took prostate cancer IMRT as the testbed and considered cases with one target (the prostate) and two critical OARs (the bladder and the rectum) in this work. We then had nine TPPs to tune in the treatment planning process: λ, λbladder, λrectum, t, tbladder, trectum, Vptv, Vbladder and Vrectum. With a given set of TPPs, the optimization problem of prostate IMRT treatment planning was solved using the alternating direction method of multipliers (ADMM) (Boyd, 2010), which alternately updated the fluence map by driving the gradient of the objective function toward zero in each step.
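One plausible reading of a single DVH-based upper-constraint term from Eq. (1) can be sketched as follows. The helper `dv_upper_penalty` is hypothetical and greatly simplified (pure Python, voxel doses as a flat list); it only illustrates the two ingredients of each penalty term: the hottest-voxel selection implied by the DVH-based constraint, and the ∥ · ∥+ squared-excess penalty scaled by the priority.

```python
def dv_upper_penalty(dose, threshold, vol_frac, weight):
    """Sketch of one upper-constraint term of Eq. (1).

    Following the DVH-based selection rule, only the hottest `vol_frac`
    fraction of voxels is eligible for the penalty; each eligible voxel
    above `threshold` contributes its squared excess (the ||.||_+
    operation), scaled by the priority factor `weight`.
    """
    n_sel = max(1, round(vol_frac * len(dose)))
    hottest = sorted(dose, reverse=True)[:n_sel]   # DVH-based voxel selection
    return weight * sum((d - threshold) ** 2 for d in hottest if d > threshold)
```

In the actual optimization engine this quantity (and its gradient with respect to the fluence map x, through the sparse matrices M and Mi) is what ADMM drives toward zero at each step.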
2.3. The virtual treatment planner network
After building the dose-volume constrained TPS, we employed deep reinforcement learning (DRL) (Mnih et al., 2015) for the VTP development, which combines deep neural networks (Krizhevsky et al., 2012) and reinforcement learning (Sutton and Barto, 2018). Specifically, under the reinforcement learning framework, we considered the entire treatment planning process as a task in which the agent (the VTP) interacted with the environment (the TPS) in a sequence of observations (intermediate treatment plans), actions (TPP adjustments) and rewards (changes in the plan score). Noting that a TPP adjustment in one step could impact the decision making in future steps, we made the standard assumption that the total reward at time t is $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$. Here, $r_i$ was the reward at step i, γ ∈ [0, 1] was the discount factor for future rewards, and T was the terminal step of the planning process. The goal of the VTP was then to select the actions that maximize the future rewards. We applied the optimal action-value function (Q value function) from the Q-learning algorithm (CJ Watkins, 1992) to represent the maximum expected return at step t as
$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \,\middle|\, s_t = s,\; a_t = a,\; \pi \right] \tag{2}$$
Here, π represents a policy mapping state s to action a.
Since we do not know the exact form of the Q value function, we applied a deep neural network (DNN) to parametrize it, considering the flexibility of the hyper-dimensional representation of DNNs (Mnih et al., 2015). The specific DNN architecture used in this work is illustrated in Figure 2. Specifically, one DNN subnetwork was responsible for tuning one TPP. With 9 TPPs, we created a VTP network composed of 9 subnetworks (Figure 2(a)). All subnetworks shared the same architecture, illustrated in Figure 2(b). As is shown, it contained four batch normalization layers, seven Leaky Rectified Linear Unit (LeakyReLU) layers, four 1D convolutional layers, four 1D max-pooling layers and one flatten layer, which differs from the architecture used in previous work (Shen et al., 2020).
Noting that the in-house TPS was DVH based, we sampled the DVH curves of the PTV and OARs to generate the input state for the VTP. Each subnetwork output three Q-values, corresponding to the action options of increasing, decreasing or retaining the TPP magnitude, respectively. We empirically selected the numerical values for the TPP adjustments (Table 1). We expected that the specific value selections would not affect the VTP performance, only the convergence speed. After obtaining the Q values from all nine subnetworks, the action yielding the highest Q value was selected and fed into the TPS for treatment plan optimization.
Table 1.
The empirical magnitude changes for different TPPs in step j based on their values in step j − 1 for different action types.
| Action | Adjustment of the TPP value in step j relative to step j − 1 |
|---|---|
| action 1 | increase the TPP magnitude |
| action 2 | decrease the TPP magnitude |
| action 3 | retain the TPP magnitude |
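Selecting the action with the highest Q value across all subnetworks amounts to a joint argmax over (TPP index, action index) pairs. A minimal sketch (the function name and the flat list-of-lists input format are ours):

```python
def select_action(q_values):
    """Pick the (TPP index, action index) pair with the highest Q value.

    q_values: one row per subnetwork/TPP, each row holding the 3 Q values
    for the actions increase / decrease / retain.
    """
    best = max((q, p, a) for p, row in enumerate(q_values)
               for a, q in enumerate(row))
    return best[1], best[2]   # (which TPP to adjust, which way to adjust it)
```

In the full framework this runs with 9 rows, one per TPP; the chosen pair is then translated into a concrete TPP change via Table 1 and fed back into the TPS.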
To reflect the effect of the VTP operations on plan quality improvement, it was reasonable to compute the reward r as the difference of the plan quality scores after and before the TPP adjustment by the VTP. That is, $r_t = \varphi(s_{t+1}) - \varphi(s_t)$. Here, we used the ProKnow scoring system (ProKnow Systems, Sanford FL, USA) for prostate cancer IMRT plans to obtain φ(s). Relevant to the treatment plan optimization algorithm stated in section 2.2, nine clinical criteria in the ProKnow scoring system were used in this study: DPTV(0.03 cc), Vbladder(80 Gy), Vbladder(75 Gy), Vbladder(70 Gy), Vbladder(65 Gy), Vrectum(75 Gy), Vrectum(70 Gy), Vrectum(65 Gy), and Vrectum(60 Gy), with 79.5 Gy as the prescription dose to 95% of the PTV volume. Each evaluated treatment plan received a score ci ∈ [0, 1] for criterion i, following the same rule as defined in Table 1 of reference (Shen et al., 2021a). Hence, the total score a plan could receive was $\varphi(s) = \sum_{i=1}^{9} c_i$, with a maximum of 9 and a minimum of 0. It is worth mentioning that we did not employ the ProKnow score for D95%, because we set D95% = 79.5 Gy as a hard constraint for the PTV in our optimization engine.
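The score-difference reward can be sketched as follows. Note that `plan_score` below is a deliberately simplified stand-in: it awards a full point per criterion that is met, whereas the real ProKnow system also awards partial credit in [0, 1] per criterion; the criterion names are ours.

```python
def plan_score(values, limits):
    """Simplified ProKnow-style scoring: 1 point per dose-volume criterion
    that is met (the real system also gives partial credit in [0, 1])."""
    return sum(1.0 if values[k] < limits[k] else 0.0 for k in limits)

def reward(score_after, score_before):
    """Reward r = change in plan score produced by one TPP adjustment."""
    return score_after - score_before
```

A positive reward therefore means the TPP adjustment moved the plan closer to satisfying the clinical criteria, and a negative reward penalizes adjustments that degrade the plan.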
2.4. Training of the VTP network
The goal of training the established VTP network was to determine the parameters (weights) θ such that $Q(s, a; \theta) \approx Q^*(s, a)$. Following the Bellman equation, the optimal value function Q*(s, a) could be rewritten as
$$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right] \tag{3}$$
where the next state s′ was formed by taking an action a for current state s, while the corresponding reward was r. We then obtained the optimal value for the (s, a) pair via taking the action a′ that maximized Q* for s′. Consequently, we could train the Q-network by adjusting θi at iteration i to reduce the mean square error in the Bellman equation, forming the loss function Li(θi) at iteration i as
$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right] \tag{4}$$
Here, $y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^+)$ was the approximate target value for Q*(s, a) at iteration i, and θi+ were the parameters of the target Q-network. Once θi+ was fixed, the loss function Li(θi) was well defined and could be minimized via the stochastic gradient descent method (LeCun et al., 1998). After that, θi+ could be updated based on θj (j ≤ i), so that we could alternately optimize the Q-network and the target Q-network. To reduce potential divergence or oscillation in the update of the target Q-network (Q(θ+)), we only updated θ+ every N steps, each update being a clone of θ from the previous Q-network.
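The target value y_i used in the loss can be sketched directly from its definition; the function name and the explicit terminal-state handling (common in Q-learning implementations, though not spelled out in the text) are ours:

```python
def td_target(r, q_next, gamma, terminal=False):
    """Target y_i = r + gamma * max_a' Q(s', a'; theta+).

    q_next: Q values of the target network at the next state s'.
    At the end of an episode there is no future return, so y_i = r.
    """
    if terminal:
        return r
    return r + gamma * max(q_next)
```

Squaring the gap between this target and Q(s, a; θi), averaged over a minibatch, gives the loss of Eq. (4) that stochastic gradient descent minimizes.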
To make the VTP training efficient, we employed the experience replay method (Lin, 1992) for the update of θi at each iteration i, as it is known to break potential correlations among observation sequences. Specific to our problem, et = (st, at, st+1, rt) represented the experience acquired at step t: observing an intermediate treatment plan st, applying TPP adjustment at to the TPS system, generating a new treatment plan st+1 and obtaining a reward rt. As et was continuously generated during the training process, we created a replay memory D = {e1, e2, …, et, …} to store them. Each time we updated θi, we randomly sampled a minibatch of experiences of size LM from D and applied it to solve Equation (4). The size of D was fixed as LD. When D was full, newly generated experiences replaced the oldest elements. The minibatch size LM satisfied LM < LD. The specific values of LD and LM were manually tuned by observing the network training performance.
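The fixed-size, oldest-out replay memory described above maps naturally onto a bounded deque. A minimal sketch (class and method names are ours):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size replay buffer D; the oldest experiences are evicted
    automatically once the capacity L_D is reached."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, next_state, reward):
        # store one experience e_t = (s_t, a_t, s_{t+1}, r_t)
        self.buffer.append((state, action, next_state, reward))

    def sample(self, batch_size):
        # uniform random minibatch of size L_M, which breaks correlations
        # between consecutive planning steps
        return random.sample(self.buffer, batch_size)
```

`deque(maxlen=...)` gives the pop-in/pop-out behavior for free: appending to a full buffer silently discards the oldest element.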
To balance exploration and exploitation for effective Q-learning, we employed the ε-greedy policy in the VTP network training. Specifically, at the initial training stage, the agent did not have much experience to learn from, and hence we set a relatively large ε (ε = 0.999) to let it actively explore the state-action space by randomly choosing a TPP adjustment option for the next-step treatment planning. As training progressed, the agent accumulated more and more experience from which it could exploit optimal strategies. We therefore gradually reduced ε, setting its value at the Nth episode as εN = 0.999/(0.01 · N + 1).
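The decay schedule above is a one-liner; sketched here (function name is ours) to make its endpoints explicit:

```python
def epsilon(episode, eps0=0.999):
    """Exploration rate at episode N: eps_N = eps0 / (0.01 * N + 1)."""
    return eps0 / (0.01 * episode + 1.0)
```

At episode 0 this returns the initial exploration rate of 0.999, and at episode 200 (the end of training) it has decayed to 0.333, matching the initial and final exploration values reported in Table 2.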
2.5. Improving training efficiency with Graphical Processing Unit parallel computing
Training the established VTP took considerable time, given that it contained nine subnetworks. To improve the training efficiency, in addition to employing the replay memory strategy, we applied cProfile (a built-in Python module) to analyze the run time of each individual step. We found that the most time-consuming portion was related to operations on compressed sparse matrices, including the multiplication of a compressed sparse column (CSC) or compressed sparse row (CSR) matrix with a vector, the column indexing of a CSR matrix, etc. In our algorithm, the main sparse matrices were the dose deposition coefficient matrices M and Mi in Eq. (1), which were operated on frequently during the treatment plan optimization process. Hence, to further improve the network training efficiency, we accelerated the TPS via Graphics Processing Unit (GPU) parallel computing on the Nvidia CUDA platform. To support Pythonic access to Nvidia's CUDA parallel computation, we utilized the PyCUDA API (application programming interface).
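For reference, the hot-spot operation that was moved to the GPU, a sparse matrix-vector product in CSR format, looks as follows on the CPU. This is an illustrative pure-Python sketch (not the PyCUDA kernel itself), using the standard CSR layout of `data`/`indices`/`indptr` arrays:

```python
def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR-format sparse matrix by a vector x.

    Each output row is an independent sparse dot product over that row's
    stored nonzeros, which is why the operation parallelizes naturally
    across GPU threads (one thread or warp per row).
    """
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y
```

In the treatment plan optimization, x is the fluence map and the CSR matrix is a dose deposition matrix M or Mi; since the row loops are independent, a CUDA kernel can assign them to parallel threads.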
2.6. Case studies and evaluations under the in-house TPS and Eclipse TPS
We collected 64 prostate cancer IMRT patient cases. They were divided into three groups: 10 for training, 2 for verification and the remaining 52 for testing. The Q-network was built upon the TensorFlow platform in Python. The in-house developed TPS was constructed on top of the CUDA platform with PyCUDA. The entire algorithm was executed on a GPU server with 8 Intel Xeon 2.30 GHz CPU processors, 32 GB memory, and 8 Tesla V100-SXM2 GPU cards.
The Q-network was trained for 200 episodes, with each episode containing a maximum of 30 steps. At the beginning of each episode, we initialized the treatment plan for each training patient case with 7 beam angles and a uniform fluence map for each beam angle. It was then fed into the in-house developed TPS system with a trivial TPP setting (all TPPs = 1 except for Vptv = 0.1) to generate an initial treatment plan. We then sampled a random number ζ ∈ [0, 1]. When ζ > ε, the DVH of the current plan was fed into the Q-network, which made a TPP adjustment decision and received a corresponding reward. Otherwise, a TPP adjustment option was randomly picked from the available TPP adjustment pool. After that, the new TPPs were fed into the TPS system for the next round of treatment plan optimization. This process was repeated until reaching the maximum of 30 time steps or the maximum plan score of 9. Meanwhile, the obtained TPP adjustment experiences were placed into the replay memory for the updates of the Q-network and the target Q-network, following the method discussed in section 2.4.
After training each episode with the ten patient cases, we verified the obtained network with the two verification cases. After obtaining promising results from both the training and verification processes, we comprehensively tested the network with the fifty testing cases. Lastly, to test whether the VTP developed and trained on the in-house TPS could effectively operate a commercial TPS, we connected the trained VTP to the Eclipse research workstation via the API and enabled VTP-guided Eclipse treatment planning for the remaining two testing cases. We quantified the plan quality for all patient cases with the ProKnow scoring system.
3. RESULTS
3.1. Results for the training and verification cases
The optimal performance of the developed VTP was found at episode 190. The corresponding hyperparameter settings were listed in Table 2.
Table 2.
The hyperparameters and their values used to train the VTP.
| Hyperparameter | Value | Description |
|---|---|---|
| learning rate | 1x10−5 | The learning rate used by the VTP |
| minibatch size | 16 | The number of training samples that are used to update θi in Equation (4) |
| target update frequency | 500 | The frequency with which the target parameters θ+ are updated |
| discount factor | 0.7 | Discount factor γ used by the Q learning |
| initial exploration | 0.999 | Initial value of ε from ε-greedy exploration |
| final exploration | 0.333 | Final value of ε from ε-greedy exploration |
| replay memory | 125000 | The number of state action pairs that are stored |
| number of episodes | 200 | Total number of training episodes |
| number of steps | 30 | Maximum number of time steps in each episode |
In Figure 3, we show a representative case illustrating how the VTP iteratively observed a treatment plan DVH generated by the in-house TPS and made TPP adjustment decisions during network training. As is shown, the plan DVH at the beginning failed to satisfy six of the eight OAR criteria, resulting in a low initial plan score of 3. The VTP observed the plan and decided to lower the threshold dose value for the bladder (tBLA in Figure 3(e)), which produced a plan with better bladder sparing in step 1 (Figure 3(b)). It then lowered the threshold dose value and volume for the rectum in the subsequent few steps, resulting in a treatment plan with good sparing of both the bladder and the rectum, but an overdose in the PTV at step 8 (Figure 3(c)). The VTP then continuously boosted the weight for the PTV and finally generated a plan with a full plan score of 9 at step 14 (Figure 3(d)).
Figure 3.

The illustration of the VTP-based treatment planning process for a representative training patient case. (a)-(d): the dose fluence maps and DVHs for the treatment plan before TPP adjustment, and after one, eight and fourteen steps of TPP adjustment by the VTP, respectively. (e) The specific TPP adjustments made by the VTP. Here, ‘BLA’ means bladder and ‘REC’ means rectum.
Statistically, the mean and standard deviation of the initial plan scores over the 10 training patient cases were 5.51 ± 2.16. After the VTP-guided treatment planning with the in-house TPS, the mean and standard deviation of the final plan scores were 8.35 ± 2.59. Six of the cases reached the maximum score of 9. In addition, the two verification cases had initial scores of 4.5 ± 1.50 and ended up with final scores of 8.69 ± 0.27. These results indicated that the VTP agent was trained as expected.
3.2. Results for the testing cases under the in-house TPS and Eclipse TPS
In Figure 4, we illustrate the VTP-based treatment planning for a representative testing patient case. As shown in Figure 4(a), before the VTP-based treatment planning, portions of the bladder and rectum volumes were exposed to the high prescription dose, resulting in a low initial plan score of 4.71. The VTP then decided to decrease the PTV volume receiving a dose larger than the prescription dose (Vptv in Eq. (1)) and to decrease the threshold dose value for the rectum (trectum in Eq. (1)) in steps 1-2 (Figure 4(e)). This resulted in effective dose sparing of the rectum, while the bladder dose was still high (Figure 4(b)). The VTP then decided to decrease the threshold dose for the bladder (tbladder in Eq. (1)) in step 3, which effectively reduced the dose exposure to the rectum and bladder, but at the expense of overdosing the PTV (Figure 4(c)). The VTP then gradually increased the weighting factor of the PTV overdose term (λ in Eq. (1)) in steps 4-9, ending up with a nearly optimal treatment plan (plan score 8.95 out of 9).
Figure 4.

The illustration of the VTP-based treatment planning process for a representative testing patient case. (a)-(d): the dose fluence maps and DVHs for the treatment plan before TPP adjustment, and after two, three and nine time-steps of TPP adjustment by the VTP, respectively. (e) The specific TPP adjustments made by the VTP. Here, ‘BLA’ means bladder and ‘REC’ means rectum.
We then analyzed the statistical distributions of the plan scores before and after the VTP-based treatment planning for all 50 testing cases. Specifically, we divided the initial treatment plans into 8 categories. For the first 7 categories, the treatment plans satisfied a ≤ plan score < b, with a = 2, 3, …, 8 and b = a + 1. The last category contained the treatment plans with a plan score of 9. We then performed the analysis in two ways. In the first analysis, we tracked the score changes after the treatment planning for all 8 categories, computed the mean and standard deviation for each category before and after the treatment planning, and show the result in Figure 5(a). As is shown, after the VTP-guided treatment planning, the average plan scores were significantly improved for the first seven categories, while remaining the same for the last category (maximal plan score). This behavior indicated that the trained VTP was effective in operating the dose optimization engine to generate high-quality treatment plans, even for cases with relatively low initial plan scores. In the second analysis, we divided the final treatment plans into another 8 categories based on their own plan scores. We counted the total number of cases in each category for both the initial and final treatment plans and plotted them side by side in Figure 5(b). As is shown, before the VTP-based treatment planning, most patient cases had a plan score between 5 and 6. After the plan optimization, the majority ended up with a score of 8 or above. Both distributions demonstrated the capability of our trained VTP in performing high-quality treatment planning for prostate cancer IMRT. Overall, the mean and standard deviation over the 50 cases were 6.18 ± 1.75 and 8.14 ± 1.27 before and after the VTP-based treatment plan optimization, respectively.
Figure 5.

The plan score distributions for the 50 testing patient cases before and after the VTP guided treatment planning. (a) The 50 cases were clustered into 8 groups based on their initial plan scores. For each group, the mean plan scores before and after the VTP based planning were represented by the heights of the blue and red bars. The standard deviations were plotted in black. (b) The number of patient cases within different plan score ranges (e.g., ‘3’ means a plan score larger or equal than 3 and smaller than 4) before and after VTP based treatment planning (in blue and red color, respectively).
We also analyzed the dose-volume distributions of the 50 testing patient cases following the ProKnow scoring system. The results are listed in Table 3. As is shown, compared with the initial treatment plans, the average percent volumes exposed to doses ≥75, 70 and 65 Gy for the bladder and to doses ≥75, 70, 65 and 60 Gy for the rectum were all significantly reduced after the VTP-based treatment planning. On the other hand, the average percent volume of the bladder exposed to doses ≥80 Gy and the average dose to the hottest 0.03 cm³ of the PTV (D(0.03 cc)) were slightly increased, but remained well below the criterion values. This dose-volume distribution of the testing patient cases indicated that the trained VTP was able to make effective TPP adjustment decisions that maximized its reward (plan score).
Table 3.
The mean and standard deviation (std.) of the dose-volume values for the 50 testing cases. “Criterion” means the requirement from the ProKnow score system. “Before” and “After” represent treatment plans obtained before and after the VTP guided treatment planning.
| | | Bladder | | | | Rectum | | | | PTV |
|---|---|---|---|---|---|---|---|---|---|---|
| | | V(80 Gy) | V(75 Gy) | V(70 Gy) | V(65 Gy) | V(75 Gy) | V(70 Gy) | V(65 Gy) | V(60 Gy) | D(0.03 cc) |
| Criterion | | <20% | <30% | <40% | <55% | <20% | <30% | <40% | <55% | <87.12 Gy |
| Before | Mean | 2.4% | 19.8% | 24.0% | 26.1% | 26.6% | 34.5% | 39.1% | 42.9% | 80.8 Gy |
| | Std. | 2.4% | 8.6% | 10.3% | 10.5% | 13.5% | 14.5% | 14.8% | 15.5% | 0.24 Gy |
| After | Mean | 5.9% | 12.5% | 15.5% | 18.5% | 5.2% | 7.5% | 12.7% | 29.8% | 85.0 Gy |
| | Std. | 5.6% | 9.1% | 10.3% | 9.5% | 10.0% | 12.4% | 12.4% | 13.1% | 3.4 Gy |
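The dose-volume metrics in Table 3 can be computed from a structure's voxel dose array. The sketch below assumes a uniform voxel grid; the function names and example values are illustrative, not from the paper's implementation:

```python
import numpy as np

def percent_volume_at_dose(dose, threshold):
    """V(threshold): percent of the structure volume receiving >= threshold Gy."""
    dose = np.asarray(dose, dtype=float)
    return 100.0 * np.mean(dose >= threshold)

def dose_to_hottest_volume(dose, volume_cc, voxel_cc):
    """D(volume_cc): minimum dose within the hottest `volume_cc` of the
    structure, e.g. D(0.03 cc) as the PTV near-maximum dose in Table 3."""
    dose = np.sort(np.asarray(dose, dtype=float))[::-1]  # descending
    n = max(1, int(round(volume_cc / voxel_cc)))
    return dose[:n].min()

# Example: a rectum dose array in Gy (hypothetical values)
rectum = np.array([76.0, 72.0, 68.0, 61.0, 40.0])
v70 = percent_volume_at_dose(rectum, 70.0)  # 2 of 5 voxels >= 70 Gy -> 40.0
```

Comparing each metric against its criterion column in Table 3 then gives the pass/fail pattern that the ProKnow score aggregates.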
In addition, for all 50 testing patient cases, it took the trained VTP less than 1 minute per case to generate the final optimized treatment plan with the in-house TPS. In comparison, it took an experienced human planner around 3 minutes to complete the same planning process with an average score of ~8.5 (Shen et al., 2021b), which indicated the high efficiency of the VTP-guided treatment planning.
As for the VTP-guided Eclipse treatment planning, the DVHs of the initial, intermediate and final treatment plans and the corresponding TPP adjustment process for one patient case are illustrated in Figure 6. As shown, Eclipse generated the initial treatment plan under trivial TPP settings (Figure 6(b), step 0). The plan suffered from hot spots in the PTV and scored 8 under the ProKnow scoring system. In the subsequent VTP-guided Eclipse treatment planning, the VTP observed the intermediate treatment plans through the established API and decided to reduce the priority of the rectum dose objectives in steps 1-5. The Eclipse inverse treatment planning under the updated TPPs reduced the dose to the OARs, yet it did not improve the plan score significantly. At step 6, the VTP decided to reduce the upper dose limit of the PTV, which improved the plan score to 8.7 out of a full score of 9. The other patient case also started at a plan score of 8 and was improved to 8.4 after the VTP-guided treatment planning. Throughout the treatment planning process for both patient cases, the VTP-based automatic TPP adjustments were reasonable and improved the plan qualities. This indicated that the VTP established upon the in-house dose-volume constrained TPS was also effective in operating the Eclipse TPS for high-quality treatment planning for prostate cancer IMRT.
Figure 6.

The illustration of the VTP-guided Eclipse treatment planning process for a representative testing patient case. (a) DVHs of the treatment plans obtained before, after three steps and after six steps of TPP adjustments. (b) The specific TPP adjustments made by the VTP.
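The observe-adjust-reoptimize loop described above can be sketched as follows. Here `tps`, `vtp` and `score_plan` are hypothetical stand-ins for the optimization engine (the in-house TPS, or Eclipse via its API), the trained network, and the ProKnow scoring function; the step cap and acceptance threshold are likewise assumptions:

```python
# A minimal sketch of the VTP-guided planning loop, under the assumptions
# stated above; not the paper's actual implementation.
MAX_STEPS = 10          # assumed cap on TPP adjustment steps
ACCEPT_SCORE = 9.0      # maximal ProKnow score

def plan_with_vtp(tps, vtp, score_plan, tpps):
    plan = tps.optimize(tpps)            # initial plan under starting TPPs
    history = [score_plan(plan)]
    for _ in range(MAX_STEPS):
        if history[-1] >= ACCEPT_SCORE:  # plan already acceptable
            break
        action = vtp.predict_adjustment(plan.dvh)  # which TPP to change, and how
        tpps = action.apply(tpps)        # e.g. lower a rectum priority or the PTV upper limit
        plan = tps.optimize(tpps)        # re-optimize under the updated TPPs
        history.append(score_plan(plan))
    return plan, history
```

The same loop serves both scenarios in the paper; only the `tps` backend differs, which is why the VTP trained on the in-house engine could be connected to Eclipse without retraining.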
3.3. Time performance
As mentioned in the Methods section, when we implemented the in-house TPS on the CPU platform, sparse matrix operations were extremely time-consuming in the VTP-guided treatment planning. We therefore reimplemented the TPS on the CUDA platform via the PyCUDA technique. We compared the time performance of the VTP training before and after the PyCUDA acceleration and show the results for the top 5 most time-consuming steps in Figure 7. As shown, the time costs of all sparse-matrix-related operations (‘CSC matvec’, ‘CSR matvec’, ‘CSR column index2’, ‘CSR column index1’) were significantly reduced by the PyCUDA acceleration. As expected, the execution time of the TensorFlow operation (‘TF SessionRunCallable’) was not affected. Overall, the PyCUDA technique improved the running efficiency of the in-house TPS by around 7.1-fold. It reduced the VTP training time from ~80 hours to ~40 hours, a roughly 2-fold improvement. These results indicated that the PyCUDA technique effectively improved the VTP training efficiency.
Figure 7.

The time performance of the top 5 most time-consuming functions in the Q-network training process incorporating the CPU-based (blue) and PyCUDA-accelerated (red) treatment plan optimization engine, respectively. Here, ‘CSC (CSR) matvec’ represents the multiplication of a CSC (CSR) matrix with a vector, ‘CSR column index’ stands for the column indexing of the CSR matrix, and ‘TF SessionRunCallable’ is a TensorFlow operation.
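For clarity, the CSR matrix-vector product that dominates the CPU runtime can be written out as a reference implementation. This pure-NumPy sketch shows the operation itself, not the paper's PyCUDA kernel; its row-wise independence is what makes a one-thread-per-row CUDA kernel effective:

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """Reference CSR matrix-vector product y = A @ x.
    Each row's dot product is independent of the others, so the rows can
    be distributed across GPU threads, which is the basis of the
    'CSR matvec' speedup reported in Figure 7."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for row in range(n_rows):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# 2x3 sparse matrix [[1, 0, 2], [0, 3, 0]] in CSR form
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 3])
y = csr_matvec(data, indices, indptr, np.array([1.0, 1.0, 1.0]))  # [3.0, 3.0]
```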
4. DISCUSSION
We successfully trained a deep reinforcement learning based VTP that could operate both the in-house dose-volume constrained TPS and the Eclipse TPS for automatic treatment planning in prostate cancer IMRT. We used the ProKnow scoring system to quantify the treatment plan quality and to generate the reward for the VTP-based TPP adjustment. We applied replay memory, the ε-greedy policy and the PyCUDA technique for effective and efficient VTP training. Among them, the PyCUDA technique reduced the VTP training time from ~80 hours to ~40 hours. After the VTP was trained for 200 episodes with 10 patient cases, we tested it with another 50 patient cases. On average, it took the trained VTP less than 1 minute per case to operate the in-house TPS and generate a final treatment plan, while it took an experienced human planner around 3 minutes to complete the same planning process (Shen et al., 2021b). The average plan score was improved from 6.18 to 8.14 (out of a full score of 9). The effectiveness of the trained VTP in operating the Eclipse TPS for automatic treatment planning was also tested with another two independent cases through the API connection. The corresponding plan scores were improved from 8 to 8.4 and 8.7, respectively.
It is worth mentioning that in the dose-volume constrained TPS, we had three adjustable constraints for each structure. With two OARs and one PTV considered in this work, nine adjustable constraints were available, and the adjustment decision for each constraint was made by an independent deep neural network. Compared to our previous work built upon a dose-constrained TPS (Shen et al., 2020), the number of networks employed in this work was almost doubled. Although a bigger network is more challenging to train, with more TPPs to choose from the newly established VTP could make a better TPP adjustment decision in each step and hence, once well trained, obtain a high-quality treatment plan more efficiently. More importantly, we found that the VTP established upon the dose-volume constrained TPS was also effective in operating the Eclipse TPS without any further tuning of the network. This inspired us to consider the dose-volume constrained TPS a good approximator of the commercial TPS and to develop new intelligent networks upon the dose-volume constrained TPS before adapting them to a commercial TPS, as an in-house TPS is much more convenient to access than a commercial one.
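The per-constraint decision structure described above can be sketched as follows. This is a hedged illustration: the per-constraint action set, the greedy selection rule, and the `q_networks` interface are assumptions, not the paper's exact design:

```python
import numpy as np

# Nine subnetworks, one per adjustable constraint (three constraints for
# each of the bladder, rectum and PTV). Each subnetwork scores a small
# set of adjustment actions for its own constraint; the greedy choice
# takes the highest Q-value over all (constraint, action) pairs.
ACTIONS = ("increase", "decrease", "keep")  # assumed per-constraint action set

def select_adjustment(q_networks, dvh_state):
    """Pick the (constraint index, action) with the largest predicted Q-value."""
    best = None
    for k, net in enumerate(q_networks):
        q = net(dvh_state)                 # Q-values, shape (len(ACTIONS),)
        a = int(np.argmax(q))
        if best is None or q[a] > best[0]:
            best = (q[a], k, ACTIONS[a])
    _, constraint, action = best
    return constraint, action
```

Under this structure, adding a structure or a constraint type grows the model by whole subnetworks rather than by widening one monolithic output layer, which matches the near-doubling of networks noted above.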
Despite the above successes, we also noticed several limitations in the current work. One problem was that a small portion of patient cases with a low starting plan score (2-4) were not effectively improved by the VTP-based treatment plan optimization (Figures 5(a) and 5(b)). One possible reason is that in our current use of the replay memory technique, the memory buffer was always updated with the most recent experiences without differentiating their levels of importance. This could leave the agent with insufficient exposure to rare but important TPP adjustment experiences. A potential solution is to employ a more sophisticated sampling strategy for the replay memory, so that the agent could ‘see’ those rare but important experiences more frequently and learn to make optimal TPP adjustment decisions for challenging cases more rapidly. Exploring this technique to improve the VTP performance is our next step.
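The difference between the current uniform replay and the proposed importance-aware sampling can be sketched as follows. The priority scheme here is an illustrative assumption (in the spirit of prioritized experience replay), not the paper's method:

```python
import random

class ReplayMemory:
    """FIFO replay buffer. Uniform sampling, as in the current work,
    treats all stored transitions equally; weighted sampling draws rare
    but important experiences more often. The priority values (e.g.
    derived from reward magnitude) are an illustrative assumption."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.priorities = []

    def push(self, transition, priority=1.0):
        if len(self.buffer) >= self.capacity:   # drop the oldest experience
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample_uniform(self, k):
        return random.sample(self.buffer, k)

    def sample_weighted(self, k):
        return random.choices(self.buffer, weights=self.priorities, k=k)
```

With `sample_weighted`, transitions from low-scoring, hard-to-improve cases could be assigned larger priorities so the agent revisits them more often during training.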
In addition, as discussed in our previous publication (Shen et al., 2021b), under the current IATP framework the number of network parameters can increase quickly with the number of TPPs. When the set of TPPs is large, training the network could become extremely challenging and time-consuming. One way to solve this problem is to reduce the network parameters by employing a hierarchical DRL scheme that decomposes the TPP decision process into three subnetworks, which has been realized in (Shen et al., 2021b). Another possible way is to split the treatment planning goal into a sequence of less challenging sub-goals. We could then organize a multi-level network in which each subnetwork targets only its corresponding sub-goal. In this way, each subnetwork would be less complex and easier to converge. Specifically, we could employ a hierarchical actor-critic network, inspired by the work of (Levy et al., 2017b; Levy et al., 2017a). We will explore this possibility in our future work.
5. CONCLUSION
We successfully implemented DRL intelligence with the Q-learning technique to operate an in-house dose-volume constrained TPS for high-quality treatment planning in prostate cancer IMRT. The established DRL network was also found to be effective in operating the commercial Eclipse TPS for high-quality treatment planning. In both situations, the DRL network was able to make reasonable parameter-adjustment decisions when presented with intermediate treatment plans. We consider the in-house dose-volume constrained TPS a good approximator of a commercial TPS, providing a convenient environment to test newly developed intelligent treatment planning architectures before adapting them to a commercial TPS.
Figure 2.

(a) The architecture of the deep neural network for the virtual treatment planner, which was composed of nine subnetworks. (b) The structure of a representative subnetwork, which contains 20 hidden layers.
ACKNOWLEDGEMENT
This work was partially supported by the UT System Rising STARs grant and the NIH/NCI grants R01CA237269 and R01CA254377.
REFERENCES
- Atun R, Jaffray DA, Barton MB, Bray F, Baumann M, Vikram B, Hanna TP, Knaul FM, Lievens Y, Lui TY, Milosevic M, O’Sullivan B, Rodin DL, Rosenblatt E, Van Dyk J, Yap ML, Zubizarreta E and Gospodarowicz M 2015 Expanding global access to radiotherapy Lancet Oncol 16 1153–86
- Bortfeld T 2006 IMRT: a review and preview Phys Med Biol 51 R363–79
- Boyd S 2010 Distributed optimization and statistical learning via the alternating direction method of multipliers Foundations and Trends in Machine Learning 3 1–122
- Chang ATY, Hung AWM, Cheung FWK, Lee MCH, Chan OSH, Philips H, Cheng YT and Ng WT 2016 Comparison of planning quality and efficiency between conventional and knowledge-based algorithms in nasopharyngeal cancer patients using intensity modulated radiation therapy Int J Radiat Oncol Biol Phys 95 981–90
- Chanyavanich V, Das SK, Lee WR and Lo JY 2011 Knowledge-based IMRT treatment planning for prostate cancer Med Phys 38 2515–22
- Chen W, Unkelbach J, Trofimov A, Madden T, Kooy H, Bortfeld T and Craft D 2012 Including robustness in multi-criteria optimization for intensity-modulated proton therapy Phys Med Biol 57 591
- Cho B 2018 Intensity-modulated radiation therapy: a review with a physics perspective Radiat Oncol J 36 1–10
- Craft DL, Hong TS, Shih HA and Bortfeld TR 2012 Improved planning time and plan quality through multicriteria optimization for intensity-modulated radiotherapy Int J Radiat Oncol Biol Phys 82 e83–90
- Fogliata A, Belosi F, Clivio A, Navarria P, Nicolini G, Scorsetti M, Vanetti E and Cozzi L 2014 On the pre-clinical validation of a commercial model-based optimisation engine: application to volumetric modulated arc therapy for patients with lung or prostate cancer Radiother Oncol 113 385–91
- Hussein M, Heijmen BJM, Verellen D and Nisbet A 2018 Automation in intensity modulated radiotherapy treatment planning: a review of recent innovations Br J Radiol 91 20180270
- Hussein M, South CP, Barry MA, Adams EJ, Jordan TJ, Stewart AJ and Nisbet A 2016 Clinical validation and benchmarking of knowledge-based IMRT and VMAT treatment planning in pelvic anatomy Radiother Oncol 120 473–9
- Intensity Modulated Radiation Therapy Collaborative Working Group 2001 Intensity-modulated radiotherapy: current status and issues of interest Int J Radiat Oncol Biol Phys 51 880–914
- Krizhevsky A, Sutskever I and Hinton GE 2012 ImageNet classification with deep convolutional neural networks Advances in Neural Information Processing Systems 25 1097–105
- Kubo K, Monzen H, Ishii K, Tamura M, Kawamorita R, Sumida I, Mizuno H and Nishimura Y 2017 Dosimetric comparison of RapidPlan and manually optimized plans in volumetric modulated arc therapy for prostate cancer Phys Med 44 199–204
- LeCun Y, Bottou L, Bengio Y and Haffner P 1998 Gradient-based learning applied to document recognition Proceedings of the IEEE 86 2278–324
- Levy A, Konidaris G, Platt R and Saenko K 2017a Learning multi-level hierarchies with hindsight arXiv preprint arXiv:.00948
- Levy A, Platt R and Saenko K 2017b Hierarchical actor-critic arXiv preprint arXiv:.00948 12
- Lin L-J 1992 Reinforcement Learning for Robots Using Neural Networks PhD thesis, Carnegie Mellon University
- Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK and Ostrovski G 2015 Human-level control through deep reinforcement learning Nature 518 529–33
- Shen C, Chen L, Gonzalez Y and Jia X 2021a Improving efficiency of training a virtual treatment planner network via knowledge-guided deep reinforcement learning for intelligent automatic treatment planning of radiotherapy Med Phys
- Shen C, Chen L and Jia X 2021b A hierarchical deep reinforcement learning framework for intelligent automatic treatment planning of prostate cancer intensity modulated radiation therapy Phys Med Biol 66
- Shen C, Gonzalez Y, Klages P, Qin N, Jung H, Chen L, Nguyen D, Jiang SB and Jia X 2019 Intelligent inverse treatment planning via deep reinforcement learning, a proof-of-principle study in high dose-rate brachytherapy for cervical cancer Phys Med Biol 64 115013
- Shen C, Nguyen D, Chen L, Gonzalez Y, McBeth R, Qin N, Jiang SB and Jia X 2020 Operating a treatment planning system using a deep-reinforcement learning-based virtual treatment planner for prostate cancer intensity-modulated radiation therapy treatment planning Med Phys 47 2329–36
- Sutton RS and Barto AG 2018 Reinforcement Learning: An Introduction (Cambridge, MA: MIT Press)
- Thieke C, Küfer K-H, Monz M, Scherrer A, Alonso F, Oelfke U, Huber PE, Debus J and Bortfeld T 2007 A new concept for interactive radiotherapy planning with multicriteria optimization: first clinical evaluation Radiother Oncol 85 292–8
- Varian 2014 Eclipse Photon and Electron Algorithms Reference Guide
- Wang J, Hu W, Yang Z, Chen X, Wu Z, Yu X, Guo X, Lu S, Li K and Yu G 2017 Is it possible for knowledge-based planning to improve intensity modulated radiation therapy plan quality for planners with different planning experiences in left-sided breast cancer patients? Radiat Oncol 12 85
- Wang W, Purdie TG, Rahman M, Marshall A, Liu F-F and Fyles A 2012 Rapid automated treatment planning process to select breast cancer patients for active breathing control to achieve cardiac dose reduction Int J Radiat Oncol Biol Phys 82 386–93
- Watkins CJCH and Dayan P 1992 Q-learning Machine Learning 8 279–92
- Xhaferllari I, Wong E, Bzdusek K, Lock M and Chen JZ 2013 Automated IMRT planning with regional optimization using planning scripts J Appl Clin Med Phys 14 176–91
- Yan H, Yin F-F, Guan H-q and Kim JH 2003 AI-guided parameter optimization in inverse treatment planning Phys Med Biol 48 3565
- Zhang J, Wang C, Sheng Y, Palta M, Czito B, Willett C, Zhang J, Jensen PJ, Yin FF, Wu Q, Ge Y and Wu QJ 2021 An interpretable planning bot for pancreas stereotactic body radiation therapy Int J Radiat Oncol Biol Phys 109 1076–85
- Zhang X, Li X, Quan EM, Pan X and Li Y 2011 A methodology for automatic intensity-modulated radiation treatment planning for lung cancer Phys Med Biol 56 3873
