Actor critic with experience replay‐based automatic treatment planning for prostate cancer intensity modulated radiotherapy

Md Mainul Abrar; Parvat Sapkota; Damon Sprouts; Xun Jia; Yujie Chi

doi:10.1002/mp.17915

. 2025 May 31;52(7):e17915. doi: 10.1002/mp.17915

Actor critic with experience replay‐based automatic treatment planning for prostate cancer intensity modulated radiotherapy

Md Mainul Abrar ¹, Parvat Sapkota ¹, Damon Sprouts ¹, Xun Jia ², Yujie Chi ^1,^✉

PMCID: PMC12258009 PMID: 40450383

Abstract

Background

Achieving highly efficient treatment planning in intensity‐modulated radiotherapy (IMRT) is challenging due to the complex interactions between radiation beams and the human body. The introduction of artificial intelligence (AI) has automated treatment planning, significantly improving efficiency. However, existing automatic treatment planning agents often rely on supervised or unsupervised AI models that require large datasets of high‐quality patient data for training. Additionally, these networks are generally not universally applicable across patient cases from different institutions and can be vulnerable to adversarial attacks. Deep reinforcement learning (DRL), which mimics the trial‐and‐error process used by human planners, offers a promising new approach to address these challenges.

Purpose

This work aims to develop a stochastic policy‐based DRL agent for automatic treatment planning that facilitates effective training with limited datasets, universal applicability across diverse patient datasets, and robust performance under adversarial attacks.

Methods

We employ an actor–critic with experience replay (ACER) architecture to develop the automatic treatment planning agent. This agent operates the treatment planning system (TPS) for inverse treatment planning by automatically tuning treatment planning parameters (TPPs). We use prostate cancer IMRT patient cases as our testbed, which includes one target and two organs at risk (OARs), along with 18 discrete TPP tuning actions. The network takes dose–volume histograms (DVHs) as input and outputs a policy for effective TPP tuning, accompanied by an evaluation function for that policy. Training utilizes DVHs from treatment plans generated by an in‐house TPS under randomized TPPs for a single patient case, with validation conducted on two other independent cases. Both online asynchronous learning and offline, sample‐efficient experience replay methods are employed to update the network parameters. After training, six groups, comprising more than 300 initial treatment plans drawn from three datasets, were used for testing. These groups have beam and anatomical configurations distinct from those of the training case. The ProKnow scoring system for prostate cancer IMRT, with a maximum score of 9, is used to evaluate plan quality. The robustness of the network is further assessed through adversarial attacks using the fast gradient sign method (FGSM).

Results

Despite being trained on treatment plans from a single patient case, the network converges efficiently when validated on two independent cases. For testing performance, the mean $\pm$ standard deviation of the plan scores across all test cases before ACER‐based treatment planning is $6.17 \pm 1.90$ . After implementing ACER‐based treatment planning, $92.29 %$ of the cases achieve a perfect score of 9, with only $6.65 %$ scoring between 8 and 9, and no cases being below 7. The corresponding mean $\pm$ standard deviation is $8.92 \pm 0.29$ . This performance highlights the ACER agent's high generality across patient data from various sources. Further analysis indicates that the ACER agent effectively prioritizes leading reasonable TPP tuning actions over obviously unsuitable ones by several orders of magnitude, showing its efficacy. Additionally, results from FGSM attacks demonstrate that the ACER‐based agent remains comparatively robust against various levels of perturbation.

Conclusions

We successfully trained a DRL agent using the ACER technique for high‐quality treatment planning in prostate cancer IMRT. It achieves high generality across diverse patient datasets and exhibits high robustness against adversarial attacks.

Keywords: artificial intelligence, automatic treatment planning, reinforcement learning

1. INTRODUCTION

Real‐time treatment planning represents a challenging problem in intensity modulated radiotherapy (IMRT) for cancer treatment. The difficulty mainly arises from the complex energy deposition properties of the radiation beam within the patient body, which often requires a careful balance between target dose coverage and normal tissue sparing. This balance can be achieved via inverse treatment planning optimization, which involves setting dose constraints for each organ and target, along with weighting factors to prioritize conflicting dose requirements. However, the optimal value set of these treatment planning parameters (TPPs) to achieve clinical acceptable plan are often initialization condition dependent, such as the patient geometries, fluence map, TPP initialization and so forth. ¹ In state‐of‐the‐art radiotherapy clinics, it requires repeatedly adjusting the TPPs by human planners to refine the inverse treatment planning optimization, which is time‐consuming and poses challenges for real‐time planning.

In the era of artificial intelligence (AI), significant efforts have been made to address challenges in treatment planning. One approach aims to shorten the trial‐and‐error process by generating a relatively “good” initial plan using machine learning, allowing human planners to refine and optimize the plan more efficiently. ² For instance, Li et al. ² successfully employed architectures like ResNet ³ and DenseNet ⁴ to create high‐quality starting plans, thereby accelerating the overall treatment planning process. Another approach bypasses the trial‐and‐error process entirely by using machine learning to directly predict desired 3D dose distributions and corresponding 2D fluence map intensities based on patient image and contour data. ⁵ , ⁶ For example, Vandewinckele et al. ⁵ utilized a U‐Net‐based convolutional neural network (CNN) ⁷ to predict 3D dose distributions from CT scans and the contours of targets and organs. They subsequently applied another U‐Net‐based CNN to predict the 2D fluence map from the 3D dose, with or without patient image and contour data, for lung cancer IMRT cases.

Although these AI techniques have significantly reduced the time required for treatment planning, they often depend on large databases of high‐quality patient data for training due to their supervised or unsupervised learning nature. Additionally, the trained networks are frequently not universally applicable across different institutions due to data heterogeneity, ⁸ creating barriers to their widespread adoption. Furthermore, these developments typically lack adversarial attack testing, despite vulnerabilities being identified in various supervised learning algorithms used in medical applications. ⁹

Reinforcement learning (RL), ¹⁰ which mimics the trial‐and‐error learning process used by humans to achieve their goals, offers a new angle to accelerate treatment planning. Unlike other methods, the trial‐and‐error process in RL generates substantial new data samples that the algorithm can learn from, thereby reducing the dependence on large initial datasets. Recently, the application of deep neural network‐based Q‐learning, specifically deep Q‐networks (DQN), ¹¹ has shown promising results in automating treatment planning. ¹² , ¹³ , ¹⁴ , ¹⁵ , ¹⁶ , ¹⁷ For example, Shen et al. ¹² utilized DQN to develop a network that observed dose–volume histograms (DVHs) and output actions to adjust organ weighting factors in inverse treatment planning for high‐dose‐rate brachytherapy in cervical cancer. The same research group later demonstrated the feasibility of this approach for automatic tuning of TPPs in external beam IMRT for prostate cancer. ¹³ Additionally, Sprouts et al. ¹⁶ extended the DQN‐based virtual treatment planner (VTP) to adjust TPPs compatible with commercial TPS systems, achieving effective treatment planning for prostate IMRT.

Despite these promising advancements, the DQN network has inherent limitations that complicate its application to complex, clinically relevant treatment planning scenarios. In clinical settings, human planners often adjust numerous TPPs for different targets and organs at risk (OARs), resulting in a vast state‐action space that the RL algorithm must explore. In such cases, finding the optimal action concerning the Q function can be costly due to challenges like overestimation of the Q value function and difficulties in balancing exploration and exploitation. ¹⁸ To partially address these issues, Shen et al. introduced a knowledge‐guided network training strategy ¹⁴ and a hierarchical approach ¹⁵ within the DQN framework, demonstrating some success in prostate stereotactic body radiation therapy (SBRT) automatic treatment planning. ¹⁷ However, not all challenges were resolved. Additionally, continuous TPP tuning leads to a continuous action space, where DQN's effectiveness diminishes. Like other deep neural networks, DQNs are also vulnerable to adversarial attacks, ¹⁹ raising concerns about their robustness in clinical applications.

To tackle these challenges, we propose a new RL approach, actor–critic with experience replay (ACER), ²⁰ to automate the treatment planning process. The concept underlying this approach was briefly introduced in our previous conference presentation. ²¹ ACER builds on the advanced actor–critic algorithm (A3C), ²² enhancing data sampling efficiency through experience replay. ²³ In this framework, the actor functions as the policy network, aiming to maximize returns, while the critic assesses the quality of the actor's decisions. This setup inherently facilitates exploration and exploitation, and the policy gradient method allows effective exploration of both discrete and continuous action spaces. Previous studies indicate that A3C can be more resistant to adversarial attacks than DQN. ¹⁹ Based on these observations, we propose applying ACER to develop a training‐efficient, robust, and scalable agent for automatic treatment planning applications.

In the following sections, we will detail our implementation of the ACER algorithm for automating the TPP tuning process in inverse treatment planning, using prostate IMRT as a test case. Followed by it is the evaluation of its performance across different datasets, including its vulnerability to adversarial attacks.

2. METHODS AND MATERIALS

2.1. Overall architecture

The overall architecture of the ACER‐based automatic treatment planning system is similar to that of the DQN‐based system, ¹⁶ as shown in Figure 1. The process begins with the random initialization of the TPPs, which are then input into the TPS system for inverse treatment planning. The quality of the resulting treatment plan is evaluated, and if it does not meet the desired standards, both the plan and the TPPs are fed into the ACER‐based VTP system for TPP tuning. With the updated TPPs, the TPS performs inverse treatment planning optimization again. This iterative process continues until the plan quality meets the required standards or the maximum number of TPP tuning iterations is reached.

The workflow of the ACER‐based automatic treatment planning process. Here, “TPP” represents treatment planning parameter. ACER, actor–critic with experience replay; TPP, treatment planning parameter.

In the following subsections, we will give the design details for the in‐house TPS, the ACER‐based VTP system, the testbed, the plan evaluation system, and the system performance test.

2.2. In‐house treatment planning system (TPS)

We developed an in‐house dose‐volume constraint TPS ¹⁶ following the documentation of Varian's Eclipse. ²⁴ For a treatment planning containing a single target and $N$ OARs, the objective function for the fluence map optimization can be formulated as:

\begin{matrix} m i n \frac{1}{2} | | M x - d_{p} {| |}_{-}^{2} + \frac{λ}{2} | | {(M x - t d_{p})}_{V_{PTV}} {| |}_{+}^{2} \\ + \sum_{i}^{N} \frac{λ_{i}}{2} | | {(M_{i} x - t_{i} d_{p})}_{V_{i}} {| |}_{+}^{2}, \\ s . t . x \geq 0, D_{95 %} (M x) = d_{p} . \end{matrix}

(1)

In this equation, ${| \cdot |}_{-}^{2}$ and ${| \cdot |}_{+}^{2}$ represent the standard $l_{2}$ norms, which compute only the negative and positive elements, corresponding to under‐dose and over‐dose constraints, respectively. The under‐dose constraint is further reinforced that at least $95 %$ of the PTV volume receives the prescription dose. The relative importance of the respective terms is adjusted by the weighting factors $λ$ and $λ_{i}$ . Regarding the other variables in the equation, $M$ represents the dose deposition matrix, $x$ is the beamlet vector, and $d_{p}$ is the prescription dose. $t d_{p}$ and $V_{PTV}$ are upper threshold dose and the percentage of the PTV volume considered for over‐dose constraints, respectively. Similarly, $M_{i}$ , $V_{i}$ , and $t_{i}$ are the corresponding variables used for the overdose constraints on the $i$ th OAR. Voxels included in $V_{PTV}$ and $V_{i}$ always receive a higher dose than those not selected.

In summary, the free TPPs to be tuned in this in‐house TPS are $λ$ , $V_{PTV}$ , $t$ , $λ_{i}$ , $V_{i}$ , and $t_{i}^{d}$ (where $i = 1, 2, …, N$ ), totaling $3 (N + 1)$ parameters. Given a set of TPPs, the optimization problem can be solved using the alternating direction method of multipliers (ADMM).

2.3. Actor critic with experience replay (ACER)‐based virtual treatment planner (VTP) system

2.3.1. Working principle of ACER

ACER is a type of deep reinforcement learning that integrates a deep neural network with actor–critic learning while leveraging the experience replay. ²⁰ This approach has shown superior performance in challenging environments, including the Atari57 game collection. ²⁰

In actor–critic learning, an actor agent generates decisions, while a critic agent evaluates those decisions in the context of a sequential decision‐making problem. This process involves a dynamic environment represented by a series of states $s \in S$ associated with a series of possible actions $a \in A$ . After taking an action $a$ in state $s$ , the state transitions into the next state $s^{'}$ with a probability $P r {s^{'} | s, a}$ , yielding a stepwise reward $r \in R$ . The optimal decision made by the actor agent, or the optimal policy $π$ , can be defined as choosing those series of actions $a$ to maximize the accumulated reward for states $s$ over time $t \in (0, 1, 2 …)$ . The corresponding objective function is: ²⁵

J (π) = \sum_{s} \lim_{t \to \infty} P r {s_{t} = s | s_{0}, π} \sum_{a} π (a | s) R_{s}^{a} .

(2)

Representing the policy $π$ by a network that has parameters $θ$ , the policy parameters $θ$ can be optimized using the policy gradient approach governed by the policy gradient theorem ²⁵ as:

\begin{matrix} \begin{matrix} g & = \nabla_{θ} J (π_{θ}) \\ = \sum_{s} \lim_{t \to \infty} P r {s_{t} = s | s_{0}, π} \sum_{a} \nabla_{θ} π_{θ} (a | s) Q^{π_{θ}} (a, s) \\ = E_{θ} [Q^{π_{θ}} (a, s) \nabla_{θ} \log π_{θ} (a | s)] . \end{matrix} \end{matrix}

(3)

Here, $Q^{π} (a, s)$ is the state‐action value function, serving as an effective evaluation of the policy performance.

In actor–critic learning, $Q^{π} (a, s)$ is estimated by the critic agent. When selecting an action $a$ in a state $s$ at step $t$ , the critic estimates $Q^{π} (a_{t}, s_{t})$ as the expected cumulative reward

E_{s_{t + 1} : \infty, a_{t + 1} : \infty} (\sum_{i \geq 0} γ^{i} r_{t + i} | a_{t} = a, s_{t} = s)

(4)

following policy $π$ . Here, $γ \in [0, 1)$ is the discount factor for future rewards.

In ACER, the actor and critic agents are integrated into a deep neural network with “two heads.” One head gives the policy $π_{θ} (a_{t}, s_{t})$ with network parameters $θ$ , while the other outputs the estimate $Q_{θ_{c}} (a_{t}, s_{t})$ with network parameters $θ_{c}$ . Particularly, ACER designs the policy network to contain two parts, the distribution $f$ , and the statistics $ϕ_{θ} (s)$ of $f$ , thus the policy can be fully represented as $π (\cdot | s) = f (\cdot | ϕ_{θ} (s))$ .

To solve the intrinsic instability that arises from combining online RL with deep neural network, ²² ACER employs a hybrid online and offline training strategy to optimize the network performance. Specifically, it adopts the online asynchronous updating of $θ$ and $θ_{c}$ by launching parallel network learners that interact with different instances of the environment. ²² This technique de‐correlates the agents' data, thereby stabilizing the learning process. Additionally, it implements the offline experience replay strategy, ²³ allowing the agent to learn from memory of experiences. Trajectories can be retrieved from the memory and weighted by importance sampling, promoting both learning stability and data efficiency.

Particular to the training of the critic network, ACER constructs the target $Q$ with retrace estimator ²⁶ as

\begin{matrix} \begin{matrix} Q^{tar} (a_{t}, s_{t}) & = Q^{ret} (a_{t}, s_{t}) \\ = r_{t} + γ {\bar{ρ}}_{t + 1} [Q^{ret} (a_{t + 1}, s_{t + 1}) \\ - Q_{θ_{c}} (a_{t + 1}, s_{t + 1})] + γ V_{θ_{c}} (s_{t + 1}) . \end{matrix} \end{matrix}

(5)

Here, $\bar{ρ} = m i n (c, ρ)$ is the truncated importance weight used in experience replay, where $c$ is a constant and $ρ$ is the importance weight. During online updating, $\bar{ρ} = 1$ . The value function $V_{θ_{c}} (s)$ is derived from the critic's $Q$ estimator as $V_{θ_{c}} (s) = Σ_{a} Q_{θ_{c}} (a | s) f (a | ϕ (s))$ . $Q^{ret}$ has been shown to have low variance and to converge effectively. With it, the critic network parameter $θ_{c}$ is then updated as $d θ_{c} = d θ_{c} + \nabla_{θ_{c}} {(Q^{ret} (a, s) - Q_{θ_{c}} (a, s))}^{2}$ .

The low‐variance $Q^{ret}$ is also employed to stabilize online policy updates by replacing $Q^{π_{θ}}$ term as $Q^{ret} - V$ in Equation (3). ACER further incorporates truncated importance sampling in experience replay with bias correction to enhance data efficiency while avoiding excessive bias in policy updates, which yields the offline policy gradient $g$ with respect to $ϕ$ as follows:

\begin{matrix} g_{t}^{ACER} & = & {\bar{ρ}}_{t} \nabla_{ϕ_{θ} (s_{t})} \log f (a_{t} | ϕ_{θ} (s_{t})) [Q^{ret} (a_{t}, s_{t}) - V_{t} (s_{t})] \\ + \underset{a \sim π}{E} ({[\frac{ρ_{t} (a) - c}{ρ_{t} (a)}]}_{+} \nabla_{ϕ_{θ} (s_{t})} \log f (a_{t} | ϕ_{θ} (s_{t})) \\ \times [Q_{θ_{c}} (a, s_{t}) - V_{θ_{c}} (s_{t})]) . \end{matrix}

(6)

Finally, to limit the per‐step changes to the policy and achieve stability, ACER provides the option to utilize a modified version of trust region policy optimization (TRPO), ²⁷ ensuring that the updated policy does not deviate significantly from the average policy network $ϕ_{a}$ . Specifically, it restricts the policy network parameter $θ$ updating at step $t$ as

d θ = d θ + \frac{\partial ϕ_{θ} (s_{t})}{\partial θ} (g_{t}^{ACER} - \max \{0, \frac{k^{T} g_{t}^{ACER} - δ}{{| | k | |}_{2}^{2}}\} k) .

(7)

Here, $k = \nabla_{ϕ_{θ} (s_{t})} D_{K L} [f (\cdot | ϕ_{θ_{a}} (s_{t})) | | f (\cdot | ϕ_{θ} (s_{t}))]$ is the linear Kullback–Leibler divergence ( $D_{K L}$ ) and $δ$ is the predefined divergence constraint. In this situation of disabled TRPO updating, the policy network parameter is updated as

d θ = d θ + \frac{\partial ϕ_{θ} (s_{t})}{\partial θ} g^{ACER} .

(8)

In practice, entropy regularization term $β \nabla_{θ} H (π (s, θ))$ ²² can also be used to boost online and offline policy update performance, with $β$ the weighting parameter. This term improves exploration by discouraging premature convergence to sub‐optimal deterministic policies.

With variables $d θ_{c}$ and $d θ$ computed, the network parameters $θ_{c}$ and $θ$ are updated with RMSprop algorithm. ²⁸ Under active TRPO updating, the parameters $θ_{a}$ for the average policy network $ϕ_{a}$ is updated as $θ_{a} = α θ_{a} + (1 - α) θ$ , with $α$ the average model decay rate.

2.3.2. Establishment of the VTP system upon ACER architecture

We apply the ACER architecture to develop the VTP system for automatic treatment planning as follows. We define an input state $s$ as discrete points taken from the DVH curves of a treatment plan. For a treatment plan containing $M$ target and OARs, the input state $s$ has a dimension of $m \times M$ , where $m$ represents the number of discrete points on each DVH curve. The action space consists of tuning strategies for the TPPs. In this initial approach, we allow each TPP to have two tuning strategies: increasing or decreasing by a predefined amount. With $M$ target and OARs, there are $3 M$ TPPs to tune, resulting in an action space of length $6 M$ . This defines the dimensions of both the policy distribution and the $Q$ ‐value function space, each with length $6 M$ .

Once a TPP tuning action $a$ is predicted for treatment plan $s$ at time $t$ , the in‐house TPS performs inverse treatment planning to generate a new plan $s_{t + 1}$ . For each state‐action pair $(s_{t}, a_{t})$ , the immediate reward $r$ is calculated as the difference in plan quality between the new plan and the current plan, that is, $r_{t} = ψ (s_{t + 1}) - ψ (s_{t})$ , where $ψ (s)$ represents the quality evaluation of the plan $s$ . The total reward is computed as $R = \sum_{i = t}^{t_{epi} - 1} γ^{i - t} r_{i}$ , reflecting the accumulated return across all future plans in that episode.

Training of the ACER‐based VTP follows the standard ACER training process. We launch $N_{a}$ parallel online training agents, each with distinct input treatment plans. The agents asynchronously update the network policy upon completing an episode with a maximum step length of $t_{epi}$ . After each episode, each agent is restarted with a different input treatment plan. During the process, each agent stores its episodic trajectories in a replay buffer with a storage length of $t_{rep}$ steps. Once $t_{s}$ steps have been accumulated, offline policy training examines the experience pool using a batch size of $t_{B}$ steps and begins to update the policy network at a frequency $p - 1$ times higher than that of the online updates. To consistently monitor training performance, we evaluate the process every $t_{eval}$ steps using independent evaluation patient cases. The training continues until a maximum of $T_{\max}$ steps is reached.

2.3.3. Testbed

In line with our previous development efforts, ¹⁶ this paper continues to use IMRT treatment planning for prostate cancer as a testbed to test the proposed ACER‐based VTP system. We consider a scenario involving one target (the prostate) and two OARs (the bladder and the rectum), leading to the optimization of three DVH curves. Each curve is represented by 100 discrete points, resulting in a total of 300 floating values as the input to the ACER agent. According to Equation (1), we have 9 TPPs to tune, creating an action space of 18, as shown in Table 1. The specific increment and decrement amplitudes are determined based on experience and are not expected to significantly affect the overall convergence of the treatment planning process. F

TABLE 1.

The actions to tune the TPPs in step $t$ based on the TPP values in step $t - 1$ .

Actions

λ_{PTV, OAR}^{t}

t_{PTV}^{t}

t_{OAR}^{t}

V_{PTV}^{t}

V_{OAR}^{t}

Increase

1.65 λ_{PTV}^{t - 1}, OAR

\min (1.2

1.01 t_{PTV}^{t - 1})

\min (1

1.25 t_{OAR}^{t - 1})

\min (0.3

1.25 V_{PTV}^{t - 1})

\min (1

1.25 V_{OAR}^{t - 1})

Decrease

0.6 λ_{PTV}^{t - 1}, OAR

\max (1

0.91 t_{PTV}^{t - 1})

0.6 t_{OAR}^{t - 1}

0.8 V_{PTV}^{t - 1}

0.8 V_{OAR}^{t - 1}

Open in a new tab

Note: Here, “OAR” represents both rectum and bladder.

Abbreviations: OAR, organ at risk; TPPs, treatment planning parameters.

We use the ProKnow scoring system (ProKnow Systems, Sanford, FL, USA) for prostate cancer IMRT plans to estimate the plan quality $ψ (s)$ . Nine original scoring criteria from the ProKnow system relevant to our testbed are shown in Table 2. In practice, we slightly adjust these criteria ¹⁴ and apply them to compute $ψ (s)$ for each plan $s$ . With equal weighting on each criterion, the score ranges from 0 to 9.

TABLE 2.

The nine criteria originated from the planIQ scoring system relevant to this study.

Bladder

Rectum

PTV

V_{100.6 %}

V_{94.3 %}

V_{88.1 %}

V_{81.2 %}

V_{94.3 %}

V_{88.1 %}

V_{81.2 %}

V_{75.5 %}

D_{0.03 c c}

< 20 %

< 30 %

< 40 %

< 55 %

< 20 %

< 30 %

< 40 %

< 55 %

< 110 %

Open in a new tab

For this testbed, the specific network architecture of the ACER‐based VTP system is shown in Figure 2. The input DVH data pass through a fully connected layer with a hidden size of 32, followed by a Rectified Linear Unit (ReLU) activation layer. They then enter a long short‐term memory (LSTM) layer with a hidden size of 32. Afterward, the data are split: one path goes through a fully connected layer followed by a softmax layer, which outputs the policy, while the other path flows into a separate fully connected layer to produce the $Q$ value function. The implementation of the network relies on the pytorch and Open AI Gym libraries. The learnable parameters are distributed across the three fully connected layers and the LSTM cell, totaling around 20,000 parameters.

The deep neural network featuring a “two‐head” structure designed for the ACER‐based VTP. It takes the DVH of the current plan as input. It outputs a TPP tuning strategy across 18 actions in one head, and produces the corresponding Q‐value in the other head. ACER, actor critic with experience replay; DVH, dose volume histogram; TPP, treatment planning parameter; VTP, virtual treatment planner.

2.3.4. Datasets

We use three independent datasets of prostate cancer IMRT cases for network training, validation and testing. The first dataset contains 52 independent patient treatment plans, as described in our previous development. ¹³ , ¹⁶ The second dataset is the Common Optimization for Radiation Therapy (CORT) dataset from the Mass General Radiation Oncology Physics Division, which is publicly available and contains one independent patient case. ²⁹ The third dataset is The Radiotherapy Optimization Test Set (TROTS), which includes 30 independent patient cases. ³⁰

The three datasets differ in anatomical detail and available beam configurations. Datasets 1 and 2 provide dose deposition matrices from 180 beam angles (every 2 degrees) covering the entire PTV and OAR volumes. In contrast, Dataset 3 includes dose deposition matrices sampled at specific points within the PTV and OAR volumes for 25 unique combinations of gantry, couch, and collimator angles (Table 3). In this study, we selected a subset of beam angles from the 180 available angles for patient cases in Datasets 1 and 2, while utilizing all 25 available beam configurations from Dataset 3 for network training and testing. Regarding anatomical differences, Dataset 1 generally has higher voxel counts for the PTV, bladder, and rectum compared to Datasets 2 and 3, whereas Dataset 2 has the largest bladder‐to‐PTV overlap. Since only sampled anatomical volumes are available in Dataset 3, it has the smallest voxel counts, especially for OARs, as well as the least overlap between OARs and the PTV. Detailed statistics, including voxel counts for the PTV and OARs, their relative overlaps, and the number of beamlets for each dataset and beam configuration, are summarized in Table 4. Overall, these datasets collectively provide a rich and diverse resource for training and evaluating our ACER‐based agent.

TABLE 3.

Beam settings (gantry, couch, and collimator angles in degrees) for dataset 3.

Gantry	12.1	20.6	20.9	21.7	43.1	43.1	54.6	58.3	66.2	78.1	90.0	95.9	258.4
Couch	330.5	61.7	0.0	90.0	17.2	342.8	19.3	0.0	339.1	339.1	0.0	328.8	0.0
Collimator	88.8	24.4	0.0	0.0	4.2	8.5	24.7	89.1	4.1	8.1	25.7	113.0	17.2
Gantry	264.1	281.6	281.9	288.0	293.5	293.8	298.2	298.2	309.6	309.6	316.9	347.9
Couch	31.2	0.0	20.9	349.9	43.3	20.9	10.8	349.2	7.6	352.4	342.8	29.5
Collimator	29.1	69.3	8.1	41.3	13.4	87.9	41.3	55.9	37.8	17.2	16.9	69.0

Open in a new tab

TABLE 4.

Voxel and beamlet numbers, as well as organ overlap percentages, for the training patient case and the test patient cases in Datasets 1, 2, and 3.

Testing

Training

Dataset1

Dataset2

Dataset3

V_{PTV}

7466

9266 \pm 3002

6770

4954 \pm 18

V_{BLA}

1513

16506 \pm 11529

11592

4975 \pm 11

V_{REC}

2026

4396 \pm 1343

1760

4944 \pm 27

V_{BLA} \cap V_{PTV} / V_{PTV}

(%)

4.4

7.0 \pm 3.1

25.7

0.1 \pm 0.1

V_{REC} \cap V_{PTV} / V_{PTV}

(%)

3.9

1.7 \pm 1.1

1.7

0.0 \pm 0.0

N_{Beamlets}

(7 beams)

2800

2742 \pm 807

948

N/A

N_{Beamlets}

(new 7 beams)

N/A

2912 \pm 652

996

N/A

N_{Beamlets}

(6 beams)

N/A

2496 \pm 559

848

N/A

N_{Beamlets}

(25 beams)

N/A

2937 \pm 333

Open in a new tab

For the specific training, validation, and testing setup, we trained our model using one patient case from Dataset 1 with beam angles set at 0 Inline graphic , 32, 64, 96, 264, 296, and 328, and validated the model using two additional independent patient cases from Dataset 1 with the same angle configuration. Six different test groups were created for testing, as summarized in Table 5. Test group 1 comprises 49 patient cases from Dataset 1 that are independent of the training and validation sets. For this group, we used fixed TPP initialization (all values set to 1, except for $V_{PTV} = 0.1$ ) to generate the initial treatment plans, consistent with the initialization approach from the previous DQN‐based treatment planning study, ¹⁶ thus allowing direct comparison of performance between the two network architectures. Test groups 2 through 6 were generated using random TPP initialization, motivated by two main considerations: first, random initialization enables testing across diverse state‐action‐reward trajectories by creating a wide variety of initial treatment plans; second, it simulates variations in TPP settings across different radiation oncology clinics.

TABLE 5.

Summary of the testing groups, including the number of plan cases, datasets, beam configurations, and TPP initialization methods.

Test Group

N_{Plans}

Dataset

Beam Configuration

TPP Initialization

Same as training setup

Fixed

147

Same as training setup

Random

Same as training setup

Random

1 & 2

new 7‐beam (26 Inline graphic

, 76

, 130

, 180

, 230

, 286

, 336

)

Random

1 & 2

6‐beam (0

, 60

, 120

, 180

, 240

, 300

)

Random

Dataset 3 specific

Random

Open in a new tab

Abbreviation: TPP, treatment planning parameter.

2.4. Adversarial attack

After completing the training of the ACER agent, except for testing its performance on TPP tuning decisions, we also evaluate its robustness against adversarial attacks. Adversarial attacks are malicious attempts to manipulate machine learning models into making incorrect predictions or decisions. ³¹ , ³² Given the potential clinical application of automatic treatment planning agents in the future, it is crucial to train the ACER agent to be robust against such attacks. ⁹

We assume the adversary has access to the trained policy network, allowing it to fool the network by perturbing the input state in a way that exploits the policy's sensitivity. Specifically, we use the fast gradient sign method (FGSM) ³³ to compute the state perturbation $η$ , which is the gradient of the loss function $J$ with respect to the state $s$ :

η = ε sign (\nabla_{s} J (θ, s, y)) .

(9)

Here, $ε$ is a constant, serving as the upper limit for the element‐wise perturbation, i.e., ${| η |}_{\infty} \leq ε$ . $y$ is the distribution over all possible actions. With this, the perturbed state becomes $s^{'} = s + η$ .

We apply the FGSM‐based attack to both the ACER agent trained in this work and the DQN agent from our previous study, ¹⁶ and compare their robustness to the attack. In the ACER agent, $y$ is the stochastic policy $π$ . $J (θ, s, y)$ is represented as the cross entropy loss between $y$ and the distribution that places all weight on the highest‐weighted action $a_{j}$ in $y$ . Specifically

J (θ, s, π) = - \frac{1}{18} (\log (π^{θ} (a_{j} | s)) + \sum_{i \neq j} \log (1 - π^{θ} (a_{i} | s))) .

(10)

In the DQN agent, the policy determined by the Q value function is deterministic, which causes the problem that the gradient of $J (θ, s, y)$ is almost zero for all input states. To solve it, we define $y$ as the softmax of the $Q$ value function. ¹⁹ We set $ε$ to be values of 0.001, 0.01, and 0.1, apply the corresponding attacks and record their perturbations on action priorities and the next‐step treatment plan qualities.

3. RESULTS

3.1. Training results

The hyperparameter values used to configure the ACER‐based VTP for prostate cancer IMRT treatment planning are listed in Table 6. We utilize 3 CPU cores for online asynchronous training, aiming to complete the entire training within 250,000 steps. To encourage early convergence in the treatment planning process, we limit the length of each episode to 20 steps. Offline training begins with the storage of 2000 steps of experiences, with a total maximum experience storage capacity of 100,000 steps. The TRPO‐based updating method has been found to be computationally expensive; therefore, we disable it in this study to simplify the network. The values for the remaining parameters in Table 6 are consistent with those used for training the ACER agent in the Atari57 game set. ²⁰ The entire training process takes approximately 7 days on an Intel(R) Core(TM) i7‐6850K CPU @ 3.60GHz.

TABLE 6.

The hyperparameters and their values used to train the ACER‐based VTP for IMRT treatment planning of prostate cancer.

Hyperparameter

Value

Description

N_{a}

Number of asynchronous training agents

T_{\max}

250,000

Number of total training steps

t_{epi}

Maximum length of an episode

t_{rep}

100,000

Storage capacity of experience replay memory

t_{s}

2000

Number of accumulated transitions before starting off‐policy training

t_{B}

Off‐policy batch size

c

Importance‐weight truncation in experience replay

p

Ratio of off‐policy to on‐policy updates

t_{eval}

500

Interval between two adjacent evaluations

γ

0.99

Discount factor

β

0.001

Weighting for entropy regularization

α

disabled

Decay rate for the average policy model

δ

disabled

Trust region threshold value

Open in a new tab

Abbreviations: ACER, actor critic experience replay; IMRT, intensity modulated therapy; VTP, virtual treatment planner.

The convergence map of the agent training process, evaluated based on the average plan score for the validation patient cases, is shown in Figure 3. As illustrated, the plan score gradually approaches the maximum value of 9 as training progresses, with reduced fluctuations until approximately 200,000 steps. Beyond this point, performance becomes unstable, exhibiting large fluctuations, which we interpret as overfitting to the training cases. Therefore, we select the policy obtained at an earlier convergence point, around step 120,500, for testing.

The convergence map of the ACER‐based virtual treatment planner VTP training process, evaluated based on the average plan score for the validation patient cases. ACER, actor critic with experience replay; VTP, virtual treatment planner.

3.2. Testing results

The results for test group 1 are shown in Figure 4a,c. In Figure 4a, the patient cases are grouped into eight categories based on their plan scores (group index $i \in (1, 2, …, 8)$ includes cases with plan scores in the range of $[i, i + 1)$ ). Before ACER‐guided treatment planning, plan scores are distributed broadly from 2 to 9, with a mean score and standard deviation (std.) of $6.20 \pm 2.01$ . After ACER‐guided treatment planning, 42 out of 49 cases achieve a full score of 9, 1 case reaches 8.9, 3 cases score 8, and 3 cases score between 7 and 8. The corresponding mean and std. are $8.85 \pm 0.43$ . In comparison, DQN‐guided treatment planning improves the plan score from $6.18 \pm 1.75$ to $8.14 \pm 1.27$ for 50 patient cases from the same patient dataset. ¹⁶

(a,b) The plan score distributions for 49 test cases generated under trivial treatment planning parameter (TPP) settings and 147 test cases generated under random TPP settings from 49 patient cases in dataset 1, respectively, before and after actor critic with experience replay (ACER)‐guided treatment planning. The histogram width is set to 1. (c,d) The mean and standard deviation of the plan score distributions before and after ACER‐based treatment planning for the cases shown in (a) and (b), respectively. The groups in (c) and (d) correspond one‐to‐one with the histogram distributions in (a) and (b). ACER‐based treatment planning significantly improves plan quality, achieving a mean score close to 9 across all plan groups. ACER, actor critic with experience replay, TPP, treatment planning parameter.

In Figure 4c, the same 49 patient cases are divided into 8 groups based on their initial plan scores. The mean and standard deviation of the plan scores for each group, both before and after ACER‐based treatment planning, are plotted. It is evident that after ACER‐guided treatment planning, the plan scores are uniformly improved, approaching 9 across all patient groups, including those with very low initial scores below 3 (patient group 1). In contrast, DQN‐based treatment planning shows that some patients with low initial scores could not be efficiently improved (as depicted in Figure 5a in Sprouts et al. ¹⁶ )

(a–c) The dose colorwash for a representative test case in step 0, step 10, and final step 16, respectively, under actor critic with experience replay (ACER)‐based automatic treatment planning. The contours are represented in black for the prostate target, blue for the rectum, and green for the bladder. In the color bar, color “1” corresponds to the prescription dose level. (d) The corresponding dose volume histograms (DVHs) for steps in (a–c). (e) The TPP tuning choices made by ACER for each planning step. (f) The corresponding plan scores over the 16 steps (the maximum score is 9). ACER, actor critic with experience replay; TPP, treatment planning parameter.

We illustrate how ACER‐based VTP observes an intermediate treatment plan and makes the TPP adjustment decision for a representative testing case in Figure 5. As is shown in Figure 5a,d, at the initial step, the plan fails to spare the bladder, partially fails to spare the rectum, and has a hotspot in the PTV, resulting in a low initial plan score of 2. The VTP observes this plan and decides to lower the threshold dose value for the bladder ( $t_{BLA}$ ) in the first step, which improves the plan score to 4 by fully sparing the bladder volume. It then continues to enhance rectum sparing by lowering the threshold value for the rectum ( $t_{REC}$ ), However, these adjustments result in an even hotter PTV. To address this issue, over the next 14 steps, the ACER‐based VTP reduces the priorities for the OARs ( $λ_{REC}$ and $λ_{BLA}$ ), lowers threshold dose value in PTV ( $t_{PTV}$ ), and increases the PTV priority ( $λ_{PTV}$ ) until reaching a score of 9. These actions relax OAR constraints while tightening PTV constraints, mirroring the adjustments a human dosimetrist would make. This indicates that the ACER‐based agent exhibits a human‐like approach to TPP tuning.

We then expanded our evaluation to test groups 2 through 6, highlighting variations in random TPP initialization, distinct beam configurations, and differences in patient anatomy compared to the training set and test group 1. Representative treatment planning cases are presented in Figure 6, and statistical results are summarized in Table 7. Additionally, results for test group 2 are shown in histogram format in Figure 4b,d to provide a side‐by‐side comparison with test group 1.

Representative treatment planning cases from the training and test groups. The first three columns show dose color wash maps from the initial treatment plan, an intermediate step, and the final treatment plan generated by the automatic treatment planning process guided by the ACER‐based agent. The last column presents the corresponding DVHs for these three steps. ACER, actor‐critic with experience replay; DVHs, dose‐volume histogram.

TABLE 7.

Statistics of the treatment planning results for the six test groups (Table 5).

Test groups

{Score}_{i}

{Score}_{f}

P_{9}

(%)

N_{Steps}

6.2

\pm

2.0

8.9

\pm

0.4

85.7

18.8

\pm

7.3

6.3

\pm

1.6

8.9

\pm

0.3

91.8

19.2

\pm

5.5

3.9

\pm

0.3

9.0

\pm

0.0

100.0

18.9

\pm

2.4

4.2

\pm

1.0

9.0

\pm

0.0

100.0

18.6

\pm

4.5

4.2

\pm

1.0

8.8

\pm

0.4

80.0

21.9

\pm

5.4

8.0

\pm

0.1

9.0

\pm

0.2

95.6

16.4

\pm

6.7

Open in a new tab

Note: Reported are mean $\pm$ std of initial ( ${Score}_{i}$ ) and final ( ${Score}_{f}$ ) treatment plan scores, percentage reaching full plan score ( $P_{9}$ ), and mean $\pm$ std of planning steps ( $N_{Steps}$ ).

In Figure 4b, the patient cases are grouped into 7 categories based on their plan scores, using the same method as in Figure 4a. Compared to the initial treatment plans generated under fixed TPP initialization (Figure 4a), treatment plans generated using random TPP initialization exhibit an obviously different score distribution. Before ACER‐guided treatment planning, the mean and std. of the treatment plans are $6.33 \pm 1.65$ . After planning, 135 out of 147 cases achieve a full score of 9, and 12 cases reach a score of 8, with no cases scoring below 8. The corresponding mean and std. are $8.92 \pm 0.27$ . The corresponding patient group distribution is shown in Figure 4d, which also shows that ACER‐guided treatment planning improve the plan score uniformly across all patient groups.

In test group 3, characterized by significantly higher bladder‐to‐PTV overlap compared to the training cases (Table 4), the initial plan scores are notably low, averaging $3.9 \pm 0.3$ . Despite this challenge, the ACER‐based agent successfully elevated the treatment plan scores to the maximum value of 9 for all 30 test cases.

In test groups 4 and 5, we specifically evaluated the ACER agent's performance on automatic treatment planning with beam‐angle configurations different from those used in training. For each group, initial treatment plans were generated by randomly selecting 15 patient cases from Dataset 1 (one plan per patient) and generating 15 additional plans from Dataset 2, resulting in an average initial plan score of $4.2 \pm 1.0$ across both groups. The ACER agent performed robustly with the new 7‐beam configuration (test group 4), successfully elevating all cases to a score of 9. However, it faced slight challenges with the six‐beam configuration (test group 5), likely due to the intrinsic complexity of optimizing fewer beam angles. Although test group 5 represented the lowest performance among the test groups, the agent still managed to elevate 10 out of 15 cases from Dataset 1 and 14 out of 15 cases from Dataset 2 to a score of 9. The remaining cases achieved a score of 8, with the exception of one case from Dataset 1, which attained a score of 7.34.

Lastly, test group 6 emphasizes substantial differences in beam configurations and anatomical characteristics compared to both the training cases and previous test groups. In this group, initial plans frequently achieved a score of 8, primarily due to the relatively small overlap between OARs and the PTV. Although all OARs were adequately spared, the PTV dose distributions were generally hotter than those observed in other datasets. Following ACER‐guided treatment planning, the network successfully elevated 95.6% of these cases to a score of 9.

Combing all test cases, the mean $\pm$ std. of the plan score distributions before ACER‐based treatment planning is $6.17 \pm 1.90$ . After implementing ACER‐based treatment planning, $92.29 %$ of the cases achieve a perfect score of 9, with only $6.65 %$ scoring between 8 and 9, $0.01 %$ scoring between 7 and 8, and no cases scoring below 7. The mean $\pm$ std. of the final scores is $8.92 \pm 0.29$ .

For all testing groups, the network demonstrates consistent performance regarding the number of TPP tuning steps required, with only a slight increase observed in more challenging planning scenarios (e.g., test group 5) (Table 7). This highlights the stability and robustness of the trained ACER‐based agent in automatic treatment planning. However, differences in dose deposition matrix sizes lead to varying computation times during the inverse optimization process performed by the TPS. At each step, the ACER‐based VTP agent rapidly selects an action upon receiving DVH input, typically on the order of $10^{- 4}$ seconds. Nonetheless, the inverse treatment planning time under a given set of TPPs currently ranges from 3 to 28 seconds using our in‐house TPS. We anticipate that integrating this VTP with a more efficient TPS could significantly enhance the overall efficiency of the automatic treatment planning process.

Finally, since the ACER‐based treatment planning agent utilizes a stochastic policy, it is important to understand its policy behavior to ensure stability in guiding treatment planning. To investigate this, we identify five common cases from ACER‐guided and DQN‐guided treatment plannings, each with an initial plan score of 5 due to the failure in rectum dose sparing. The mean and standard deviation of the policy distributions and Q‐value distributions for the leading 18 actions in these cases are shown in Figure 7a,b, respectively. As expected, both networks exhibit relatively stable preferences for certain actions when faced with similar input plans. However, it is noteworthy that ACER strongly prioritizes the reasonable action for rectum sparing, that is to “decrease $t_{REC}$ ,” with a magnitude of order higher than other actions. It also significantly suppresses non‐reasonable actions like “increase $t_{REC}$ ” or “decrease $λ_{REC}$ ,”, which are several orders of magnitude lower than the leading action. The sub‐leading actions, such as “decrease $t_{BLA}$ ” or “increase $λ_{REC}$ ,” are either reasonable or have unclear effects. This type of policy distribution allows the agent to explore the action space while maintaining effectiveness. In contrast, DQN does not significantly differentiate between reasonable and non‐reasonable actions. In fact, the leading two actions in DQN have contradictory effects on rectum dose sparing. This further highlights the superior performance of the ACER‐based VTP agent.

Policy distributions from the actor critic with experience replay (ACER) agent (top row) and Q value distributions from the deep Q network (DQN) agent ¹⁶ (bottom row) for five patient cases with similar plan qualities. In all patient cases, the plans fail to spare the rectum with losing all 4 credits for the 4 rectum dose‐volume criteria as shown in Table 2. ACER, actor critic with experience replay; DQN, deep Q network.

3.3. Adversarial attack

We randomly choose 30 treatment plans and apply one FGSM attack under each value of $ε$ for both ACER and DQN networks. With three $ε$ values of 0.001, 0.01, and 0.1, a total of 180 attacks are performed. The results are shown in Figure 8 and Table 8.

The illustration of the attack effect from FGSM on the DQN ¹⁶ and the ACER agent based treatment planning. (a) The DVH distributions before and after the perturbation with $ε = 0.001$ . The corresponding DVH distributions generated by the in‐house TPS under the tuned TPPs by DQN (b) and ACER‐based agent (c) before and after the perturbation. ACER, actor critic with experience replay; DQN, deep Q network; DVH, dose volume histogram; FGSM, fast gradient sign method; TPP, treatment planning parameter; TPS, treatment planning system.

TABLE 8.

The statistical performance of ACER‐ and DQN‐based agents under three levels of adversarial attacks.

ACER

DQN

ε = 0.001

success rate

0.0 %

30.0 %

Δ P_{1}

- 0.194 % \pm 0.25 %

- 18.55 % \pm 7.73 %

ε = 0.01

success rate

3.3 %

100 %

Δ P_{1}

0.08 % \pm 10.24 %

- 87.95 % \pm 17.98 %

ε = 0.1

success rate

53.3 %

93.3 %

Δ P_{1}

- 36.81 % \pm 46.49 %

- 68.99 % \pm 68.94 %

Δ P_{12}

67.08 % \pm 21.30 %

7.19 % \pm 5.19 %

Open in a new tab

Abbreviations: ACER, actor critic with experience replay; DQN, deep Q networks.

An illustration of the attack effect with perturbation to the input states at the $ε = 0.001$ level is shown in Figure 8. Figure 8a shows the DVH distributions before and after the perturbation with $ε = 0.001$ , which is not distinguishable by naked eyes. The initial plan has a score of 3, which partially fails to spare the rectum and the bladder (both lose 3 points following criteria in Table 2). Before the perturbation is applied, DQN produces a $Q$ ‐value distribution that maximizes the action “decrease $t_{REC}$ ,” which improves the plan score to 4. However, after the perturbation, the Q value for the initial top action is reduced by $16.52 %$ , and the optimal action changes to “increase $λ_{REC}$ ,” slightly decreasing the plan score to 2.26. The treatment plans generated under original action and perturbation changed action are shown in Figure 8b. In contrast, for the same patient case, ACER's policy distribution prioritizes “decreasing $t_{BLA}$ ” and “decrease $t_{REC}$ ” as the top two leading actions, with probabilities of 0.86 and 0.19, respectively (the sum of all action probabilities is 1). The ACER‐based stochastic policy selects “decrease $t_{BLA}$ ,” which improves the plan score to 5. After the perturbation is applied, the policy distribution retains the same ranking of actions, with the probabilities of the leading two actions changing by $- 0.05 %$ and $0.35 %$ , respectively. The new treatment plans generated are shown in Figure 8c. This demonstrates ACER's stable performance under adversarial attack.

The statistical results for all 180 attacks are shown in Table 8. As the perturbation level increases from $ε = 0.001$ to $ε = 0.1$ , the attack success rate increases from $0.0 %$ to $53.3 %$ for the ACER agent, while it rises from $30.0 %$ to $93.3 %$ for the DQN agent. This indicates that the ACER network demonstrates greater robustness compared to the DQN. To further understand the effects of the attacks, we analyzed changes in action probabilities for ACER and Q values for DQN under various attack levels. We compared these with the differences in probabilities and Q values between the top two leading actions. The analysis reveals that the probability difference between the top two leading actions averages at $67.08 % \pm 21.30 %$ for the ACER agent. Comparing to it, the mean probability change for the leading action in ACER is negligible at $ε = 0.001$ and 0.01 levels. In contrast, for the DQN agent, the mean Q value changes at all perturbation levels are significantly greater than the Q value difference between the top two leading actions. This behavior further explains the relative robustness of the ACER agent against FGSM attacks.

4. DISCUSSION

In summary, we have developed an ACER‐based VTP agent that can effectively guide the in‐house TPS for inverse treatment planning with prostate IMRT patient cases as testbed. Despite being trained on a single patient case with random initializations of TPPs, the network has demonstrated superior test performance compared to our previously developed DQN‐based VTP agent. ¹⁶ It has also demonstrated strong generalization, performing well on data from distinct resources, and has exhibited stable performance under adversarial attacks. This raises the question: why does ACER perform so much better?

First, ACER is stochastic policy based, allowing it to explore a wider solution space without suffering from scaling issues typically associated with DQN. The application of the entropy regularization term during the policy update promotes global convergence. Additionally, ACER incorporates various strategies for effectively managing bias and variance during policy updates. These attributes likely contribute to the network's improved performance in treatment planning. Further investigations are needed to explore how these features impact the network's efficacy.

Second, the greater robustness of the ACER agent compared to the DQN agent in the FGSM attack can be understood from two perspectives. First, the stochastic policy strategy of ACER contributes to better convergence compared to DQN. This aligns with findings that show A3C outperforms DQN in adversarial attacks on Atari games. ¹⁹ Second, unlike tasks such as Atari, which typically have only one optimal action, inverse treatment planning often involves multiple reasonable TPP tuning actions that can enhance plan quality. A well‐trained ACER network that effectively prioritizes these reasonable TPP tuning strategies over less suitable options can contribute to its stable performance, even under adversarial attacks.

In this study, we utilize an in‐house TPS for inverse treatment planning during ACER agent training, which may raise concerns about the ACER agent's ability to operate a commercial TPS for high‐quality treatment planning. However, we have previously demonstrated the high efficacy of a DRL agent trained with our in‐house TPS in operating commercial TPS to achieve high‐quality treatment planning. ¹⁶ This suggests that the current ACER‐based agent, trained with the same in‐house TPS, can also function effectively with commercial TPS. We plan to evaluate the performance of the ACER agent on commercial TPS in future work.

Additionally, we use a relatively simple plan quality evaluation system in developing the ACER agent, which may result in differences in plan quality compared to those assessed under clinical evaluation systems. However, we want to emphasize that the primary goal of this work is to demonstrate the efficacy of the ACER agent in treatment plan parameter tuning under a specific reward system. We do not anticipate convergence issues under different evaluation systems.

Another limitation is that we utilize the computationally‐efficient FGSM adversarial attack to test the robustness of ACER‐based VTP agent. Although this method has shown effectiveness in fooling A3C network on other tasks, ¹⁹ it may not be effective in attacking this VTP agent where multiple suitable actions exist. In our future work, we will test the robustness of this VTP agent by employing stronger adversarial attacks.

Finally, in this initial study, we employ the ACER network with a discretized action space, consistent with our previous DQN‐based development efforts. However, to make this tool practical for treatment planning in real clinical settings, it is essential to train a DRL agent capable of tuning each TPP in a continuous space, and ideally, tuning multiple TPPs in a single planning step. Based on the results of this study, we have identified the potential of ACER to produce a prioritized TPP tuning space that could facilitate multi‐TPP tuning in one step. To advance this approach, we may need to further optimize the TPP tuning space to prioritize reasonable TPP strategies effectively. This could be achieved by enhancing the entropy regularization term, activating the TRPO strategy, and enriching the training datasets. Additionally, ACER has a continuous‐action counterpart that can be used to explore for continuous TPP tuning. These will be our next step work in the future.

This also raises another important issue worth exploring: the TPP tuning hyperspace. Since the beginning of inverse treatment planning, this hyperspace has remained largely unexplored, with dosimetrists relying on intuition and experience to navigate it. A DRL‐based VTP could effectively map this hyperspace and generate insights that dosimetrists can learn from, as illustrated by the prioritized actions derived from the ACER policy distribution. We believe that further investigation into the continuous and simultaneous tuning of multiple TPPs using advanced DRL strategies can not only enhance automation in real‐time treatment planning but also contribute to a deeper understanding of the TPP tuning hyperspace. Much like how DRL algorithms have revolutionized game strategies, like in chess, a well‐designed DRL agent has the potential to reshape conventional approaches to TPP tuning, ultimately guiding the radiotherapy clinics toward more effective treatment planning.

5. CONCLUSION

We have trained a deep reinforcement network with actor–critic experience replay technique for automatic treatment planning. The trained network can guide the in‐house treatment planning system for high quality treatment planning when using the prostate cancer IMRT as a testbed. The trained network has high generality, which performs well over patient data from sources distinct from the training patient dataset. It also shows high robustness over adversarial attack, demonstrating its potential in practical treatment planning in clinical settings.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

Supporting information

Supporting Information

MP-52-0-s001.pdf^{(100.6KB, pdf)}

ACKNOWLEDGMENTS

This work is partially supported by the rising stars program of UT system and national institutes of health (NIH)/national cancer institute (NCI) grants R37CA214639, R01CA254377, and R01CA237269.

Abrar MM, Sapkota P, Sprouts D, Jia X, Chi Y. Actor critic with experience replay‐based automatic treatment planning for prostate cancer intensity modulated radiotherapy. Med Phys. 2025;52:e17915. 10.1002/mp.17915

REFERENCES

1. Webb S. The physical basis of IMRT and inverse planning. Br J Radiol. 2003;76(910):678‐689. [DOI] [PubMed] [Google Scholar]
2. Li X, Zhang J, Sheng Y, et al. Automatic IMRT planning via static field fluence prediction (AIP‐SFFP): a deep learning algorithm for real‐time prostate treatment planning. Phys Med Biol. 2020;65(17):175014. [DOI] [PubMed] [Google Scholar]
3. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016:770‐778.
4. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017:2261‐2269.
5. Vandewinckele L, Willems S, Lambrecht M, Berkovic P, Maes F, Crijns W. Treatment plan prediction for lung IMRT using deep learning based fluence map generation. Physica Med. 2022;99:44‐54. [DOI] [PubMed] [Google Scholar]
6. Lempart M, Benedek H, Gustafsson CJ, et al. Volumetric modulated arc therapy dose prediction and deliverable treatment plan generation for prostate cancer patients using a densely connected deep learning model. Phys Imagi Rad Oncol. 2021;19:112‐119. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Ronneberger O, Fischer P, Brox T. U‐net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer‐Assisted Intervention‐MICCAI 2015: 18th International Conference, Munich, Germany, October 5‐9, 2015, proceedings, part III 18 Springer; 2015:234‐241.
8. Kandalan RN, Nguyen D, Rezaeian NH, et al. Dose prediction with deep learning for prostate cancer radiation therapy: model adaptation to different treatment planning practices. Radiother Oncol. 2020;153:228‐235. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science. 2019;363(6433):1287‐1289. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; 2018. [Google Scholar]
11. Mnih V, Kavukcuoglu K, Silver D, et al. Human‐level control through deep reinforcement learning. Nature. 2015;518(7540):529‐533. [DOI] [PubMed] [Google Scholar]
12. Shen C, Gonzalez Y, Klages P, et al. Intelligent inverse treatment planning via deep reinforcement learning, a proof‐of‐principle study in high dose‐rate brachytherapy for cervical cancer. Phys Med Biol. 2019;64(11):115013. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Shen C, Nguyen D, Chen L, et al. Operating a treatment planning system using a deep‐reinforcement learning‐based virtual treatment planner for prostate cancer intensity‐modulated radiation therapy treatment planning. Med Phys. 2020;47(6):2329‐2336. [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Shen C, Chen L, Gonzalez Y, Jia X. Improving efficiency of training a virtual treatment planner network via knowledge‐guided deep reinforcement learning for intelligent automatic treatment planning of radiotherapy. Med Phys. 2021;48(4):1909‐1920. [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Shen C, Chen L, Jia X. A hierarchical deep reinforcement learning framework for intelligent automatic treatment planning of prostate cancer intensity modulated radiation therapy. Phys Med Biol. 2021;66(13):134002. [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Sprouts D, Gao Y, Wang C, Jia X, Shen C, Chi Y. The development of a deep reinforcement learning network for dose‐volume‐constrained treatment planning in prostate cancer intensity modulated radiotherapy. BioMed Phys & Engi Exp. 2022;8(4):045008. [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Gao Y, Shen C, Jia X, Park YK. Implementation and evaluation of an intelligent automatic treatment planning robot for prostate cancer stereotactic body radiation therapy. Radiother Oncol. 2023;184:109685. [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Zhu J, Wu F, Zhao J. An overview of the action space for deep reinforcement learning. In: Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence , 2021:1–10.
19. Huang S, Papernot N, Goodfellow I, Duan Y, Abbeel P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284 . 2017.
20. Wang Z, Bapst V, Heess N, et al. Sample efficient actor‐critic with experience replay. arXiv preprint arXiv:1611.01224. 2016. [Google Scholar]
21. Abrar MM, Jia X, Chi Y. Deep Actor Critic Experience Replay (ACER) Reinforcement Learning‐Based Automatic Treatment Planning for Intensity Modulated Radiotherapy. In: AAPM 66th Annual Meeting & Exhibition, AAPM; 2024.
22. Mnih V, Badia AP, Mirza M, et al. Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning PMLR; 2016:1928‐1937.
23. Lin LJ. Self‐improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn. 1992;8:293‐321. [Google Scholar]
24. Varian. Eclipse Photon and Electron Algorithms Reference Guide. 2014.
25. Sutton RS, McAllester D, Singh S, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst. 1999;12:1057–1063. [Google Scholar]
26. Munos R, Stepleton T, Harutyunyan A, Bellemare M. Safe and efficient off‐policy reinforcement learning. Adv Neural Inf Process Syst. 2016;29:1054–1062. [Google Scholar]
27. Schulman J, Levine S, Abbeel P, Jordan M, Moritz P. Trust region policy optimization. In: International Conference on Machine Learning . PMLR; 2015:1889–1897. [Google Scholar]
28. Tieleman T. Lecture 6.5‐rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Net Mach Learn. 2012;4(2):26. [Google Scholar]
29. Craft D, Bangert M, Long T, Papp D, Unkelbach J. Shared data for intensity modulated radiation therapy (IMRT) optimization research: the CORT dataset. GigaScience. 2014;3(1):2047–217X. [DOI] [PMC free article] [PubMed] [Google Scholar]
30. Breedveld S, Heijmen B. Data for TROTS–the radiotherapy optimisation test set. Data in Brief. 2017;12:143‐149. [DOI] [PMC free article] [PubMed] [Google Scholar]
31. Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. 2013. [Google Scholar]
32. Biggio B, Roli F. Wild patterns: Ten years after the rise of adversarial Mach Learn. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security , 2018:2154–2156.
33. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. 2014. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information

MP-52-0-s001.pdf^{(100.6KB, pdf)}

[mp17915-bib-0001] 1. Webb S. The physical basis of IMRT and inverse planning. Br J Radiol. 2003;76(910):678‐689. [DOI] [PubMed] [Google Scholar]

[mp17915-bib-0002] 2. Li X, Zhang J, Sheng Y, et al. Automatic IMRT planning via static field fluence prediction (AIP‐SFFP): a deep learning algorithm for real‐time prostate treatment planning. Phys Med Biol. 2020;65(17):175014. [DOI] [PubMed] [Google Scholar]

[mp17915-bib-0003] 3. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016:770‐778.

[mp17915-bib-0004] 4. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017:2261‐2269.

[mp17915-bib-0005] 5. Vandewinckele L, Willems S, Lambrecht M, Berkovic P, Maes F, Crijns W. Treatment plan prediction for lung IMRT using deep learning based fluence map generation. Physica Med. 2022;99:44‐54. [DOI] [PubMed] [Google Scholar]

[mp17915-bib-0006] 6. Lempart M, Benedek H, Gustafsson CJ, et al. Volumetric modulated arc therapy dose prediction and deliverable treatment plan generation for prostate cancer patients using a densely connected deep learning model. Phys Imagi Rad Oncol. 2021;19:112‐119. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0007] 7. Ronneberger O, Fischer P, Brox T. U‐net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer‐Assisted Intervention‐MICCAI 2015: 18th International Conference, Munich, Germany, October 5‐9, 2015, proceedings, part III 18 Springer; 2015:234‐241.

[mp17915-bib-0008] 8. Kandalan RN, Nguyen D, Rezaeian NH, et al. Dose prediction with deep learning for prostate cancer radiation therapy: model adaptation to different treatment planning practices. Radiother Oncol. 2020;153:228‐235. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0009] 9. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science. 2019;363(6433):1287‐1289. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0010] 10. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; 2018. [Google Scholar]

[mp17915-bib-0011] 11. Mnih V, Kavukcuoglu K, Silver D, et al. Human‐level control through deep reinforcement learning. Nature. 2015;518(7540):529‐533. [DOI] [PubMed] [Google Scholar]

[mp17915-bib-0012] 12. Shen C, Gonzalez Y, Klages P, et al. Intelligent inverse treatment planning via deep reinforcement learning, a proof‐of‐principle study in high dose‐rate brachytherapy for cervical cancer. Phys Med Biol. 2019;64(11):115013. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0013] 13. Shen C, Nguyen D, Chen L, et al. Operating a treatment planning system using a deep‐reinforcement learning‐based virtual treatment planner for prostate cancer intensity‐modulated radiation therapy treatment planning. Med Phys. 2020;47(6):2329‐2336. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0014] 14. Shen C, Chen L, Gonzalez Y, Jia X. Improving efficiency of training a virtual treatment planner network via knowledge‐guided deep reinforcement learning for intelligent automatic treatment planning of radiotherapy. Med Phys. 2021;48(4):1909‐1920. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0015] 15. Shen C, Chen L, Jia X. A hierarchical deep reinforcement learning framework for intelligent automatic treatment planning of prostate cancer intensity modulated radiation therapy. Phys Med Biol. 2021;66(13):134002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0016] 16. Sprouts D, Gao Y, Wang C, Jia X, Shen C, Chi Y. The development of a deep reinforcement learning network for dose‐volume‐constrained treatment planning in prostate cancer intensity modulated radiotherapy. BioMed Phys & Engi Exp. 2022;8(4):045008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0017] 17. Gao Y, Shen C, Jia X, Park YK. Implementation and evaluation of an intelligent automatic treatment planning robot for prostate cancer stereotactic body radiation therapy. Radiother Oncol. 2023;184:109685. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0018] 18. Zhu J, Wu F, Zhao J. An overview of the action space for deep reinforcement learning. In: Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence , 2021:1–10.

[mp17915-bib-0019] 19. Huang S, Papernot N, Goodfellow I, Duan Y, Abbeel P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284 . 2017.

[mp17915-bib-0020] 20. Wang Z, Bapst V, Heess N, et al. Sample efficient actor‐critic with experience replay. arXiv preprint arXiv:1611.01224. 2016. [Google Scholar]

[mp17915-bib-0021] 21. Abrar MM, Jia X, Chi Y. Deep Actor Critic Experience Replay (ACER) Reinforcement Learning‐Based Automatic Treatment Planning for Intensity Modulated Radiotherapy. In: AAPM 66th Annual Meeting & Exhibition, AAPM; 2024.

[mp17915-bib-0022] 22. Mnih V, Badia AP, Mirza M, et al. Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning PMLR; 2016:1928‐1937.

[mp17915-bib-0023] 23. Lin LJ. Self‐improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn. 1992;8:293‐321. [Google Scholar]

[mp17915-bib-0024] 24. Varian. Eclipse Photon and Electron Algorithms Reference Guide. 2014.

[mp17915-bib-0025] 25. Sutton RS, McAllester D, Singh S, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst. 1999;12:1057–1063. [Google Scholar]

[mp17915-bib-0026] 26. Munos R, Stepleton T, Harutyunyan A, Bellemare M. Safe and efficient off‐policy reinforcement learning. Adv Neural Inf Process Syst. 2016;29:1054–1062. [Google Scholar]

[mp17915-bib-0027] 27. Schulman J, Levine S, Abbeel P, Jordan M, Moritz P. Trust region policy optimization. In: International Conference on Machine Learning . PMLR; 2015:1889–1897. [Google Scholar]

[mp17915-bib-0028] 28. Tieleman T. Lecture 6.5‐rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Net Mach Learn. 2012;4(2):26. [Google Scholar]

[mp17915-bib-0029] 29. Craft D, Bangert M, Long T, Papp D, Unkelbach J. Shared data for intensity modulated radiation therapy (IMRT) optimization research: the CORT dataset. GigaScience. 2014;3(1):2047–217X. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0030] 30. Breedveld S, Heijmen B. Data for TROTS–the radiotherapy optimisation test set. Data in Brief. 2017;12:143‐149. [DOI] [PMC free article] [PubMed] [Google Scholar]

[mp17915-bib-0031] 31. Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. 2013. [Google Scholar]

[mp17915-bib-0032] 32. Biggio B, Roli F. Wild patterns: Ten years after the rise of adversarial Mach Learn. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security , 2018:2154–2156.

[mp17915-bib-0033] 33. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. 2014. [Google Scholar]

PERMALINK

Actor critic with experience replay‐based automatic treatment planning for prostate cancer intensity modulated radiotherapy

Md Mainul Abrar

Parvat Sapkota

Damon Sprouts

Xun Jia

Yujie Chi

Abstract

Background

Purpose

Methods

Results

Conclusions

1. INTRODUCTION

2. METHODS AND MATERIALS

2.1. Overall architecture

FIGURE 1.

2.2. In‐house treatment planning system (TPS)

2.3. Actor critic with experience replay (ACER)‐based virtual treatment planner (VTP) system

2.3.1. Working principle of ACER

2.3.2. Establishment of the VTP system upon ACER architecture

2.3.3. Testbed

TABLE 1.

TABLE 2.

FIGURE 2.

2.3.4. Datasets

TABLE 3.

TABLE 4.

TABLE 5.

2.4. Adversarial attack

3. RESULTS

3.1. Training results

TABLE 6.

FIGURE 3.

3.2. Testing results

FIGURE 4.

FIGURE 5.

FIGURE 6.

TABLE 7.

FIGURE 7.

3.3. Adversarial attack

FIGURE 8.

TABLE 8.

4. DISCUSSION

5. CONCLUSION

CONFLICT OF INTEREST STATEMENT

Supporting information

ACKNOWLEDGMENTS

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases