Author manuscript; available in PMC: 2025 May 20.
Published in final edited form as: Phys Med Biol. 2022 Jul 27;67(15):10.1088/1361-6560/ac7fd6. doi: 10.1088/1361-6560/ac7fd6

Mitigating the uncertainty in small field dosimetry by leveraging machine learning strategies

Wei Zhao 1, Yong Yang 1, Lei Xing 1, Cynthia F Chuang 1,*, Emil Schüler 2,3,*
PMCID: PMC12091865  NIHMSID: NIHMS2056269  PMID: 35803256

Abstract

Small field dosimetry is significantly different from the dosimetry of broad beams due to loss of electron side scatter equilibrium, source occlusion, and effects related to the choice of detector. However, use of small fields is increasing with the increase in indications for intensity-modulated radiation therapy (IMRT) and stereotactic body radiation therapy (SBRT), and thus the need for accurate dosimetry is ever more important. Here we propose to leverage machine learning (ML) strategies to reduce the uncertainties and increase the accuracy in determining small field output factors (OFs).

Linac OFs from a Varian TrueBeam STx were calculated either by the treatment planning system (TPS) or measured with a W1 scintillator detector at various multi-leaf collimator (MLC) positions, jaw positions, and with and without contribution from leaf-end transmission. The fields were defined by the MLCs with the jaws at various positions. Field sizes between 5 and 100 mm were evaluated. Separate ML regression models were generated based on the TPS calculated or the measured datasets.

Accurate predictions of small field OFs at different field sizes (FSs) were achieved independent of jaw and MLC position. A mean and maximum % relative error (RE) of 0.38±0.39% and 3.62%, respectively, for the best-performing models based on the measured datasets were found. The prediction accuracy was independent of contribution from leaf-end transmission.

Several ML models for predicting small field OFs were generated, validated, and tested. Incorporating these models into the dose calculation workflow could greatly increase the accuracy and robustness of dose calculations for any radiotherapy delivery technique that relies heavily on small fields.

1. INTRODUCTION

Stereotactic radiosurgery has long been the major radiation therapy (RT) modality to use small fields. However, increasing indications for stereotactic body radiation therapy (SBRT) have led to corresponding increases in the use of small fields or small segments in everyday treatments.(Palmans et al., 2018; Followill et al., 2012; Timmerman and Xing, 2009) Accurate dosimetry is thus especially important in these cases, as these types of treatments are generally performed with only a few high-dose fractions. Indeed, errors related to the dosimetry of small fields have resulted in the mistreatment of numerous patients, with delivered doses up to 50% higher than those prescribed.(Derreumaux et al., 2008; Bogdanich and Ruiz, 2010)

Small field dosimetry is complicated by several factors not readily seen in the standard delivery of broad beams.(Andreo, 2018; Palmans et al., 2018; IAEA TRS-483, 2017; Das et al., 2008) (i) The loss of lateral charged particle equilibrium (LCPE) in small fields means that absorbed dose and kerma are no longer related by a constant factor. The higher the energy of the primary beam, the larger the field size (FS) at which LCPE is lost.(Li et al., 1995) (ii) Partial source occlusion will cause a direct photon penumbra overlap with a reduction in output, as the detector will not be irradiated by the full source.(Andreo, 2018; Das et al., 2008) This effect also causes a breakdown in the relationship between collimator settings and the full-width-half-maximum (FWHM) of the beam, which is normally used to define the FS in broad beam dosimetry. (iii) The dosimetric uncertainties of small fields are also related to the choice of detector.(Andreo, 2018; Sánchez-Doblado et al., 2007) The type, design, and size of the detector influence the response to a greater degree for small fields than for broad beams, and corrections for volume averaging and fluence perturbation are especially concerning.(Das et al., 2008; Ding et al., 2007a; Ding et al., 2007b) (iv) Finally, the mean energy of the beam in the phantom/patient increases with decreasing FS, with a corresponding decrease in the low energy component.(Seuntjens et al., 2014; Benmakhlouf et al., 2014) This will affect the response of any detector that is designed with materials of an effective atomic number different from water.

The issues of loss of LCPE, partial source occlusion, and volume averaging also complicate the calculations of absorbed dose in modern treatment planning systems (TPSs) that use model-based algorithms.(Followill et al., 2012) Loss of LCPE may be manifested as irregularities in dose calculations of small fields due to simplifications in the modeling of the lateral electron scattering.(Seuntjens et al., 2014) Such irregularities will be most prominent in highly heterogeneous media. Further, commissioning a TPS with incorrect or extrapolated beam data will cause systematic errors in dose delivery.(Sharma et al., 2017; Andreo, 2018; Lechner et al., 2018; Bogdanich and Ruiz, 2010; Derreumaux et al., 2008) The resulting effects on the overall treatment plan depend on the fraction of fields/segments used that fall into the category of small fields.(Lechner et al., 2020)

Machine learning (ML) could be a powerful tool to reduce the uncertainties in delivered dose associated with small field dosimetry. ML has previously been used successfully in machine quality assurance (QA),(Carlson et al., 2016) patient-specific QA,(Valdes et al., 2017; Chan et al., 2020; Fan et al., 2020) proton monitor unit and dose calculations,(Sun et al., 2018; Nomura et al., 2020) and beam data predictions for linac commissioning.(Zhao et al., 2020) Modeling the correlation between field-specific parameters and output factors (OFs) would allow prediction of OFs for a set of features different from the one the model was trained on. These models could be designed to be linac model-specific instead of machine-specific, as many vendors use the same beam model in TPSs for all machines of a specific model. Indeed, modern linacs generally show good agreement with each other when it comes to measured beam data,(Glide-Hurst et al., 2013) allowing institutions to use beam data provided by the vendor for TPS modeling, with spot checks performed by each institution. Adopting an ML approach would also be in line with recommendations from the American Association of Physicists in Medicine on the increased use of automation in the clinic to reduce and eliminate error-prone tasks in clinical medical physics.(Huq et al., 2016)

The overall goal of this study was to investigate ML-based strategies to generate predictive models of small field OFs. To achieve this goal, we i) evaluated the performance characteristics of the TPS calculated OF in relation to measured data, ii) developed regression models to predict OFs, and iii) tested these models against data acquired for situations not previously seen by the models, or on data acquired on a different linac with a different detector. Large differences between the TPS calculated and measured OFs were found for the smallest fields evaluated. Utilizing an ML-based approach using measured data greatly reduced the relative error in OF determination. These ML models could serve as a basis for QA or as a correction factor for dose calculations to increase the accuracy and safety of patient treatments.

2. MATERIALS AND METHODS

2.A. Machine

A Varian TrueBeam STx (Varian, Palo Alto, CA) linac was used in the current study. The linac was equipped with high-definition multi-leaf collimators (HD120 MLC).(Chang et al., 2008) The central 8 cm of the HD120 MLCs has a leaf width of 2.5 mm projected at the isocenter. Beyond 8 cm, the projected width at the isocenter is 5 mm. The machine was calibrated to deliver 1 cGy/MU at the depth of maximum dose (dmax) in water for a 10×10cm2 field at a source-to-surface distance (SSD) of 100 cm, according to the TG-51 recommendations.(Almond et al., 1999) The 6-MV beam was used for all calculations and measurements.

2.B. Field generation and TPS-calculated OFs

In the Eclipse TPS, different FSs were generated through the Eclipse scripting interface (Eclipse API, Varian, Palo Alto, CA). The generated FSs were symmetric around the isocenter. In the central 40×40 mm2 region, leaf openings of 5, 10, 15, 20, 25, 30, and 40 mm were used in different combinations to generate square and rectangular fields defined by the MLCs. For each MLC-defined field, different combinations of jaw sizes were generated, with the FS in the X and Y direction of the jaws always being equal to or greater than the MLC FS. Beyond the central 40×40 mm2, only square fields were generated with opening lengths of 60, 80, and 100 mm. All combinations of MLC and jaw FSs were generated for two scenarios: when the closed leaves meet at the center line (MLC central position) and when the closed leaves meet behind one of the jaws (MLC out position) [Fig. 1(A) and (B)].

Figure 1.

Illustrations of the (A) multileaf collimator (MLC) center and (B) MLC out positions, where the closed leaves meet at the center line or behind one of the jaws, respectively. (C) Workflow for model training and testing. Output factors were collected from different settings and pre-processed accordingly. Data calculated by the treatment planning system (TPS) and measured beam data were collected with various field sizes with MLC at center/out positions. During the training phase, one dataset was left out for validation and the rest were used to build the model. The hyperparameters were optimized during the validation process. Independent calculated data and measured data that were unseen during the training phase were used for evaluation studies.

Dose calculations were made on a simulated water phantom with a size of 30 × 30 × 40 cm3. The source-to-surface distance was set at 90 cm. The dose was scored at the central axis of the beam at a depth of 10 cm. The dose calculation grid resolution was set to 1 mm. Both the analytical anisotropic algorithm (AAA) and the Acuros XB (AXB) algorithm were used in the calculations. All scored doses were normalized to the 100×100 mm2 field with MLC out position. In total, doses from 1570 different combinations of MLC and jaw positions were scored.

2.C. OF measurements

The OFs were measured with an Exradin W1 scintillator detector (Standard Imaging, Middleton, WI). The detector is cylindrical with a length of 3 mm and a diameter of 1 mm. The detector was calibrated according to the directions from the manufacturer.

The simulated geometry (see section 2.B.) was recreated on the linac with some modifications: solid water was used instead of pure water; the detector was placed at 10 cm depth at a source-to-surface distance of 90 cm, with 10 cm solid water for back scatter; and the detector was centered in the field by visually aligning it to the light field cross hairs. To further pinpoint the exact center position of the detector in the radiation field, we followed the recommendations of ICRU 91,(Seuntjens et al., 2014) modified for our setup as follows: A 5×5 mm2 field was delivered several times with small couch movements (0.1 mm step size) in the lateral and longitudinal directions between each delivery to find the maximum signal. The maximum signal corresponded to the exact centering of the detector in the beam. This process was repeated every time new measurements were acquired.

Single delivery of all 1570 fields with different MLC and jaw positions was performed, and ~100 fields were redelivered intermittently to verify the consistency of the measurements. All scored doses were normalized to the 100×100 mm2 field with MLC out position.

Comparisons between measured and TPS-generated OFs were evaluated by calculating the percent relative difference

$$\%\mathrm{RD} = 100\% \times \frac{X_{\mathrm{measured}} - X_{\mathrm{TPS}}}{X_{\mathrm{TPS}}}$$

2.D. Machine learning models and model validation

We defined the OF predictions as a supervised learning problem and evaluated several regression algorithms for prediction accuracy. The two data sets of OFs (TPS calculated and measured OFs) were both used separately to develop prediction models. The Scikit-learn toolbox was used in all cases.(Pedregosa et al., 2011)

2.D.1. Kernel ridge regression

Kernel ridge regression (KRR) is a non-parametric form of ridge regression that optimizes an objective function consisting of a regularization penalty (i.e., the L2-norm) and a data fidelity loss term.(Murphy, 2012) Its aim is to learn a function in the space spanned by a set of kernels by minimizing a squared loss with a squared-norm regularization term. In this study, the features used for OF prediction included the opening sizes for both the MLCs and jaws in the X and Y directions. Although these features are physically independent of each other, they were set to discretized values with a fixed minimum interval (i.e., 5 mm) during the OF calculation and measurement for easy clinical implementation. However, these discretized features are potentially collinear, and predictive models with collinear features may not allow effective interpretation of changes in individual features. Beyond increasing the size of training data, the regularization scheme in the KRR algorithm is particularly useful to mitigate the problem of feature multicollinearity. Here we used the radial basis function kernel to span the function space, and the hyperparameters (the regularization weight α and the bandwidth of the radial basis function kernel γ) were optimized by using a 5-fold cross-validated grid search. Specifically, α was optimized over a grid from 10^-3 to 1 and γ over a grid from 10^-2 to 10^2, with grid points spaced one order of magnitude apart.
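As an illustration of the hyperparameter search described above, the following minimal sketch runs a 5-fold cross-validated grid search for an RBF-kernel KRR model in Scikit-learn. The training data are a hypothetical toy stand-in (not the study's OFs), and the one-decade grid spacing and the scoring metric are assumptions made for the example.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Hypothetical stand-in data: MLC and jaw openings (mm) on a 5 mm grid,
# with the jaw opening always >= the MLC opening.
mlc = rng.choice(np.arange(5, 45, 5), size=(200, 2)).astype(float)
jaw = mlc + rng.choice(np.arange(0, 30, 5), size=(200, 2))
X = np.hstack([mlc, jaw])  # columns: MLC X, MLC Y, jaw X, jaw Y

# Toy smooth response standing in for the OF; NOT a physical model.
eq_side = 2 * mlc[:, 0] * mlc[:, 1] / (mlc[:, 0] + mlc[:, 1])
y = 1.0 - 0.5 * np.exp(-eq_side / 10.0) + 5e-4 * jaw.mean(axis=1)

# Assumed grid: one point per decade over the ranges quoted in the text.
param_grid = {
    "alpha": np.logspace(-3, 0, 4),  # regularization weight: 1e-3 ... 1
    "gamma": np.logspace(-2, 2, 5),  # RBF bandwidth: 1e-2 ... 1e2
}
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid,
    cv=5,                                # 5-fold cross-validation
    scoring="neg_mean_absolute_error",   # assumed metric for the sketch
)
search.fit(X, y)
print(search.best_params_)
```

The selected `alpha` and `gamma` depend entirely on the toy data and carry no meaning for real OF datasets.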

2.D.2. Random forest regression

Random forest regression (RFR) is an ensemble learning method that operates by constructing a multitude of decision trees at the training stage and outputting the average of the predictions of the individual trees.(Breiman, 2001) Each tree is trained on a randomly sampled subset of the data, using a random subset of features (i.e., randomly selected MLC and jaw features). Because individual deep trees are more likely to learn highly irregular patterns and overfit to the training datasets, the averaging operation corrects for the individual tree’s habit of overfitting, thereby boosting the performance of the final model. By pooling the efforts of the individual random trees, RFR can deal with nonlinear relationships between high-dimensional data with high accuracy and robustness against overfitting. Like the KRR method, the tree-based approach is also robust against multicollinearity in the MLC and jaw features.

In practice, the RFR has several hyperparameters that can be tuned to improve the model performance during cross-validation. These hyperparameters include the number of trees in the forest, the maximum depth of the trees, the minimum number of samples required to split a node, and the minimum number of samples per leaf. We used 1000 trees (estimators) to yield the optimized regression results; the model performance did not change significantly when the number of trees was more than 1000, suggesting it was not overfitted to the training data. The minimum number of samples required to split an internal node was set to 2, and all features were used to determine the split.
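The settings listed above map onto Scikit-learn's `RandomForestRegressor` roughly as in the sketch below. The data are a hypothetical toy stand-in, and reading "all features were used to determine the split" as `max_features=None` is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical stand-in data (mm): MLC X, MLC Y, jaw X, jaw Y.
mlc = rng.choice(np.arange(5, 45, 5), size=(200, 2)).astype(float)
jaw = mlc + rng.choice(np.arange(0, 30, 5), size=(200, 2))
X = np.hstack([mlc, jaw])

# Toy response standing in for the OF; NOT a physical model.
eq_side = 2 * mlc[:, 0] * mlc[:, 1] / (mlc[:, 0] + mlc[:, 1])
y = 1.0 - 0.5 * np.exp(-eq_side / 10.0) + 5e-4 * jaw.mean(axis=1)

model = RandomForestRegressor(
    n_estimators=1000,      # number of trees, as stated in the text
    min_samples_split=2,    # minimum samples needed to split a node
    max_features=None,      # assumption: consider all features at each split
    random_state=0,         # fixed seed so the sketch is reproducible
)
model.fit(X, y)
pred = model.predict(X)     # average of the individual tree predictions
print(pred[:3])
```

Because each tree predicts a mean of training targets and the forest averages the trees, the predictions cannot leave the range of the training OFs. This is one reason tree ensembles interpolate well but do not extrapolate.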

2.D.3. Decision tree regression with AdaBoost

Like the RFR method, decision tree regression with AdaBoost (ADA) is also an ensemble learning method and makes predictions based on a number of decision trees. Unlike the RFR, which trains a set of individual trees in a parallel fashion by using randomly sampled data subsets, ADA trains each individual tree sequentially, and each tree learns from mistakes made by the previous tree.(Drucker, 1997) To learn from the previous mistake, the model increases the weight of mismodeled samples during the learning process, and the individual trees are trained using the same sampled data but different weights. Specifically, the ADA method is carried out in four steps. First, a weighted error rate that characterizes how well the individual tree performed is calculated for each trained decision tree based on the specific weight of each sample. Second, the weight of the individual tree in the ensemble is then obtained using the error rate; a higher weighted error rate yields a lower decision power of the tree, and vice versa. Third, the weight of each data sample is updated according to the predicted performance for the specific sample. Fourth, all the trees are updated by repeating the procedures starting from Step 1 in a sequential fashion. Once all updates are finalized, the final prediction is the weighted summation of the outcomes of all the individual trees, and the tree with higher weight will contribute more to the final decision.
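The first three steps can be sketched for a single boosting round as follows. The sketch uses the linear-loss form of AdaBoost.R2 (Drucker, 1997), which is an assumption about the variant; the function name and the example numbers are hypothetical. Note also that Scikit-learn's `AdaBoostRegressor` combines the trees with a weighted median rather than a weighted sum.

```python
import numpy as np

def adaboost_r2_round(y_true, y_pred, weights):
    """One AdaBoost.R2-style round with a linear loss.

    Returns the renormalized sample weights and the weight of the tree.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)

    abs_err = np.abs(y_pred - y_true)
    max_err = abs_err.max()
    if max_err == 0.0:
        raise ValueError("perfect fit: no further boosting round is needed")
    loss = abs_err / max_err                    # per-sample loss in [0, 1]

    # Step 1: weighted error rate of this tree.
    avg_loss = float(np.sum(weights * loss))
    if avg_loss >= 0.5:
        raise ValueError("tree too weak: boosting would stop here")

    # Step 2: lower weighted error -> smaller beta -> larger tree weight.
    beta = avg_loss / (1.0 - avg_loss)
    tree_weight = float(np.log(1.0 / beta))

    # Step 3: well-predicted samples are down-weighted, so poorly
    # predicted samples gain relative weight after renormalization.
    new_weights = weights * beta ** (1.0 - loss)
    return new_weights / new_weights.sum(), tree_weight

# Hypothetical OFs for four fields and one tree's predictions.
y_true = [0.60, 0.75, 0.85, 0.95]
y_pred = [0.61, 0.75, 0.80, 0.95]
w0 = np.full(4, 0.25)
w1, tree_w = adaboost_r2_round(y_true, y_pred, w0)
print(w1, tree_w)
```

In this example the third field has the largest prediction error, so it carries the largest weight when the next tree is trained.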

2.D.4. Gradient boosting regression

Gradient boosting regression (GBR) is another ensemble learning method that uses a boosting algorithm to reduce the prediction bias. Unlike the ADA method, which updates the weights of samples to improve the model performance, GBR directly learns from the prediction residual error.(Friedman, 2001) Specifically, after training an individual decision tree, its prediction errors are treated as a set of new training data to train a sequential tree, and this process is repeated until the number of trees we set to train is reached. The final new prediction is made by adding up the predictions of all trees. Because GBR is fairly robust against overfitting, we used 1000 individual decision trees to boost the model performance.
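The residual-fitting loop described above can be written out by hand as in the sketch below, which uses squared-error loss. The learning rate (shrinkage), the tree depth, the tree count, and the toy data are assumptions chosen for illustration; this is not the Scikit-learn `GradientBoostingRegressor` configuration used in the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_residual_boosting(X, y, n_trees=50, max_depth=3, learning_rate=0.1):
    """Fit trees sequentially, each one on the residual left by the previous ones."""
    base = float(np.mean(y))             # initial constant prediction
    residual = y - base
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)            # the residual is the new training target
        residual = residual - learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_residual_boosting(X, base, trees, learning_rate=0.1):
    """Final prediction: base value plus the (shrunken) sum over all trees."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

rng = np.random.default_rng(2)
# Hypothetical stand-in data (mm): MLC X, MLC Y, jaw X, jaw Y.
mlc = rng.choice(np.arange(5, 45, 5), size=(200, 2)).astype(float)
jaw = mlc + rng.choice(np.arange(0, 30, 5), size=(200, 2))
X = np.hstack([mlc, jaw])
eq_side = 2 * mlc[:, 0] * mlc[:, 1] / (mlc[:, 0] + mlc[:, 1])
y = 1.0 - 0.5 * np.exp(-eq_side / 10.0)  # toy response, NOT a physical model

base, trees = fit_residual_boosting(X, y)
mse_before = float(np.mean((y - base) ** 2))
mse_after = float(np.mean((y - predict_residual_boosting(X, base, trees)) ** 2))
print(mse_before, mse_after)
```

Each added tree can only lower the training error under this scheme, which is why held-out validation is still needed to choose the number of trees.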

2.D.5. Voting regression

Voting regression (VR) is an ensemble meta-estimator that combines different ML regressors (e.g., RFR and GBR) and averages their individual predictions to form a final prediction. All regressors in the VR use the same data samples, i.e., the whole datasets. By averaging the predictions from different regressors, the VR method is able to balance out the individual weakness of each regressor to yield a robust prediction. In this study, we used ADA and GBR as the base regressors. During implementation, the predicted values from the individual regressors can be weighted before averaging; we used uniform weights for ADA and GBR.
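A minimal sketch of such a uniformly weighted ensemble with Scikit-learn's `VotingRegressor` is shown below. The data are a hypothetical toy stand-in, the tree counts are reduced to keep the example fast, and the AdaBoost base tree is left at the library default, which is an assumption.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    VotingRegressor,
)

rng = np.random.default_rng(3)
# Hypothetical stand-in data (mm): MLC X, MLC Y, jaw X, jaw Y.
mlc = rng.choice(np.arange(5, 45, 5), size=(200, 2)).astype(float)
jaw = mlc + rng.choice(np.arange(0, 30, 5), size=(200, 2))
X = np.hstack([mlc, jaw])
eq_side = 2 * mlc[:, 0] * mlc[:, 1] / (mlc[:, 0] + mlc[:, 1])
y = 1.0 - 0.5 * np.exp(-eq_side / 10.0)  # toy response, NOT a physical model

# Fewer estimators than in the study, only to keep the sketch quick.
ada = AdaBoostRegressor(n_estimators=100, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=0)

# weights=None (the default) averages the two regressors uniformly.
vr = VotingRegressor(estimators=[("ada", ada), ("gbr", gbr)])
vr.fit(X, y)

individual = np.array([est.predict(X) for est in vr.estimators_])
combined = vr.predict(X)
print(np.allclose(combined, individual.mean(axis=0)))
```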

For all the different algorithms, the features included FS openings in the X and Y direction for both MLCs and jaws, differences in X and Y positions for the MLCs and jaws, and position of the MLC closed leaves.
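One possible way to assemble such a feature vector is sketched below. The function name, the sign convention for the differences (jaw opening minus MLC opening), and the 0/1 encoding of the closed-leaf position are all assumptions made for illustration; the study does not specify them.

```python
def build_features(mlc_x, mlc_y, jaw_x, jaw_y, closed_leaves_at_center):
    """Return one feature row for a field; all openings are in mm."""
    return [
        float(mlc_x),                 # MLC opening, X direction
        float(mlc_y),                 # MLC opening, Y direction
        float(jaw_x),                 # jaw opening, X direction
        float(jaw_y),                 # jaw opening, Y direction
        float(jaw_x - mlc_x),         # jaw-MLC difference in X (assumed sign)
        float(jaw_y - mlc_y),         # jaw-MLC difference in Y (assumed sign)
        1.0 if closed_leaves_at_center else 0.0,  # assumed encoding:
                                      # 1 = MLC center, 0 = MLC out position
    ]

# A 10x20 mm2 MLC field inside 40x40 mm2 jaws, closed leaves at center.
row = build_features(10, 20, 40, 40, closed_leaves_at_center=True)
print(row)
```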

During the validation, a random selection of 200 fields was extracted from the training data. The 200 fields were filtered to ensure that the positions of the MLCs and the jaws did not overlap; i.e., a 40×40 mm2 FS based on MLC and a 40×40 mm2 FS for jaws would not be included in the validation set. To avoid bias in prediction accuracy based on the selection of the validation set, this random selection and extraction of fields for validation purposes was repeated 100 times and the data were pooled for analysis. Furthermore, this exercise was repeated for different settings of each tunable model parameter that was expected to have a large influence on the prediction accuracy in the specific algorithms to maximize the prediction accuracy.
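The repeated random hold-out procedure can be sketched as a generator of train/validation index pairs. Two points are assumptions: that "overlap" means the jaw and MLC openings coincide in both directions, as in the 40×40 mm2 example, and that the filter is applied before sampling rather than after. The field list is a hypothetical toy grid, and the split sizes are reduced for the example.

```python
import numpy as np

def validation_splits(fields, n_val=200, n_repeats=100, seed=0):
    """Yield (train_idx, val_idx) pairs for repeated random hold-out.

    `fields` has columns: MLC X, MLC Y, jaw X, jaw Y (mm).
    """
    fields = np.asarray(fields, dtype=float)
    # Assumed filter: drop fields whose jaws coincide with the MLC opening.
    coincide = (fields[:, 0] == fields[:, 2]) & (fields[:, 1] == fields[:, 3])
    candidates = np.flatnonzero(~coincide)
    all_idx = np.arange(len(fields))
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        val_idx = rng.choice(candidates, size=n_val, replace=False)
        train_idx = np.setdiff1d(all_idx, val_idx)  # everything else trains
        yield train_idx, val_idx

# Hypothetical toy grid: square MLC fields inside square jaw settings.
fields = np.array(
    [(s, s, j, j) for s in range(5, 45, 5) for j in range(s, 105, 5)],
    dtype=float,
)
splits = list(validation_splits(fields, n_val=20, n_repeats=3))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

In a full workflow, a model would be refit on each `train_idx` and the errors on each `val_idx` pooled across repeats before analysis.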

The predictions were evaluated by calculating the percent relative error. For models based on the TPS calculated OFs, the percent relative error was calculated as:

$$\%\mathrm{RE} = 100\% \times \frac{X_{\mathrm{predicted}} - X_{\mathrm{TPS}}}{X_{\mathrm{TPS}}}$$

For models based on the measured OFs, the percent relative error was calculated as:

$$\%\mathrm{RE} = 100\% \times \frac{X_{\mathrm{predicted}} - X_{\mathrm{measured}}}{X_{\mathrm{measured}}}$$
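As a worked illustration, the percent relative error (predicted minus reference, divided by reference, times 100%) and a few summary statistics of the kind reported in the results tables can be computed as below. The function names and the three example OFs are hypothetical.

```python
import numpy as np

def percent_relative_error(predicted, reference):
    """Signed %RE of predicted OFs against reference (TPS or measured) OFs."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return 100.0 * (predicted - reference) / reference

def summarize_errors(predicted, reference):
    """Summary statistics of the signed and absolute %RE."""
    re = percent_relative_error(predicted, reference)
    abs_re = np.abs(re)
    return {
        "mean_re": float(re.mean()),
        "mean_abs_re": float(abs_re.mean()),
        "max_abs_re": float(abs_re.max()),
        "pct_above_2": float(100.0 * np.mean(abs_re > 2.0)),
        "pct_above_3": float(100.0 * np.mean(abs_re > 3.0)),
    }

# Hypothetical predicted and reference OFs for three fields.
predicted = [0.602, 0.750, 0.880]
reference = [0.600, 0.750, 0.850]
re = percent_relative_error(predicted, reference)
summary = summarize_errors(predicted, reference)
print(re, summary)
```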

The data were evaluated using box plots (Supplementary Figures 1–4), where the yellow line represents the median of the dataset and the lower and upper boundaries of the box represent the first (Q1) and third (Q3) quartiles, respectively. The whiskers extend from the box by 1.5 times the inter-quartile range (IQR), which is the difference between Q3 and Q1. Any data points beyond ±1.5 IQR are represented by circles.

An equivalent square FS (eqFS) was calculated as $\mathrm{eqFS} = \frac{2ab}{a+b}$, where a is the field width and b is the field length, based on FWHM.(Khan and Gibbons, 2014; IAEA TRS-483, 2017) Although this is a crude method, it allows data to be visualized in a common graph.
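For example, the equivalent square side is twice the product of the field width and length divided by their sum, so a 10×40 mm2 rectangular field maps to a 16 mm equivalent square. A short sketch (the function name is hypothetical):

```python
def equivalent_square_side(a_mm, b_mm):
    """Equivalent square side (mm) of an a x b rectangular field: 2ab / (a + b)."""
    return 2.0 * a_mm * b_mm / (a_mm + b_mm)

print(equivalent_square_side(10, 40))  # 10x40 mm2 field -> 16.0
print(equivalent_square_side(40, 40))  # a square field maps to itself -> 40.0
```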

2.E. Model testing

To test the robustness and the accuracy of the models, they were tested on data that had not previously been seen by the models. The models that were based on TPS calculated data were tested on non-symmetric FSs created in Eclipse in which the length of one side of the field was decreased in equidistant steps while the rest (jaws and measurement point) remained fixed. The dose was scored and compared with that predicted by the models.

The models that were trained on measured OFs from the TrueBeam STx with HDMLCs were further tested against a testing dataset acquired on a different linac with a different detector. That linac was a TrueBeam with Millennium 120 MLCs, which have a leaf width of 5 mm projected at the isocenter in the central 10 cm of the field. The machine was calibrated to deliver 1 cGy/MU at the depth of maximum dose (dmax) in water for a 10×10 cm2 field at an SSD of 100 cm, according to the TG-51 recommendations.(Almond et al., 1999) The 6-MV beam was used for all measurements. The EDGE diode detector (Sun Nuclear, Melbourne, FL) was used for measuring the OFs. It is known that the EDGE diode detector response is dependent on FS. For this reason, the OFs were corrected according to the procedure and values published in the International Atomic Energy Agency/American Association of Physicists in Medicine TRS-483 report.(IAEA TRS-483, 2017)

The testing data consisted of OFs from square FSs with side openings of 10, 20, 30, 40, 60, 80, and 100 mm and of rectangular FSs of 10×40, 20×40, 30×40, 40×30, 40×20, and 40×10 mm2. Multiple jaw settings were used with the stipulation that the jaw-defined FS was equal to or greater than the MLC-defined FS. The largest jaw setting was 10×10 cm2. Both MLC center and MLC out positions were included in the testing data.

3. RESULTS

3.1. Measured vs. TPS-calculated OFs

The 1570 different combinations of MLC and jaw that were simulated in the TPS were also delivered on the corresponding linac. Most of the fields investigated were <30×30 mm2 [Fig. 2(A)]. The mean %RD between measured and TPS-calculated OFs was 5.9% ± 5.4%, with a range of –7.1% to 18.1% [Fig. 2(B)]. A smaller FS was associated with a larger %RD [Fig. 2(C) and (D)]. Moreover, a larger %RD was observed when the MLCs closed centrally rather than behind a jaw [Fig. 2(C)]. This was observed for both AAA and AXB algorithms, with AXB showing a larger difference relative to the measured values as compared with the AAA algorithm for the MLC center position for the smallest field (5×5 mm2) [Fig. 2(E) and (F)]. The two algorithms showed no difference in calculated OF for the smallest field for the MLC out position.

Figure 2.

Comparison between measured and TPS-calculated output factors. (A) Histogram of the field sizes investigated and incorporated into the prediction models; (B) Histogram of the percent difference between the measured and TPS calculated output factors using the Eclipse AAA algorithm; (C) Percent relative difference (%RD) between measured and TPS-calculated output factors (OFs) when MLC were in the central and out position, respectively; (D) Percent difference between measurements and TPS-calculated OFs with fields defined by the MLCs and the jaws in a fixed 10×10 cm2 setting. (E) and (F) Percent difference between measured and TPS-calculated OFs of a 5×5mm2 MLC-defined field when the closed MLCs are in the center and out position, respectively.

3.2. Model optimization and validation

Models were trained either based on TPS calculated OFs or on measured OFs, and the performance was analyzed separately. The same workflow was used independent of training dataset.

Figure 3 shows the performance of the five different algorithms (KRR, RFR, ADA, GBR, and VR) used for modeling small field OFs after model optimization. The datasets acquired from the TPS and from the measurements were individually used for model building. For both ADA and RFR, 1000 estimators were used (Suppl. Fig. 1). For GBR, 1000 estimators were used with a maximum depth of the individual regression estimators of 5 (Suppl. Figs. 2 and 3). The rest of the parameters were set to the default values as specified by the Scikit-learn implementation.(Pedregosa et al., 2011) Once the individual models had been optimized, the combinations of the algorithms were evaluated by the VR algorithm. The optimal combination for minimizing the prediction error was found to be the combination of ADA and GBR (Suppl. Fig. 4).

Figure 3.

Performance of (A) kernel ridge regression, (B) random forest regression, (C) decision-tree regression with AdaBoost, (D) gradient boosting regression, and (E) voting regression for output factor (OF) predictions. Models were generated from both measured and TPS-calculated data. Voting regression was performed with decision-tree regression with AdaBoost and gradient boosting regression.

Of the individual algorithms, the GBR showed the best performance in the validation run (Fig. 3 and Table 1). The mean and max absolute %RE were 0.45% ± 0.43% and 3.62% for the measured dataset model and 0.09% ± 0.12% and 2.14% for the TPS dataset model. The percent of predictions with a %RE >2% was 1.5% and that for a %RE >3% was 0.1% for the measured dataset model. For the TPS dataset model, the corresponding values were 0.015% (>2%) and 0% (>3%). Similar performance metrics were found for the ADA algorithm. No significant improvement in prediction accuracy was found when the two algorithms were combined by using the VR algorithm (Suppl. Fig. 4). The KRR-based models showed a disproportionate number of large prediction errors related to the smallest field (5×5 mm2) relative to the other models (Suppl. Fig. 5). For the other models, FS had little or no effect on prediction accuracy, independent of which dataset was used for model training.

Table 1.

Model validation for output factor prediction

Algorithm / dataset            Mean %RE        Mean %RE (Abs)*   Max %RE (Abs)*   Mean Δ (Abs)**   Max Δ (Abs)**   >2% %RE (Abs)*   >3% %RE (Abs)*

Kernel Ridge Regression
  Measured data                0.09 ± 2.85     2.21 ± 1.80       32.26            1.56 ± 1.25      23.67           48.3%            23.5%
  TPS calculated data          0.23 ± 2.82     1.91 ± 2.09       41.81            1.25 ± 1.27      26.72           33.2%            13.9%

Random Forest Regression
  Measured data                −0.05 ± 0.69    0.46 ± 0.51       5.38             0.33 ± 0.38      4.68            1.7%             0.8%
  TPS calculated data          −0.04 ± 0.24    0.15 ± 0.19       3.08             0.10 ± 0.13      1.83            0.1%             0.0%

DTR with AdaBoost
  Measured data                −0.04 ± 0.58    0.39 ± 0.43       4.08             0.28 ± 0.32      3.73            0.9%             0.4%
  TPS calculated data          −0.04 ± 0.21    0.13 ± 0.17       2.81             0.09 ± 0.12      1.91            0.1%             0.0%

Gradient Boosting Regression
  Measured data                −0.01 ± 0.62    0.45 ± 0.43       3.62             0.32 ± 0.32      3.00            1.5%             0.1%
  TPS calculated data          −0.01 ± 0.16    0.09 ± 0.12       2.14             0.06 ± 0.07      1.10            0.0%             0.0%

Voting Regression (ADA + GBR)
  Measured data                −0.03 ± 0.55    0.38 ± 0.39       3.62             0.27 ± 0.29      3.00            1.0%             0.3%
  TPS calculated data          −0.03 ± 0.16    0.09 ± 0.13       2.64             0.06 ± 0.09      1.40            0.0%             0.0%

* Absolute value

** Absolute percent point difference in output between predicted and measured/TPS calculated data

DTR = Decision tree regression

3.3. Model testing

The models were tested in two ways. First, the models were tested on datasets that had not previously been seen by the models, including non-symmetric FS with either MLC center position or MLC out position (Fig. 4). Again, the KRR-based models showed the worst performance for predicting OFs. Only small differences were found between the other algorithms, indicating that these algorithms were robust for situations involving non-symmetric rectangular fields. This testing was only performed using the models based on the TPS generated OFs due to difficulties in performing accurate measurements in these situations.

Figure 4.

Models were tested on types of cases not previously seen during the training of the models. Four cases of asymmetric fields were included in the model testing: (A) MLC out position with varying field openings towards the X2 jaw side of the field; (B) MLC out position with varying openings towards the Y1 jaw side of the field; (C) MLC center position with varying openings towards the X2 jaw side of the field; and (D) MLC center position with varying openings towards the Y1 jaw side of the field. All fields were defined by the MLCs and the jaws were set at 50 ×50 mm2. The dots in the illustrations indicate the position where the dose was scored, and the dashed lines indicate the varying positions of the MLCs.

The models were also tested against a testing dataset acquired on a different linac with a different detector (Sun Nuclear EDGE detector) (Table 2). The FWHM of the fields delivered agreed with the nominal FSs defined by the MLCs, as determined through Gafchromic film measurements (data not shown). Field output correction factors were applied according to the method presented in TRS-483.(Palmans et al., 2018) The analysis of the OFs confirmed the results from the validation analysis. The VR model showed the lowest mean absolute %RE and the lowest mean percent point difference between predicted and measured data (0.42% ± 0.33% and 0.30 ± 0.27 percentage points, respectively). However, only small differences were found between the RFR, ADA, GBR, and VR models (Table 2).

Table 2.

Model testing for output factor prediction

Algorithm / dataset            Mean %RE        Mean %RE (Abs)*   Max %RE (Abs)*   Mean Δ (Abs)**   Max Δ (Abs)**   >2% %RE (Abs)*   >3% %RE (Abs)*

Kernel Ridge Regression
  Testing data                 0.06 ± 6.17     4.96 ± 3.64       18.56            4.13 ± 3.24      17.11           75.3%            61.9%

Random Forest Regression
  Testing data                 −0.10 ± 0.65    0.49 ± 0.43       1.79             0.40 ± 0.35      1.52            0.0%             0.0%

DTR with AdaBoost
  Testing data                 −0.09 ± 0.76    0.58 ± 0.49       2.05             0.48 ± 0.42      1.80            1.0%             0.0%

Gradient Boosting Regression
  Testing data                 −0.01 ± 0.81    0.63 ± 0.51       2.30             0.51 ± 0.40      1.95            2.1%             0.0%

Voting Regression (ADA + GBR)
  Testing data                 −0.42 ± 0.33    0.42 ± 0.33       2.89             0.30 ± 0.27      2.37            1.0%             0.0%

* Absolute value

** Absolute percent point difference in output between predicted and measured data

DTR = Decision tree regression

4. DISCUSSION

In this study, multiple tree-based ML algorithms were employed to evaluate the use of an AI approach for OF calculations. For easy clinical implementation, the input features used discretized values with fixed minimum intervals during the OF measurements. Because tree-based algorithms partition the feature space through conditional control statements, they are particularly suitable for predicting output factors from such discretized features. In addition to the tree-based algorithms, we also evaluated kernel ridge regression (KRR), which aims to learn a function in the space spanned by a set of kernels by optimizing an objective function consisting of a regularization penalty and a data fidelity term. Hence, the KRR algorithm can be regarded as an advanced interpolation method compared to the classical interpolation methods, such as linear interpolation and spline interpolation, and was thus expected to be better suited than those methods for the output factor prediction problem.

Among the five different ML algorithms used, four showed comparable results with each other (RFR, ADA, GBR, and VR). The KRR algorithm was associated with the highest uncertainty in OF prediction and was not considered for further evaluation. The 1570 different combinations of MLCs and jaws that were simulated in the TPS were also delivered on the corresponding linac. For most of the combinations, only one measurement was performed. About 100 fields were delivered more than once to verify consistency of the measured OF. No significant deviation in OF was observed for these fields. Moreover, any erroneous measurement would have only a small effect on the overall fit generated by the models because of the high number of measurements taken and the small change in parameters between subsequent fields.

The measured OFs were acquired with the Exradin W1 detector, a plastic scintillator detector chosen because it was the only detector described in TRS-483 that was both recommended for FSs down to 5×5 mm2 and did not need a field output correction factor to be applied.(Palmans et al., 2018; IAEA TRS-483, 2017) The latter also served the purpose of increasing the number of FSs that could be investigated accurately, as the field output correction factor is based on the equivalent square FS, which can be accurately determined only when the ratio of the sides of the beam is between 0.7 and 1.4.(IAEA TRS-483, 2017) Other advantages of the Exradin W1 detector are water equivalence, linear response to dose and dose rate, energy independence, and temperature independence, making this detector ideal for small field OF measurements.(Beaulieu and Beddar, 2016; Beaulieu et al., 2013; Beddar et al., 1992a; Beddar et al., 1992b)

For acquiring the testing data, we used the Sun Nuclear EDGE detector. This detector was used to avoid any bias from the choice of detector in the data sets taken. It is also a TRS-483-recommended detector for measuring equivalent square FSs down to 8×8 mm2.(IAEA TRS-483, 2017) The FWHM was found to agree with the nominal FS, as has also been reported previously for 6 MV beams down to 10×10 mm2 FS.(Benmakhlouf et al., 2014) The acquired testing datasets included FSs that were outside the beam side ratio interval of 0.7 to 1.4 recommended by TRS-483 for accurately determining the equivalent square FS with the method reported therein. Therefore, we would expect a larger uncertainty in the applied field output correction factor for these fields.
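The side-ratio restriction can be checked programmatically before a field output correction factor is applied. The sketch below assumes the geometric-mean form of the equivalent square small field size, S = sqrt(A·B); this form is an assumption here, as the article does not spell out the formula, and the function name and example fields are hypothetical.

```python
import math

# Side-ratio interval quoted in the text (treated as inclusive here).
RATIO_MIN, RATIO_MAX = 0.7, 1.4

def equivalent_square_mm(a_mm, b_mm):
    """Return (S, within_interval) for a rectangular field a x b.

    S is the geometric mean sqrt(a * b), an assumed form of the
    equivalent square small field size. within_interval reports
    whether a / b lies in the interval where that estimate is
    considered accurate.
    """
    if a_mm <= 0 or b_mm <= 0:
        raise ValueError("field sides must be positive")
    ratio = a_mm / b_mm
    return math.sqrt(a_mm * b_mm), RATIO_MIN <= ratio <= RATIO_MAX

# Hypothetical example fields (mm), not fields from this study.
s_in, in_range = equivalent_square_mm(10.0, 12.0)   # ratio ~0.83
s_out, out_range = equivalent_square_mm(5.0, 20.0)  # ratio 0.25
```

Fields flagged as outside the interval, such as the second example, are those for which a larger uncertainty in the correction factor would be expected.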

The OFs measured with the W1 detector were found to deviate significantly from the TPS-calculated OFs. The measured OFs were, however, comparable with previously published values.(Kerns et al., 2016; Akino et al., 2018) The magnitude of the difference between TPS-calculated and measured OFs depended on the MLC FS, the jaw FS, and the MLC position. For 5×5 mm2 fields, differences of >10% were found when the MLCs were in the out position, and differences of >15% were found when the MLCs were in the center position. This large deviation is likely caused by the simplifications in dose calculation made in modern TPSs. The current TPS in Eclipse has only a few tuning parameters that may affect small field outputs (e.g., effective focal spot size, dosimetric leaf gap, and MLC transmission). In addition, the requirements for TPS commissioning typically involve measurements only for medium to large fields (down to 3×3 cm2). Therefore, even though the TPS model works well for fields from 5×5 cm2 up to 30×30 cm2, the extrapolation to smaller fields (2×2 cm2 and smaller) may be inaccurate, as is evident from the data presented here. Furthermore, simplifications in lateral electron scattering, together with the extrapolation of beam data from large to small FSs (measurements being allowed only down to a 2×2 cm2 field size), will cause errors in the calculated dose.(Sharma et al., 2017; Andreo, 2018; Lechner et al., 2018; Seuntjens et al., 2014) Combined, these errors manifest as overestimates of the OFs by the TPS.(Lechner et al., 2020; Lechner et al., 2018; Stock et al., 2005; Wolfs et al., 2018; McNiven et al., 2010)

Considering the large errors introduced by the TPS-calculated OFs when compared to the measured OFs, the two data sets could not be combined to increase the number of data points in the training set. Instead, we generated separate models using the measured and the TPS-calculated datasets. The rationale for generating models based on both data sets was to test the ML models on separate groups of testing data. The TPS-based models were thus generated only for comparative purposes. Any proposed implementation of predictive models would be for models based on measured data only.

Concerning the accuracy of predictions of OFs, the models based on the RFR, ADA, GBR, and VR algorithms all performed well, with a mean absolute %RE of <0.46%. These models also performed well when tested on data that were outside of the scope of data acquired for training purposes, e.g., non-symmetric FS with either MLC center position or MLC out position. Only small differences were found between the tree-based algorithms, indicating that these algorithms are robust for situations involving non-symmetric rectangular fields which the models had not been trained on.
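For reference, the error metrics quoted above can be computed as in the following sketch. The definition of %RE as 100 × (predicted − measured) / measured is an assumption consistent with the table footnotes, and the arrays are made-up numbers, not data from this study.

```python
import numpy as np

def relative_error_stats(predicted, measured):
    """Summarize percent relative error between predicted and measured OFs."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    re_pct = 100.0 * (predicted - measured) / measured  # signed %RE
    abs_re = np.abs(re_pct)                             # absolute %RE
    return {
        "mean_re": re_pct.mean(),
        "mean_abs_re": abs_re.mean(),
        "std_abs_re": abs_re.std(ddof=1),
        "max_abs_re": abs_re.max(),
    }

# Made-up values for illustration only.
measured = [0.650, 0.800, 0.900, 1.000]
predicted = [0.6565, 0.796, 0.900, 1.005]
stats = relative_error_stats(predicted, measured)
```

Reporting the maximum alongside the mean matters here because models with similar mean absolute %RE can still differ in their worst-case prediction.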

The overall largest difference between the models was in the maximum error of the predictions, in which the RFR-based model showed the worst performance compared with the GBR- and VR-based models (5.38% vs 3.62%, respectively) when trained on measured data. The best overall performance characteristics were found with the VR algorithm. The VR algorithm is a meta-estimator that combines several regressors and in the end averages the individual predictions to form a final prediction.(Pedregosa et al., 2011) In this study, only the ADA and GBR were combined in the VR algorithm, based on the testing performance shown in Supplementary Figure 4. Combining different algorithms into a single prediction allows us to benefit from the strengths of the individual algorithms while minimizing the fall-out from non-ideal performance in specific, extreme situations. The negative aspect of this type of pooled prediction is reduced accuracy in specific situations where one algorithm performs better than the other.
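As an illustration of this averaging, a minimal scikit-learn sketch is given below. The synthetic training data, the two-column feature layout (a hypothetical MLC field size and jaw field size), and the default hyperparameters are placeholders chosen for the example and are not those used in this study.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    VotingRegressor,
)

rng = np.random.default_rng(0)

# Synthetic stand-in for a training set (not measured OFs):
# columns are a hypothetical MLC field size and jaw field size in mm.
X = rng.uniform(5.0, 100.0, size=(200, 2))
y = 1.0 - 0.4 * np.exp(-X[:, 0] / 15.0) + 0.0005 * X[:, 1]

# Pool an AdaBoost regressor and a gradient boosting regressor.
vr = VotingRegressor(
    estimators=[
        ("ada", AdaBoostRegressor(random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ]
)
vr.fit(X, y)

X_new = np.array([[10.0, 30.0], [50.0, 60.0]])
pooled = vr.predict(X_new)

# Without weights, the pooled prediction is the plain mean of the
# fitted members' individual predictions.
members = np.array([est.predict(X_new) for est in vr.estimators_])
```

Comparing `pooled` with the rows of `members` shows the trade-off directly: the average dampens an outlying prediction from one member, but it also pulls the result away from whichever member happens to be more accurate for a given field.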

A limitation of the implementation presented here is that it is restricted to OF predictions for square and rectangular FSs. Highly complex field shapes cannot be predicted directly, and those fields would need to be converted to an equivalent square FS. However, this conversion would itself introduce added uncertainty, especially for highly complex field shapes. In these scenarios, ML approaches using deep neural networks are one potential strategy, as these models can most likely learn the complex nonlinear mapping between the fields and the output factors. Specifically, the neural networks extract feature maps from the fields, and the complex field shapes are encoded into these feature maps. The network could then be trained using the output factors and the feature maps, during which the mapping between the output factors and the field shapes would be encoded into the weights of the network. With this process, the output factors of highly complex field shapes could be predicted using ML approaches. Another limitation is in the dataset used for training. The relevant models for clinical implementation were trained on measurement data from a Varian TrueBeam STx machine and would therefore be specific to this model. However, because of the great similarity in the manufacturing procedure for the TrueBeam platform, these linacs are matched very closely, and our models were also able to accurately predict the OFs for a linac based on the general TrueBeam platform.

5. CONCLUSIONS

We propose a fast and accurate ML-based method to generate small field OFs for routine radiation therapy. Tree-based models yielded superior results compared to methods based on advanced interpolation strategies. The VR algorithm, when combining the ADA and GBR, showed the best overall performance. With the method presented here, small field OFs can be accurately generated by using previously acquired OFs at different linac settings, which negates the need for time-consuming and complicated measurements without affecting the accuracy of the data. The predictions may serve as input for dose calculations to overcome the limitations of modern TPSs in calculating dose for small fields, or as a secondary verification tool for use in QA processes.

Supplementary Material

Acknowledgements

We thank Christine F. Wogan, MS, ELS, of MD Anderson’s Division of Radiation Oncology, for editorial contributions to several drafts of this article. This work was partially supported by NIH/NCI (1R01CA223667, 1R01CA227713, and 1R01CA256890), a Faculty Research Award from Google Inc, and by Cancer Center Support Grant P30 CA016672 from the National Cancer Institute of the National Institutes of Health, to The University of Texas MD Anderson Cancer Center.

References

  1. Akino Y, Mizuno H, Tanaka Y, Isono M, Masai N and Yamamoto T 2018. Inter-institutional variability of small-field-dosimetry beams among HD120 multileaf collimators: a multi-institutional analysis Physics in Medicine & Biology 63 205018
  2. Almond PR, Biggs PJ, Coursey BM, Hanson WF, Huq MS, Nath R and Rogers DW 1999. AAPM’s TG-51 protocol for clinical reference dosimetry of high-energy photon and electron beams Medical physics 26 1847–70
  3. Andreo P 2018. The physics of small megavoltage photon beam dosimetry Radiotherapy and Oncology 126 205–13
  4. Beaulieu L and Beddar S 2016. Review of plastic and liquid scintillation dosimetry for photon, electron, and proton therapy Physics in Medicine & Biology 61 R305
  5. Beaulieu L, Goulet M, Archambault L and Beddar S 2013. Journal of Physics: Conference Series vol 444 (IOP Publishing) p 012013
  6. Beddar A, Mackie T and Attix F 1992a. Water-equivalent plastic scintillation detectors for high-energy beam dosimetry: I. Physical characteristics and theoretical considerations Physics in Medicine & Biology 37 1883
  7. Beddar AS, Mackie T and Attix F 1992b. Water-equivalent plastic scintillation detectors for high-energy beam dosimetry: II. Properties and measurements Physics in Medicine & Biology 37 1901
  8. Benmakhlouf H, Sempau J and Andreo P 2014. Output correction factors for nine small field detectors in 6 MV radiation therapy photon beams: a PENELOPE Monte Carlo study Medical physics 41 041711
  9. Bogdanich W and Ruiz RR 2010. Radiation errors reported in Missouri New York Times 24
  10. Breiman L 2001. Random forests Machine learning 45 5–32
  11. Carlson JN, Park JM, Park S-Y, Park JI, Choi Y and Ye S-J 2016. A machine learning approach to the accurate prediction of multi-leaf collimator positional errors Physics in Medicine & Biology 61 2514
  12. Chan MF, Witztum A and Valdes G 2020. Integration of AI and Machine Learning in Radiotherapy QA Front Artif Intell 3 577620
  13. Chang Z, Wang Z, Wu QJ, Yan H, Bowsher J, Zhang J and Yin FF 2008. Dosimetric characteristics of Novalis Tx system with high definition multileaf collimator Medical physics 35 4460–3
  14. Das IJ, Ding GX and Ahnesjö A 2008. Small fields: nonequilibrium radiation dosimetry Medical physics 35 206–15
  15. Derreumaux S, Etard C, Huet C, Trompier F, Clairand I, Bottollier-Depois J-F, Aubert B and Gourmelon P 2008. Lessons from recent accidents in radiation therapy in France Radiation protection dosimetry 131 130–5
  16. Ding GX, Duggan DM and Coffey CW 2007a. Comment on “Testing of the analytical anisotropic algorithm for photon dose calculation” [Med. Phys. 33, 4130–4148 (2006)] Medical physics 34 3414
  17. Ding GX, Duggan DM, Lu B, Hallahan DE, Cmelak A, Malcolm A, Newton J, Deeley M and Coffey CW 2007b. Impact of inhomogeneity corrections on dose coverage in the treatment of lung cancer using stereotactic body radiation therapy Medical physics 34 2985–94
  18. Drucker H 1997. Improving regressors using boosting techniques ICML 97 107–15
  19. Fan J, Xing L, Ma M, Hu W and Yang Y 2020. Verification of the machine delivery parameters of a treatment plan via deep learning Phys Med Biol 65 195007
  20. Followill DS, Kry SF, Qin L, Leif J, Molineu A, Alvarez P, Aguirre JF and Ibbott GS 2012. The Radiological Physics Center’s standard dataset for small field size output factors Journal of applied clinical medical physics 13 282–9
  21. Friedman JH 2001. Greedy function approximation: a gradient boosting machine Annals of statistics 1189–232
  22. Glide-Hurst C, Bellon M, Foster R, Altunbas C, Speiser M, Altman M, Westerly D, Wen N, Zhao B and Miften M 2013. Commissioning of the Varian TrueBeam linear accelerator: a multi-institutional study Medical physics 40 031719
  23. Huq MS, Fraass BA, Dunscombe PB, Gibbons JP Jr., Ibbott GS, Mundt AJ, Mutic S, Palta JR, Rath F, Thomadsen BR, Williamson JF and Yorke ED 2016. The report of Task Group 100 of the AAPM: Application of risk analysis methods to radiation therapy quality management Medical Physics 43 4209–62
  24. IAEA TRS-483 2017. Dosimetry of Small Static Fields Used in External Beam Radiotherapy, Technical Report Series No. 483 (Vienna: International Atomic Energy Agency)
  25. Kerns JR, Followill DS, Lowenstein J, Molineu A, Alvarez P, Taylor PA, Stingo FC and Kry SF 2016. Technical report: reference photon dosimetry data for Varian accelerators based on IROC-Houston site visit data Medical physics 43 2374–86
  26. Khan FM and Gibbons JP 2014. Khan’s the physics of radiation therapy (Lippincott Williams & Wilkins)
  27. Lechner W, Primeßnig A, Nenoff L, Wesolowska P, Izewska J and Georg D 2020. The influence of errors in small field dosimetry on the dosimetric accuracy of treatment plans Acta Oncologica 59 511–7
  28. Lechner W, Wesolowska P, Azangwe G, Arib M, Alves VGL, Suming L, Ekendahl D, Bulski W, Samper JLA and Vinatha SP 2018. A multinational audit of small field output factors calculated by treatment planning systems used in radiotherapy Physics and Imaging in Radiation Oncology 5 58–63
  29. Li XA, Soubra M, Szanto J and Gerig L 1995. Lateral electron equilibrium and electron contamination in measurements of head-scatter factors using miniphantoms and brass caps Medical physics 22 1167–70
  30. McNiven AL, Sharpe MB and Purdie TG 2010. A new metric for assessing IMRT modulation complexity and plan deliverability Medical physics 37 505–15
  31. Murphy KP 2012. Machine learning: a probabilistic perspective (MIT Press)
  32. Nomura Y, Wang J, Shirato H, Shimizu S and Xing L 2020. Fast spot-scanning proton dose calculation method with uncertainty quantification using a three-dimensional convolutional neural network Phys Med Biol 65 215007
  33. Palmans H, Andreo P, Huq MS, Seuntjens J, Christaki KE and Meghzifene A 2018. Dosimetry of small static fields used in external photon beam radiotherapy: Summary of TRS-483, the IAEA–AAPM international Code of Practice for reference and relative dose determination Med. Phys 45 e1123–e45
  34. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R and Dubourg V 2011. Scikit-learn: Machine learning in Python the Journal of machine Learning research 12 2825–30
  35. Sánchez-Doblado F, Hartmann G, Pena J, Roselló J, Russiello G and Gonzalez-Castaño D 2007. A new method for output factor determination in MLC shaped narrow beams Physica medica 23 58–66
  36. Seuntjens J, Lartigau E, Cora S, Ding G, Goetsch S and Nuyttens J 2014. ICRU report 91. Prescribing, recording, and reporting of stereotactic treatments with small photon beams J ICRU 14 1–160
  37. Sharma DS, Chaudhary RK, Sharma SD, Pilakkal S, Rasal SK, Sawant MB and Phurailatpam RD 2017. Experimental determination of stereotactic cone size and detector specific output correction factor The British Journal of Radiology 90
  38. Stock M, Kroupa B and Georg D 2005. Interpretation and evaluation of the γ index and the γ index angle for the verification of IMRT hybrid plans Physics in Medicine & Biology 50 399
  39. Sun B, Lam D, Yang D, Grantham K, Zhang T, Mutic S and Zhao T 2018. A machine learning approach to the accurate prediction of monitor units for a compact proton machine Medical physics 45 2243–51
  40. Timmerman R and Xing L 2009. Image Guided and Adaptive Radiation Therapy (Baltimore: Lippincott Williams & Wilkins)
  41. Valdes G, Chan MF, Lim SB, Scheuermann R, Deasy JO and Solberg TD 2017. IMRT QA using machine learning: a multi-institutional validation Journal of applied clinical medical physics 18 279–84
  42. Wolfs CJ, Swinnen AC, Nijsten SM and Verhaegen F 2018. Should dose from small fields be limited for dose verification procedures?: uncertainty versus small field dose in VMAT treatments Physics in Medicine & Biology 63 20NT01
  43. Zhao W, Patil I, Han B, Yang Y, Xing L and Schüler E 2020. Beam data modeling of linear accelerators (linacs) through machine learning and its potential applications in fast and robust linac commissioning and quality assurance Radiotherapy and Oncology 153 122–9
