Author manuscript; available in PMC: 2022 Aug 18.
Published in final edited form as: Radiother Oncol. 2021 Jun 21;161:230–240. doi: 10.1016/j.radonc.2021.06.024

Commissioning and clinical implementation of an Autoencoder based Classification-Regression model for VMAT patient-specific QA in a multi-institution scenario

Ruijie Yang 1,#, Xueying Yang 2,#, Le Wang 3,4,#, Dingjie Li 5, Yuexin Guo 6, Ying Li 7, Yumin Guan 8, Xiangyang Wu 9, Shouping Xu 10, Shuming Zhang 1,11, Maria Chan 12, Lisheng Geng 2,13,*, Jing Sui 3,4,*
PMCID: PMC9388201  NIHMSID: NIHMS1826181  PMID: 34166717

Abstract

Background and purpose:

To commission and implement an Autoencoder based Classification-Regression (ACLR) model for VMAT patient-specific quality assurance (PSQA) in a multi-institution scenario.

Materials and methods:

1835 VMAT plans from seven institutions were collected for the ACLR model commissioning and multi-institutional validation. We established three scenarios to validate the gamma passing rates (GPRs) prediction and classification accuracy with the ACLR model for different delivery equipment, QA devices, and treatment planning systems (TPS). The prediction performance of the ACLR model was evaluated using mean absolute error (MAE) and root mean square error (RMSE). The classification performance was evaluated using sensitivity and specificity. An independent end-to-end test (E2E) and routine QA of the ACLR model were performed to validate the clinical use of the model.

Results:

For multi-institution validations, the MAEs were 1.30–2.80% and 2.42–4.60% at 3%/3mm and 3%/2mm, respectively, and the RMSEs were 1.55–2.98% and 2.83–4.95% at 3%/3mm and 3%/2mm, respectively, across different delivery equipment, QA devices, and TPS, while the sensitivity was 90% and specificity was 70.1% at 3%/2mm. For the E2E, the deviations between the predicted and measured results were within 3%, and the model passed the consistency check for clinical implementation. The predicted results of the model were identical across repeated daily QA runs, and the deviations between the repeated monthly measured GPRs were all within 2%.

Conclusions:

The performance of the ACLR model in multi-institution scenarios was validated on a large scale. Routine QA of the ACLR model was established and the model could be used for VMAT PSQA clinically.

Keywords: Machine learning, VMAT patient-specific QA, Multi-institution validation, Commissioning, Clinical implementation

1. Introduction

Volumetric modulated arc therapy (VMAT) has better delivery efficiency than intensity-modulated radiation therapy (IMRT) [1, 2]. However, the planning and delivery complexity of VMAT requires the treatment planning system (TPS) and delivery equipment to be as accurate as possible. To assure treatment efficacy and patient safety, pretreatment gamma analysis is the most widely used method of patient-specific QA (PSQA) for both IMRT and VMAT [3, 4].
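To make the gamma analysis referenced above concrete, the sketch below computes a global gamma passing rate for a 1-D dose profile. This is only an illustration under simplifying assumptions (1-D profiles, no interpolation, global normalization); clinical tools operate on 2-D/3-D distributions, and the function name is hypothetical.

```python
import numpy as np

def gamma_passing_rate(ref_dose, eval_dose, positions, dd=0.03, dta=2.0, threshold=0.10):
    """1-D global gamma analysis sketch: dd is the fractional dose-difference
    criterion, dta the distance-to-agreement in mm, threshold the low-dose cutoff."""
    ref_dose = np.asarray(ref_dose, dtype=float)
    eval_dose = np.asarray(eval_dose, dtype=float)
    positions = np.asarray(positions, dtype=float)
    max_dose = ref_dose.max()
    gammas = []
    # Evaluate gamma only at reference points above the low-dose threshold.
    for i in np.where(ref_dose >= threshold * max_dose)[0]:
        dose_term = (eval_dose - ref_dose[i]) / (dd * max_dose)  # global normalization
        dist_term = (positions - positions[i]) / dta
        gammas.append(np.sqrt(dose_term**2 + dist_term**2).min())
    return 100.0 * np.mean(np.asarray(gammas) <= 1.0)  # passing rate in %
```

A 3%/2mm criterion corresponds to `dd=0.03, dta=2.0`; identical distributions yield a 100% passing rate.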

Traditional PSQA measures dose with a detector in a phantom, which is time-consuming. Machine learning (ML) and deep learning (DL) approaches to PSQA have been developed to improve efficiency [5, 6]. Valdes et al. [7] predicted IMRT gamma passing rates (GPRs) from 78 plan complexity metrics using Poisson regression with Lasso regularization (PL); the predicted GPRs were within 3% at 3%/3mm. They also assessed the algorithm's accuracy across two institutions and found that only about 86.33% of plans had a prediction error within 3.5% [8]. Ono et al. [9] trained a convolutional neural network (CNN) model on VMAT plan complexity metrics and achieved much better results. Lam et al. [10] demonstrated that GPRs could be predicted accurately from 31 plan complexity and machine metrics using AdaBoost, Random Forest (RF), and XGBoost algorithms. Granville et al. [11] also combined features describing Linac performance with treatment plan complexity characteristics to train ML models, improving prediction accuracy for VMAT PSQA.

Instead of predicting GPRs from plan complexity characteristics, fluence maps extracted from treatment plans have also been used for GPR prediction. Interian et al. [12], Tomori et al. [13], Mahdavi et al. [14], and Kimura et al. [15] proposed CNN models to predict IMRT or VMAT QA results. Wootton et al. [16] and Nyflot et al. [17] developed DL models to detect and distinguish specific errors via radiomics analysis of gamma image characteristics.

The potential of ML/DL models for GPR prediction has been studied [6]. To decide whether plans can be accurately delivered clinically, we previously investigated the classification performance of ML models at different gamma criteria and action limits, classifying VMAT plans as “pass” or “fail” for clinical implementation [18]. We proposed a PL model to predict GPRs and an RF model to classify plans, the first study to classify plans for VMAT QA. We further developed an Autoencoder based Classification-Regression (ACLR) model that accomplishes both tasks in a single model and validated it for VMAT PSQA in a single institution [19]. The ACLR model predicted better than the earlier PL model [18]: its mean absolute error (MAE) was 1.76% at 3%/3mm and 2.60% at 3%/2mm. For classification, the ACLR model's sensitivity was 100% and specificity 83% at 3%/3mm, and its sensitivity was better than that of the PL model.

The ACLR model has shown good performance in a single institution; however, its performance in a multi-institution scenario has not been demonstrated, which is required before clinical application. Before applying the ACLR model clinically, we must first consider how inter-institutional variability affects the performance of DL models, and how a DL model trained on data from one institution performs in another. It is therefore necessary to verify the accuracy of the ACLR model in multiple scenarios with different delivery equipment, QA devices, and TPS across institutions. In addition, to ensure safe clinical implementation, a rigorous commissioning and QA program for AI-based applications should be established [20].

The aims of this study are to validate the accuracy of the ACLR model using a large, heterogeneous dataset varying in delivery equipment, QA devices, and TPS across multiple institutions, and to investigate the commissioning, clinical implementation, and routine QA of the ACLR model, promoting the widespread application of AI-based PSQA. To our knowledge, this is the first study on commissioning and clinical implementation of an AI-based IMRT QA approach.

2. Materials and methods

2.1. Overview of clinical implementation of an Autoencoder based Classification-Regression model for VMAT patient-specific QA in a multi-institution scenario

The process for introducing the ACLR model into clinical practice started with commissioning, followed by clinical implementation and QA. As shown in Figure 1, commissioning included model training/validation and testing, an independent end-to-end test (E2E), and routine QA. Training/validation and testing were performed as described in Section 2.2.2 to tune and test the ACLR model periodically. The E2E was designed to evaluate the final performance of the ACLR model. If the analysis during training/validation and testing revealed potential risks, the model was re-trained; the model was also updated if potential risks were found in the E2E.

Figure 1.

Workflow for the commissioning, implementation, and QA of the ACLR model for the VMAT PSQA.

2.1.1. Clinical data collection

1835 dual-arc VMAT plans from seven institutions were collected. Among these plans, 795 (43.3%) were gynecological cancer plans; 368 (20.1%) were head and neck cancer plans; 312 (17.0%) were rectal cancer plans; 130 (7.1%) were brain cancer plans; 83 (4.5%) were prostate cancer plans; 63 (3.4%) were thoracic cancer plans; 55 (3.0%) were other abdominal cancer plans; 14 (0.8%) were other pelvic cancer plans; and 15 (0.8%) were other cancer plans. All data were anonymized for privacy and collected after approval by the ethics committees of the participating institutions. Informed consent was waived due to the retrospective nature of the study, which was also approved by the ethics committees. As shown in Appendix A. Table A.1, the VMAT plans were generated with the Eclipse™ (Varian Medical Systems, Palo Alto, CA, USA) or RayStation™ (RaySearch Laboratories AB, Stockholm, Sweden) TPS, and delivered with Trilogy, Unique, Clinac iX, TrueBeam, or VitalBeam Linacs. Dosimetric measurements were performed with one of three QA devices: MatriXX (IBA Dosimetry, Schwarzenbruck, Germany), EPID (Varian Medical Systems, Palo Alto, CA, USA), or ArcCHECK (Sun Nuclear, Melbourne, FL, USA). The delivery equipment, QA devices, and TPS, as well as the distribution of datasets in each institution, are listed in Appendix A. Table A.1.

The measured GPRs were set as the reference in this study. We set the action limits at 3%/3mm, 3%/2mm, and 2%/2mm as 90%, 90%, and 80%, respectively [19], and analyzed the distribution of measured GPRs for the 1835 collected plans, which is listed in Appendix A. Table A.2.

The quality and standardization of all plans in this study were ensured: we investigated and analyzed the national status of IMRT PSQA in China [21] and formulated a practice guideline for IMRT PSQA based on the national survey, a multicenter validation test, and expert discussion, which improved the standardization of IMRT PSQA [22].

2.2. Commissioning

2.2.1. Model

The ACLR model was proposed to combine prediction and classification in a single multi-task model. The prediction and classification models share the same main network structure, which simplifies the architecture of the ACLR model and accelerates training. The network structure of the ACLR model is shown in Figure 2; a detailed description was reported in our previous study [19]. To obtain optimal hyperparameters and fit the model, 5-fold cross-validation was used for training. The early model was built on data from a single institution.
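The 5-fold cross-validation used for training can be sketched generically. The index-splitting helper below is only an illustration of the procedure, not the authors' training pipeline; its name and arguments are hypothetical.

```python
import numpy as np

def kfold_splits(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, validation) index arrays
    for each of the k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        # Fold i is held out for validation; the rest form the training set.
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]
```

Each plan appears in exactly one validation fold, so hyperparameters are tuned on held-out data in every round.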

Figure 2.

Schematic of the Autoencoder based Classification-Regression (ACLR) framework. Dataset X is used in the training process and dataset Y in the testing process; both datasets use the same data processing.

In order to use as much data as possible to validate the classification performance, we proposed the Multi-Sites Variational Autoencoders CycleGAN model (MSVCGAN) as shown in Appendix B. Figure B.1 and trained a classification model that combined MSVCGAN with ACLR models as shown in Appendix B. Figure B.2. The details of MSVCGAN were given in Appendix B.

54 metrics reflecting the modulation complexity of VMAT plans were extracted with Matlab as inputs to the ACLR model. A full list of the metrics is given in Appendix C. Table C.1 [18]. The model outputs were the predicted GPRs and the classification results.
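The paper extracts its 54 metrics with Matlab; as a rough Python illustration of the kind of feature involved, the sketch below computes leaf-gap and leaf-travel statistics from MLC bank positions. The function name and these two features are stand-ins, not the authors' actual metric set.

```python
import numpy as np

def aperture_complexity_features(bank_a, bank_b):
    """Illustrative aperture features from MLC leaf positions (mm).
    bank_a/bank_b have shape (n_control_points, n_leaf_pairs)."""
    gaps = bank_b - bank_a  # per-leaf-pair opening at each control point
    # Total absolute leaf travel across consecutive control points, both banks.
    travel = (np.abs(np.diff(bank_a, axis=0)).sum()
              + np.abs(np.diff(bank_b, axis=0)).sum())
    return {
        "mean_leaf_gap": float(gaps.mean()),
        "std_leaf_gap": float(gaps.std()),
        "total_leaf_travel": float(travel),
    }
```

Metrics of this kind quantify how heavily modulated a plan is, which is what the ACLR model correlates with GPRs.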

2.2.2. Model training, validation, and test

We divided the datasets into eight groups (A to H) to verify the prediction accuracy of the ACLR model in Scenarios a), b), and c), covering different delivery equipment, QA devices, and TPS, respectively:

  1. Scenario a) was divided into four groups (Groups A to D), all of which used the same training dataset delivered with Trilogy, while the testing datasets were delivered with Trilogy, Clinac iX, TrueBeam, and VitalBeam, respectively, to validate the prediction performance of the ACLR model across delivery equipment.

  2. Scenario b) was divided into three groups (Groups E, F, and G). The training dataset was acquired with MatriXX and the testing datasets with EPID and ArcCHECK, respectively, to validate the transferability of the ACLR model across QA devices. In addition, a group using the EPID-acquired dataset for training and the ArcCHECK-acquired dataset for testing was also included in this scenario.

  3. Scenario c) was validated with Group H: the training dataset was generated with Eclipse and the testing dataset with RayStation, to validate the transferability of the ACLR model across TPS.

    In Scenarios a) and b), we used only datasets generated with Eclipse because the different dose calculation algorithms built into the Eclipse and RayStation TPS affect the model, as confirmed in Scenario c).

  4. Scenario d) comprised two groups (Group I and Group J) to validate the prediction performance of the ACLR model within a single institution, using the datasets from Institution #3 and Institution #7, respectively.

Other details of the groups for the training/validation set and testing set are given in Table 1.

Table 1.

Groups for testing the prediction accuracy of the ACLR models in multiple scenarios with different delivery equipment, QA devices, and TPS, as well as groups for the prediction accuracy of the ACLR models in an individual institution.

a) Scenario for different delivery equipment

Group Institution QA device TPS Delivery equipment # of VMAT plans

A Training and validation set 4 MatriXX Eclipse Trilogy 649
Testing set 7 EPID Eclipse Trilogy 271

B Training and validation set 4 MatriXX Eclipse Trilogy 649
Testing set 1 ArcCHECK Eclipse iX 65

C Training and validation set 4 MatriXX Eclipse Trilogy 649
Testing set 2 EPID Eclipse TrueBeam 143

D Training and validation set 4 MatriXX Eclipse Trilogy 649
Testing set 6 EPID Eclipse VitalBeam 148

b) Scenario for different QA devices

Group Institution QA device TPS Delivery equipment # of VMAT plans

E Training and validation set 4 MatriXX Eclipse Trilogy 649
Testing set 5,7 EPID Eclipse Unique, Trilogy 339

F Training and validation set 4 MatriXX Eclipse Trilogy 649
Testing set 1,3 ArcCHECK Eclipse iX, Unique 80

G Training and validation set 5,7 EPID Eclipse Unique, Trilogy 339
Testing set 1,3 ArcCHECK Eclipse iX, Unique 80

c) Scenario for different TPSs

Group Institution QA device TPS Delivery equipment # of VMAT plans

H Training and validation set 1, 3, 4, 5, 7 MatriXX, EPID, ArcCHECK Eclipse Unique, Trilogy, iX 1068
Testing set 3, 5 EPID, ArcCHECK RayStation Unique 252

d) Scenario for single-institution

Group Institution QA device TPS Delivery equipment # of VMAT plans

I Training, validation and testing set 3 ArcCHECK, EPID Eclipse, RayStation Unique, TrueBeam 369

J Training, validation and testing set 7 EPID Eclipse Trilogy 271

For classification, the Eclipse plans from Institutions #1–6 were used for model training and validation, and the plans from Institution #7 were used for testing only.

2.2.3. Evaluation of prediction/classification performance

We evaluated the prediction accuracy of the ACLR models with the MAE and root-mean-square error (RMSE) between predicted and measured GPRs, and evaluated the classification accuracy with sensitivity and specificity. Sensitivity measures the ability to identify plans that would fail, while specificity measures the ability to identify plans that would pass. For QA results classification, the action limits at global 3%/2mm and 2%/2mm with a 10% low-dose threshold were 90% and 80%, respectively.
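These evaluation metrics are straightforward to state in code. The sketch below (illustrative names, not the authors' implementation) follows the paper's convention that a "positive" is a failing plan, i.e. one whose measured GPR falls below the action limit.

```python
import numpy as np

def prediction_metrics(measured, predicted):
    """MAE and RMSE (in %) between measured and predicted GPRs."""
    err = np.asarray(predicted, dtype=float) - np.asarray(measured, dtype=float)
    return {"MAE": float(np.abs(err).mean()),
            "RMSE": float(np.sqrt((err**2).mean()))}

def classification_metrics(measured, predicted, action_limit=90.0):
    """Sensitivity: fraction of truly failing plans flagged as failing.
    Specificity: fraction of truly passing plans flagged as passing."""
    m = np.asarray(measured, dtype=float) < action_limit   # true fail
    p = np.asarray(predicted, dtype=float) < action_limit  # predicted fail
    sensitivity = (m & p).sum() / m.sum()
    specificity = (~m & ~p).sum() / (~m).sum()
    return float(sensitivity), float(specificity)
```

With this convention, high sensitivity means few failing plans slip through to delivery without measurement-based QA.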

2.3. Clinical implementation

Prior to clinical implementation, the physicists responsible for VMAT QA were trained on the proper use of the model and the interpretation of its output. Twenty VMAT plans were prospectively collected for an independent E2E. The measurement-based VMAT QA was conducted with a MatriXX ionization chamber array inserted in a Multicube phantom, and the ACLR model was used to predict the GPRs of the twenty plans. The errors between the predicted and measured results were considered “Acceptable” if they were within 3%.

After the E2E, the model was applied clinically. The GPRs were predicted using the commissioned model for 20 new plans. If the model output was “Pass”, the new plans were verified using a secondary independent dose calculation algorithm and approved for delivery. If the model output was “Fail”, the measurement-based QA was performed.

2.4. Quality assurance

After the successful launch of ACLR model-based VMAT QA in the clinic, daily and monthly QA of the model were performed. For daily QA, the trained model predicted the same VMAT plans with known GPRs twice to test the stability of the model. For monthly QA, the 20 standard VMAT plans used for the independent E2E were re-measured to ensure the consistency of the clinical QA devices and delivery equipment. The current and past month's measured and predicted GPRs were compared as quality control (QC) of the ACLR model. The consistency check was considered “Acceptable” when the crosscheck difference was within 2%.
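The monthly crosscheck reduces to a tolerance comparison. A minimal sketch (illustrative names; 2% default tolerance as stated above):

```python
def consistency_check(current_gprs, previous_gprs, tolerance=2.0):
    """Monthly QC crosscheck: True where the absolute GPR difference
    between the current and past month is within tolerance (%)."""
    return [abs(c - p) <= tolerance for c, p in zip(current_gprs, previous_gprs)]
```

Any `False` entry would flag a plan for investigation of the QA device, delivery equipment, or model drift.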

3. Results

The prediction accuracy of the generalized ACLR model for different delivery equipment, QA devices, and TPS across institutions, and within an individual institution, is given in Table 2; it was better than or comparable with that of the early single-institution model. ACLR_s denotes the single-institution ACLR model from our previous work, and ACLR-n (n = A to J) denotes the ACLR models for the different groups.

Table 2.

Prediction accuracy of the ACLR models for different delivery equipment, QA devices, and TPS, as well as the prediction accuracy of the ACLR models in an individual institution.

Scenario Evaluation index DL model Gamma criteria
3%/3mm 3%/2mm

MAE (%) ACLR_s 1.76 2.60

RMSE (%) ACLR_s 2.50 3.50

Scenario a) MAE (%) ACLR-A 1.54 3.30
ACLR-B 1.73 2.42
ACLR-C 2.52 4.42
ACLR-D 2.80 4.60

RMSE (%) ACLR-A 2.03 4.41
ACLR-B 2.13 2.83
ACLR-C 2.83 4.95
ACLR-D 2.98 4.83

Scenario b) MAE (%) ACLR-E 1.83 3.30
ACLR-F 1.95 2.69
ACLR-G 1.30 3.06

RMSE (%) ACLR-E 2.25 4.42
ACLR-F 2.34 3.17
ACLR-G 1.55 3.63

Scenario c) MAE (%) ACLR-H 2.20 3.12

RMSE (%) ACLR-H 2.66 3.77

Scenario d) MAE (%) ACLR-I 1.62 2.44
ACLR-J 1.38 2.80

RMSE (%) ACLR-I 2.13 3.29
ACLR-J 2.02 3.94

Abbreviations: DL=Deep learning; MAE = Mean absolute error; RMSE = Root-mean-square error; ACLR_s = Autoencoder based Classification-Regression deep learning model for a single-institution; ACLR-n (n=A to J) = ACLR models for different groups

In Scenario a), the MAEs were 1.54% and 1.73% at 3%/3mm for Groups A and B, respectively, better than that of the ACLR_s model, and the RMSEs were 2.83% and 2.98% at 3%/3mm for Groups C and D, respectively, comparable with the ACLR_s model. In Scenario b), the MAEs were 1.83%, 1.95%, and 1.30% at 3%/3mm for Groups E, F, and G, respectively, and the corresponding RMSEs were 2.25%, 2.34%, and 1.55%; the prediction accuracy was better than or comparable with the ACLR_s model. In Scenario c), the MAEs were 2.20% and 3.12% at 3%/3mm and 3%/2mm, respectively, and the RMSEs were 2.66% and 3.77%. In Scenario d), the MAEs were 1.62% and 1.38% for Groups I and J, respectively, and the RMSEs were 2.13% and 2.02%.

For the multi-institution ACLR model, the sensitivity and specificity were 90% and 70.1% at 3%/2mm and 94.3% and 81.4% at 2%/2mm, respectively; for the ACLR_s model, they were 92% and 69% at 3%/2mm and 100% and 67% at 2%/2mm. Receiver operating characteristic (ROC) curves at 3%/2mm and 2%/2mm are shown in Figure 3, and the areas under the ROC curves (AUCs) were 0.812 and 0.869 at 3%/2mm and 2%/2mm, respectively.
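The ROC curve is traced by sweeping the decision threshold over the predicted GPRs; the AUC can equivalently be computed with the rank-based (Mann-Whitney) formulation. A sketch, under the paper's convention that failing plans are the positive class (function name illustrative):

```python
import numpy as np

def roc_auc_from_gprs(predicted_gprs, measured_gprs, action_limit=90.0):
    """Rank-based AUC: the probability that a truly failing plan receives a
    lower predicted GPR than a truly passing plan (ties count 0.5)."""
    pred = np.asarray(predicted_gprs, dtype=float)
    fail = np.asarray(measured_gprs, dtype=float) < action_limit
    pos = pred[fail][:, None]   # predicted GPRs of failing (positive) plans
    neg = pred[~fail][None, :]  # predicted GPRs of passing (negative) plans
    return float(np.mean((pos < neg) + 0.5 * (pos == neg)))
```

An AUC of 0.5 would mean the predicted GPRs carry no information for separating failing from passing plans.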

Figure 3.

ROC curves for the model at 3%/2mm (a) and 2%/2mm (b).

For daily QA, the predicted results of the VMAT plans were identical across the repeated runs. The results of the E2E and monthly QA are shown in Figure 4: the absolute differences between predicted and measured results were all within 3% for the E2E. After the E2E, 50 plans were used during implementation at 3%/2mm, of which one failed (GPR of 89.56%). We re-measured these plans with the QA devices, and the re-measured GPR of the failed plan was 88.96%. The MAE was 2.01% and the RMSE was 1.42%. For monthly QA, the results of the 20 re-measured VMAT plans are also presented in Figure 4; the differences between the repeated monthly measured GPRs were all within 2%.

Figure 4.

The results of the E2E and monthly QA of standard VMAT plans.

4. Discussion

An ACLR model for VMAT patient-specific QA was commissioned and clinically implemented. The performance of the ACLR model was validated in multiple institutions using different delivery equipment, QA devices, and TPS.

Several studies [23–29] have shown that measured GPRs are closely related to the delivery equipment, QA devices, and TPS used. If the measured GPRs are inconsistent, the plan complexity metrics used to train the ML model will also be inconsistent, making it hard for the model to find correlations between plan complexity metrics and GPRs and degrading prediction accuracy. Previous studies confirmed that the prediction accuracy of an ML model involving two different QA devices was slightly lower than that of a model involving a single QA device [7, 8]. In this study, the impact of three different QA devices was investigated together with delivery equipment and TPS.

Delivery equipment may impact the prediction accuracy of the ACLR model. In Scenario a), the VMAT plans in the training sets were all delivered with Trilogy, while the plans in the testing sets for Groups A, B, C, and D were delivered with Trilogy, Clinac iX, TrueBeam, and VitalBeam, respectively. Kerns et al. [24] showed that dose delivery characteristics are affected by the delivery equipment: for VMAT deliveries, in addition to the average and maximum leaf speed, the mean gantry speed was also correlated with MLC performance and VMAT delivery. Trilogy and Clinac iX have the same maximum gantry speed of 4.8°/s, which has little impact on the 54 extracted complexity parameters and preserves the consistency of the training and testing sets. However, the maximum gantry speed of TrueBeam and VitalBeam is 6°/s, which makes the complexity parameters in the testing set differ from those in the training set. Gantry speed significantly impacts the calculation of complexity parameters including the modulation index for leaf speed and acceleration, leaf speed and acceleration, mean leaf speed and acceleration, standard deviation of leaf speed and acceleration, mean dose rate, and standard deviation of dose rate. Thus, different delivery equipment may cause differences in the features learned by the ACLR model, which explains the poorer results when different delivery equipment was used for the training and testing sets.
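The gantry-speed dependence described above can be made explicit: because control points are tied to gantry angles, the assumed gantry speed sets the time per segment and thus scales time-based features such as leaf speed. The sketch below assumes, for illustration only, that the gantry moves at its maximum speed throughout (real deliveries modulate gantry speed), and the function name is hypothetical.

```python
import numpy as np

def mean_leaf_speed(leaf_positions, gantry_angles, max_gantry_speed):
    """Mean leaf speed (mm/s) across control points, assuming the gantry
    moves at max_gantry_speed (deg/s). leaf_positions: (n_cp, n_leaves)."""
    angles = np.asarray(gantry_angles, dtype=float)
    dt = np.abs(np.diff(angles)) / max_gantry_speed           # time per segment (s)
    travel = np.abs(np.diff(np.asarray(leaf_positions, dtype=float), axis=0))
    return float((travel / dt[:, None]).mean())
```

For the same plan, using 6°/s (TrueBeam/VitalBeam) instead of 4.8°/s (Trilogy/Clinac iX) shrinks every segment time by the same factor and inflates all derived leaf speeds, shifting the feature distribution between training and testing sets.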

Scenario b) showed that the QA devices in this study had little impact on the prediction accuracy of the ACLR model, as the extracted metrics include the features that matter for these QA devices. In this scenario, the prediction errors may be related to the variability of measured GPRs across QA devices: Hussein et al. [30] found that measurements from different QA devices are inconsistent, with different devices showing different levels of agreement with the predicted gamma analysis at the same gamma criteria. Since the QA devices had little impact on prediction accuracy, a model trained with data from one QA device can be applied to other QA devices.

The prediction accuracy of the ACLR model was highly correlated with the TPS. The GPR prediction error may result from the different dose calculation algorithms embedded in the Eclipse and RayStation TPS and from the gantry angle sampling. The dose calculation algorithm and beam modeling parameters are distinct among TPS, which introduces differences in TPS commissioning. In addition, the gantry angle sampling between control points was 2° for Eclipse and 4° for RayStation, which made complexity parameters including the modulation index for total modulation and the mean and standard deviation of dose rate significantly different between Eclipse and RayStation plans. Thus, the plan complexity parameters extracted from VMAT plans and the measured GPRs varied among TPS, which affects the prediction accuracy of the ACLR model. The classification accuracy of the ACLR model on unbalanced data had been verified in a single institution. In previous studies [7, 8, 12, 13, 18], the GPRs of most VMAT plans were high and only a few were low; in this study, the distribution of measured GPRs likewise followed a skewed rather than a normal distribution. To balance the numbers of positive and negative plans and improve classification accuracy, a balanced sampling technique was used during training. However, the classification accuracy in the multi-institution setting was still imperfect and needs improvement in the future.

To safely introduce an artificial intelligence-based model into clinical practice, systematic validation and testing should be completed. Clinical implementation and a consistency check of the ACLR model in PSQA have been conducted; to the best of our knowledge, this is the first study to investigate the commissioning, clinical implementation, and routine QA of an AI-based approach to IMRT QA. The actionable recommendations provide a reference for the clinical use of the model in PSQA. Institutions that use the ACLR model for PSQA need to train the relevant personnel and conduct an E2E of the model, and a systematic QA program should be established and strictly followed. We recommend that each institution measure 3–5 representative disease-site plans per month, chosen from the standard plans used at the E2E (20 plans in our case) with known GPRs, using their respective QA devices, and predict the GPRs of these plans with the ACLR model. We also recommend selecting 3–5 new representative disease-site plans each month for the same task, to ensure the adaptability of the model to new plans. This would serve as QC of the ACLR model's reproducibility and stability, ensuring the integrity of the virtual PSQA program. In addition to the daily and monthly QA performed in this study, annual QA should also be considered: annually, or after any major upgrade or repair of delivery equipment, QA devices, or TPS, a set of new plans covering representative sites should be used to re-validate the model, and if necessary the model should be re-trained with new datasets. Continuous, systematic recording and analysis of model performance and potential risks will help update and improve the model.

5. Conclusion

The generalized ACLR model accurately predicted PSQA results in multi-institution scenarios, and its performance was better than or comparable with that of the early single-institution model. The new ACLR model can be deployed to the clinic with proper commissioning and periodic QA/QC to help reduce medical physicists' workload.

Highlights:

  • The first study on commissioning and clinical implementation of an AI model for IMRT QA

  • The largest multi-institution dataset for AI model training/validation and testing

  • An executable QA program was proposed for AI-based IMRT QA for the first time

  • This study demonstrated that the AI model can be used for IMRT QA clinically

  • A classification-regression integrated AI model was developed with better performance

Acknowledgments

The authors thank Danhong Ding from the Affiliated Cancer Hospital of Zhengzhou University, Jinyan Hu from the First Affiliated Hospital of Zhengzhou University, Wenli Lu and Xin Yi from The First Affiliated Hospital of Chongqing Medical University, Wei Zhang from Yantai Yuhuangding Hospital, Qiang Zhao from Shanxi Provincial Cancer Hospital, Jinyuan Wang from General Hospital of the People’s Liberation Army, and Xile Zhang and Qilin Zhang from Peking University Third Hospital.

Funding Source:

This work was partly supported by Beijing Municipal Commission of science and technology collaborative innovation project (Z201100005620012 and Z181100001518005), Capital’s Funds for Health Improvement and Research (2020-2Z-40919), National Natural Science Foundation of China (No. 81071237, No. 11735003, No. 11975041, No. 11961141004, No. 61773380, and No. 82022035), the Natural Science Foundation of Beijing (7202223), Key project of Henan Provincial Department of Education (20B320035), and the NIH/NCI P30 Cancer Center Support Grant (No. CA008748).

The funding organizations had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Appendix A.

Table A.1.

Delivery equipment, QA devices, TPS, and dose calculation algorithms used, and the distribution of VMAT plans in seven institutions.

TPS Delivery equipment QA device Institution
1 2 3 4 5 6 7

Eclipse Trilogy MatriXX 649
EPID 271
iX ArcCHECK 65
Unique EPID 68
ArcCHECK 15
TrueBeam EPID 143 156
ArcCHECK 19
VitalBeam EPID 148
RayStation Unique EPID 122
ArcCHECK 130
TrueBeam ArcCHECK 49
Dose calculation algorithm AAA AAA AAA/CCC AAA AAA/CCC AXB AAA/AXB
Total 65 143 369 649 190 148 271

Table A.2.

The distribution of GPRs (Pass/Fail) at each gamma criteria.

Gamma criteria 3%/3mm
(Action limit: 90%)
3%/2mm
(Action limit: 90%)
2%/2mm
(Action limit: 80%)



Pass Fail Pass Fail Pass Fail

Institution 1 65 0 65 0 65 0
Institution 2 143 0 143 0 143 0
Institution 3 368 1 337 32 314 55
Institution 4 606 43 561 88 550 99
Institution 5 190 0 187 3 188 2
Institution 6 148 0 148 0 148 0
Institution 7 270 1 241 30 236 35

Appendix B.

MSVCGAN

In order to use as much data as possible to validate the classification performance, we proposed the Multi-Sites Variational Autoencoders CycleGAN model (MSVCGAN), shown in Appendix B. Figure B.1, and trained a classification model that combines MSVCGAN with the ACLR model, shown in Appendix B. Figure B.2. CycleGAN [1] is an unsupervised method to learn the mapping between two datasets. Three datasets were used to establish three CycleGAN models; the forward training process is depicted in Appendix B. Figure B.1. A variational autoencoder (VAE) serves as the generator and realizes the mapping between two datasets. The reconstructed data is the output of the VAE decoder, as shown in Appendix B. Figures B.1 and B.2; after training, its distribution is very close to that of the input data. The reconstruction loss between reconstructed and input data was used to train the VAE. As shown in Appendix B. Figure B.1, fake data is the output of the generator in the generative adversarial network (GAN). The fake data and input data (also called real data) were used to train the GAN. Discriminator A corresponds to VAE A and distinguishes real data A from fake data B and C, and vice versa.

The loss function of the models included reconstruction loss LVAE for the VAE part, adversarial loss LGAN and cycle loss LCycle for the GAN part [1], and classification loss LC and regression loss (MSE Loss) LR for the ACLR part.

For the classification loss, a new loss function L_C, built from the MSE loss, was used to improve classification performance:

L_C = MSE(p_1, 1) + MSE(p_0, 0)

Target_i(x) is the i-th GPR of data x, and TH_i is the i-th action limit of data x. For p, p_1, and p_0, we have

p = 1 − exp(−α·|Target_i(x) − TH_i|),
p_1 = β + (1 − β)·p,
p_0 = (1 − β) − (1 − β)·p.

Here exp is the exponential function, α is a positive number (here α = log(10)), and β is a positive number between 0.5 and 0.9 (here β = 0.6). Therefore, p_1 ranges from 0.6 to 1, and p_0 ranges from 0 to 0.4.

Since exp(−α·|Target_i(x) − TH_i|) < exp(0) = 1, p is a number between 0 and 1. When Target_i(x) and TH_i are close, p is close to 0; when they differ substantially, p is close to 1. Therefore, p can be regarded as our confidence that a given sample belongs to a particular class. We tried a series of hyperparameters, β from 0.5 to 0.9, and found that β = 0.6 gave the best classification performance. After defining p and β, p_1 and p_0 can be calculated: p_1 describes the probability that a sample belongs to the positive class, and p_0 the probability that it belongs to the negative class. For every batch of samples, p_1 is calculated for the positive class only (range 0.6 to 1) and p_0 for the negative class only (range 0 to 0.4). The loss L_C uses this energy-style function, which makes the calculation of the classification loss smoother and yields a smoother loss gap between the positive and negative classes.
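The definitions above can be sketched in a few lines. The function below is an illustrative re-implementation of L_C from the stated definitions of p, p_1, and p_0; the function name and batch interface are ours, not the authors' code:

```python
import math

def classification_loss(targets, limits, labels, alpha=math.log(10), beta=0.6):
    """Illustrative sketch of L_C, assuming per-sample GPRs and action limits.

    targets: GPR values Target_i(x)
    limits:  action limits TH_i
    labels:  1 for the positive (pass) class, 0 for the negative (fail) class
    """
    loss, n = 0.0, 0
    for t, th, y in zip(targets, limits, labels):
        p = 1.0 - math.exp(-alpha * abs(t - th))   # confidence, in (0, 1)
        if y == 1:
            p1 = beta + (1.0 - beta) * p           # in (beta, 1)
            loss += (p1 - 1.0) ** 2                # MSE(p_1, 1)
        else:
            p0 = (1.0 - beta) * (1.0 - p)          # in (0, 1 - beta)
            loss += (p0 - 0.0) ** 2                # MSE(p_0, 0)
        n += 1
    return loss / n
```

A passing sample far above its action limit contributes almost no loss, while a failing sample whose GPR sits exactly at the limit contributes the maximum negative-class penalty of (1 − β)² = 0.16.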

Figure B.1.


The forward training process of MSVCGAN. Three VAE-based generators generate the fake data and the reconstructed data; the discriminators determine whether a sample is real or fake. The yellow arrows indicate the direction of data flow.

Figure B.2.


The structure of the MSVCGAN (generator, discriminator) connected with the ACLR model (MSVCGAN-ACLR). After training the MSVCGAN, the hidden state z of generator X is used as the input feature of the ACLR model. The detailed structures and parameters of the generator and discriminator are also displayed in this figure.

Reference

  • 1. Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, 2017; 2242–2251.

Appendix C.

The 54 complexity metrics in our study comprised aperture area metrics, aperture complexity metrics, leaf speed metrics, dose metrics, and metrics read directly from the TPS.

Aperture area metrics comprised the union aperture area (UAA), small aperture score (SAS), edge area metric (EAM), and leaf gap metrics (average leaf gap, ALG, and standard deviation of leaf gap, SLG). UAA was the union area of all individual apertures in each arc, weighted by the number of MUs delivered. SAS was the proportion of apertures defined as small, i.e., where the MLC leaf separation was less than a certain value (5, 10, or 20 mm). EAM was the proportion of penumbra area to field area. The leaf gap described the positional relationship between a pair of opposing leaves.
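As an illustration, the SAS idea can be sketched as an MU-weighted fraction of open leaf pairs whose separation falls below the threshold; the array layout and function name below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def small_aperture_score(left, right, mu_weights, threshold_mm=5.0):
    """Illustrative SAS sketch.

    left, right: (n_cp, n_pairs) leaf-bank positions in mm at each control point
    mu_weights:  (n_cp,) MU delivered at each control point
    Returns the MU-weighted fraction of open leaf pairs with gap < threshold_mm.
    """
    left, right = np.asarray(left, float), np.asarray(right, float)
    gaps = right - left                          # leaf separation per pair
    open_pairs = gaps > 0                        # closed pairs do not count
    small = (gaps < threshold_mm) & open_pairs
    # fraction of small apertures per control point (guard against no open pairs)
    per_cp = small.sum(axis=1) / np.maximum(open_pairs.sum(axis=1), 1)
    return float(np.average(per_cp, weights=np.asarray(mu_weights, float)))
```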

Aperture complexity metrics comprised the modulation complexity score (MCS), edge metric (EM), plan irregularity (PI), plan modulation (PM), converted aperture metric (CAM), circumference/area (C/A), and mean asymmetry distance (MAD). MCS combined leaf sequence variability and aperture shape modulation into a single metric. EM was the ratio of the aperture perimeter defined by the MLC leaf sides to the aperture area. PI and C/A represented the complexity of the field shape in the plan. PM reflected the degree of dispersion of subfields. CAM represented the complexity of field modulation. MAD represented the offset distance of the field center from the central axis of the field.
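To make the EM idea concrete, a simplified single-control-point sketch is shown below; it assumes a uniform leaf width and approximates the leaf-side perimeter by the steps between adjacent leaf tips, so it is a conceptual illustration rather than the exact published formula:

```python
import numpy as np

def edge_metric_cp(left, right, leaf_width_mm=5.0):
    """Illustrative edge-metric sketch for one control point:
    MLC-leaf-side perimeter divided by aperture area (simplified)."""
    left, right = np.asarray(left, float), np.asarray(right, float)
    gaps = np.maximum(right - left, 0.0)
    area = gaps.sum() * leaf_width_mm            # mm^2, sum of open strips
    # exposed side edges: position steps between adjacent leaf tips on each bank
    side = np.abs(np.diff(left)).sum() + np.abs(np.diff(right)).sum()
    return side / area if area > 0 else 0.0
```

A highly irregular aperture (large tip-to-tip steps) raises the perimeter term and hence the metric, matching the intent described above.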

Leaf speed metrics comprised the average leaf speed (ALS), standard deviation of leaf speed (SLS), average leaf acceleration (ALA), standard deviation of leaf acceleration (SLA), leaf travel distance (LT), and the modulation indices of leaf speed and leaf acceleration, all of which reflected features of leaf motion.
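For example, ALS and SLS can be estimated from control-point data by finite differences; the array shapes and function name below are our assumptions:

```python
import numpy as np

def leaf_speed_stats(leaf_positions_cm, cp_times_s):
    """Illustrative sketch: per-segment leaf speeds via finite differences.

    leaf_positions_cm: (n_cp, n_leaves) leaf positions at each control point
    cp_times_s:        (n_cp,) cumulative delivery time at each control point
    Returns (ALS, SLS): mean and standard deviation of leaf speed in cm/s.
    """
    pos = np.asarray(leaf_positions_cm, float)
    t = np.asarray(cp_times_s, float)
    speeds = np.abs(np.diff(pos, axis=0)) / np.diff(t)[:, None]  # cm/s
    return float(speeds.mean()), float(speeds.std())
```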

Dose metrics comprised the plan-normalized monitor units (PMU), dose rate metrics (average dose rate, ADR, and standard deviation of dose rate, SDR), gantry speed metrics (average gantry speed, AGS, and standard deviation of gantry speed, SGS), and the modulation index for total modulation. PMU was computed by dividing the total MU of a VMAT plan by the fractional target dose and then multiplying by 2 Gy. The dose rate metrics reflected the degree of dose-rate variation, and the gantry speed metrics the degree of gantry-speed variation. The modulation index for total modulation considered the variation of leaf speed, gantry speed, and dose rate in the VMAT plans.
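The PMU calculation described above reduces to a one-line normalization (the function name is ours):

```python
def plan_normalized_mu(total_mu, fraction_dose_gy):
    """PMU: total plan MU divided by the fractional target dose,
    then multiplied by 2 Gy, as described above."""
    return total_mu / fraction_dose_gy * 2.0
```

For instance, a plan delivering 600 MU for a 3 Gy fraction has PMU = 400.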

Table C.1.

Summary of complexity metrics used in this study.

Number Metrics Reference

1 Modulation index for leaf speed f=2 (MIs 2) [1]
2 Modulation index for leaf speed f=1 (MIs 1) [1]
3 Modulation index for leaf speed f=0.5 (MIs 0.5) [1]
4 Modulation index for leaf speed f=0.2 (MIs 0.2) [1]
5 Modulation index for leaf acceleration f=2 (MIa 2) [1]
6 Modulation index for leaf acceleration f=1 (MIa 1) [1]
7 Modulation index for leaf acceleration f=0.5 (MIa 0.5) [1]
8 Modulation index for leaf acceleration f=0.2 (MIa 0.2) [1]
9 Modulation index for total modulation f=2 (MIt 2) [1]
10 Modulation index for total modulation f=1 (MIt 1) [1]
11 Modulation index for total modulation f=0.5 (MIt 0.5) [1]
12 Modulation index for total modulation f=0.2 (MIt 0.2) [1]
13 Proportion of leaf speed ranging from 0 to 0.4 cm/s (S0–0.4) [2]
14 Proportion of leaf speed ranging from 0.4 to 0.8 cm/s (S0.4–0.8) [2]
15 Proportion of leaf speed ranging from 0.8 to 1.2 cm/s (S0.8–1.2) [2]
16 Proportion of leaf speed ranging from 1.2 to 1.6 cm/s (S1.2–1.6) [2]
17 Proportion of leaf speed ranging from 1.6 to 2.0 cm/s (S1.6–2) [2]
18 Proportion of leaf acceleration ranging from 0 to 1 cm/s2 (A0–1) [2]
19 Proportion of leaf acceleration ranging from 1 to 2 cm/s2 (A1–2) [2]
20 Proportion of leaf acceleration ranging from 2 to 4 cm/s2 (A2–4) [2]
21 Proportion of leaf acceleration ranging from 4 to 6 cm/s2 (A4–6) [2]
22 Average leaf speed (ALS) [2]
23 Standard deviation of leaf speed (SLS) [2]
24 Average leaf acceleration (ALA) [2]
25 Standard deviation of leaf acceleration (SLA) [2]
26 Small aperture score 5mm (SAS 5mm) [3]
27 Small aperture score 10mm (SAS 10mm) [3]
28 Small aperture score 20mm (SAS 20mm) [3]
29 Mean asymmetry distance (MAD) [3]
30 Modulation complex score (MCS) [4]
31 Leaf sequence variability (LSV) [4]
32 Aperture area variability (AAV) [4]
33 Plan area (PA) [5]
34 Plan irregularity (PI) [5]
35 Plan modulation (PM) [5]
36 Plan normalized MU (PMU) [5]
37 Union aperture area (UAA) [5]
38 Edge metric (EM) [6]
39 Converted aperture metric (CAM) [7]
40 Edge area metric (EAM) [7]



41 Circumference/area (C/A) [7]
42 Average leaf travel distance (LT) [8]
43 Combination of LT and MCS (LTMCS) [8]
44 Average leaf gap (ALG) [9]
45 Standard deviation of leaf gap (SLG) [9]
46 Average dose rate (ADR) -
47 Standard deviation of dose rate (SDR) -
48 MU value in first arc (MU 1) -
49 MU value in second arc (MU 2) -
50 Prescribed dose to primary target per fraction (Dose) -
51 Field length at X direction in first arc (Field X1) -
52 Field length at Y direction in first arc (Field Y1) -
53 Field length at X direction in second arc (Field X2) -
54 Field length at Y direction in the second arc (Field Y2) -

Complexity metrics marked “−” in the reference column can be extracted or calculated directly from plan information in the TPS.

Reference

1. Park JM, Park SY, Kim H, Kim JH, Carlson J, Ye SJ. Modulation indices for volumetric modulated arc therapy. Phys Med Biol 2014; 59(23):7315–7340.

2. Park JM, Wu HG, Kim JH, et al. The effect of MLC speed and acceleration on the plan delivery accuracy of VMAT. Br J Radiol 2015; 88(1049):20140698.

3. Crowe SB, Kairn T, Middlebrook N, et al. Examination of the properties of IMRT and VMAT beams and evaluation against pre-treatment quality assurance results. Phys Med Biol 2015; 60(6):2587–2601.

4. McNiven AL, Sharpe MB, Purdie TG. A new metric for assessing IMRT modulation complexity and plan deliverability. Med Phys 2010; 37(2):505–515.

5. Du W, Cho SH, Zhang X, et al. Quantification of beam complexity in intensity-modulated radiation therapy treatment plans. Med Phys 2014; 41(2):021716.

6. Younge KC, Matuszak MM, et al. Penalization of aperture complexity in inversely planned volumetric modulated arc therapy. Med Phys 2012; 39(11):7160–7170.

7. Götstedt J, Karlsson HA, Bäck A. Development and evaluation of aperture-based complexity metrics using film and EPID measurements of static MLC openings. Med Phys 2015; 42(7):3911–3921.

8. Masi L, Doro R, Favuzza V, et al. Impact of plan parameters on the dosimetric accuracy of volumetric modulated arc therapy. Med Phys 2013; 40(7):071718.

9. Nauta M, Villarreal-Barajas JE, Tambasco M. Fractal analysis for assessing the level of modulation of IMRT fields. Med Phys 2011; 38(10):5385–5393.

Footnotes

Conflict of interest

The authors state that they have no conflict of interest concerning this study.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Reference

  • [1]. Popescu CC, Olivotto IA, Beckham WA, Ansbacher W, Zavgorodni S, Shaffer R, Wai ES, Otto K. Volumetric modulated arc therapy improves dosimetry and reduces treatment time compared to conventional intensity-modulated radiotherapy of left-sided breast cancer and internal mammary nodes. Int J Radiat Oncol Biol Phys 2010; 76:287–295.
  • [2]. Nicolini G, Ghosh-Laskar S, Shrivastava SK, Banerjee S, Chaudhary S, Agarwal JP, Munshi A, Clivio A, Fogliata A, Mancosu P, Vanetti E, Cozzi L. Volumetric modulation arc radiotherapy with flattening filter-free beams compared with static gantry IMRT and 3D conformal radiotherapy for advanced esophageal cancer: a feasibility study. Int J Radiat Oncol Biol Phys 2012; 84:553–560.
  • [3]. Klein EE, Hanley J, Bayouth J, Yin F-F, Simon W, Dresser S, Serago C, Aguirre F, Ma L, Arjomandy B, Liu C, Sandin C, Holmes T. Task Group 142 report: quality assurance of medical accelerators. Med Phys 2009; 36:4197–4212.
  • [4]. Smilowitz JB, Das IJ, Feygelman V, Fraass BA, Geurts M, Kry SF, Marshall IR, Mihailidis DN, Ouhib Z, Ritter T, Snyder MG, Fairobent L, AAPM Staff. AAPM Medical Physics Practice Guideline 5.a.: Commissioning and QA of treatment planning dose calculations - megavoltage photon and electron beams. J Appl Clin Med Phys 2016; 17:457–457.
  • [5]. Kalet AM, Luk SMH, Phillips MH. Radiation therapy quality assurance tasks and tools: the many roles of machine learning. Med Phys 2020; 47:e168–e177.
  • [6]. Chan MF, Witztum A, Valdes G. Integration of AI and machine learning in radiotherapy QA. Front Artif Intell 2020; 3:577620.
  • [7]. Valdes G, Scheuermann R, Hung CY, Olszanski A, Bellerive M, Solberg TD. A mathematical framework for virtual IMRT QA using machine learning. Med Phys 2016; 43:4323–4334.
  • [8]. Valdes G, Chan MF, Lim SB, Scheuermann R, Deasy JO, Solberg TD. IMRT QA using machine learning: a multi-institutional validation. J Appl Clin Med Phys 2017; 18:279–284.
  • [9]. Ono T, Hirashima H, Iramina H, Mukumoto N, Miyabe Y, Nakamura M, Mizowaki T. Prediction of dosimetric accuracy for VMAT plans using plan complexity parameters via machine learning. Med Phys 2019; 46:3823–3832.
  • [10]. Lam D, Zhang X, Li H, Yang D, Schott B, Zhao T, Zhang W, Mutic S, Sun B. Predicting gamma passing rates for portal dosimetry-based IMRT QA using machine learning. Med Phys 2019; 46:4666–4675.
  • [11]. Granville DA, Sutherland JG, Belec JG, La Russa DJ. Predicting VMAT patient-specific QA results using a support vector classifier trained on treatment plan characteristics and linac QC metrics. Phys Med Biol 2019; 64.
  • [12]. Interian Y, Rideout V, Kearney VP, Gennatas E, Morin O, Cheung J, Solberg T, Valdes G. Deep nets vs expert designed features in medical physics: an IMRT QA case study. Med Phys 2018; 45:2672–2680.
  • [13]. Tomori S, Kadoya N, Takayama Y, Kajikawa T, Shima K, Narazaki K, Jingu K. A deep learning-based prediction model for gamma evaluation in patient-specific quality assurance. Med Phys 2018; 45:4055–4065.
  • [14]. Mahdavi SR, Tavakol A, Sanei M, Molana SH, Arbabi F, Rostami A, Barimani S. Use of artificial neural network for pretreatment verification of intensity modulation radiation therapy fields. Br J Radiol 2019; 92:20190355.
  • [15]. Kimura Y, Kadoya N, Tomori S, Oku Y, Jingu K. Error detection using a convolutional neural network with dose difference maps in patient-specific quality assurance for volumetric modulated arc therapy. Phys Med 2020; 73:57–64.
  • [16]. Wootton LS, Nyflot MJ, Chaovalitwongse WA, Ford E. Error detection in intensity-modulated radiation therapy quality assurance using radiomic analysis of gamma distributions. Int J Radiat Oncol Biol Phys 2018; 102:219–228.
  • [17]. Nyflot MJ, Thammasorn P, Wootton LS, Ford EC, Chaovalitwongse WA. Deep learning for patient-specific quality assurance: identifying errors in radiotherapy delivery by radiomic analysis of gamma images with convolutional neural networks. Med Phys 2019; 46:456–464.
  • [18]. Li JQ, Wang L, Zhang XL, Liu L, Li J, Chan MF, Sui J, Yang RJ. Machine learning for patient-specific quality assurance of VMAT: prediction and classification accuracy. Int J Radiat Oncol Biol Phys 2019; 105:893–902.
  • [19]. Wang L, Li J, Zhang S, Zhang X, Zhang Q, Chan MF, Yang R, Sui J. Multi-task autoencoder based classification-regression model for patient-specific VMAT QA. Phys Med Biol 2020; 65:235023.
  • [20]. Vandewinckele L, Claessens M, Dinkla A, Brouwer C, Crijns W, Verellen D, van Elmpt W. Overview of artificial intelligence-based applications in radiotherapy: recommendations for implementation and quality assurance. Radiother Oncol 2020; 153:55–66.
  • [21]. Pan Y, Yang R, Zhang S, Li J, Dai J, Wang J, Cai J. National survey of patient specific IMRT quality assurance in China. Radiat Oncol 2019; 14.
  • [22]. National Cancer Center NCQCC, Yang RJ. Practice guideline of patient-specific dosimetric verification for intensity-modulated radiotherapy. Chinese Journal of Radiation Oncology 2020; 29:1021–1024 (in Chinese).
  • [23]. Kerns JR, Childress N, Kry SF. A multi-institution evaluation of MLC log files and performance in IMRT delivery. Radiat Oncol 2014; 9:176.
  • [24]. Smilowitz JB, Das IJ, Feygelman V, Fraass BA, Kry SF, Marshall IR, Mihailidis DN, Ouhib Z, Ritter T, Snyder MG, Fairobent L. AAPM Medical Physics Practice Guideline 5.a.: Commissioning and QA of treatment planning dose calculations - megavoltage photon and electron beams. J Appl Clin Med Phys 2015; 16:14–34.
  • [25]. Glide-Hurst C, Bellon M, Foster R, Altunbas C, Speiser M, Altman M, Westerly D, Wen N, Zhao B, Miften M, Chetty IJ, Solberg T. Commissioning of the Varian TrueBeam linear accelerator: a multi-institutional study. Med Phys 2013; 40:031719.
  • [26]. Glenn MC, Peterson CB, Followill DS, Howell RM, Pollard-Larkin JM, Kry SF. Reference dataset of users' photon beam modeling parameters for the Eclipse, Pinnacle, and RayStation treatment planning systems. Med Phys 2020; 47:282–288.
  • [27]. Nelms BE, Chan MF, Jarry G, Lemire M, Lowden J, Hampton C, Feygelman V. Evaluating IMRT and VMAT dose accuracy: practical examples of failure to detect systematic errors when applying a commonly used metric and action levels. Med Phys 2013; 40:111722.
  • [28]. Ezzell GA, Burmeister JW, Dogan N, LoSasso TJ, Mechalakos JG, Mihailidis D, Molineu A, Palta JR, Ramsey CR, Salter BJ, Shi J, Xia P, Yue NJ, Xiao Y. IMRT commissioning: multiple institution planning and dosimetry comparisons, a report from AAPM Task Group 119. Med Phys 2009; 36:5359–5373.
  • [29]. Hussein M, Rowshanfarzad P, Ebert MA, Nisbet A, Clark CH. A comparison of the gamma index analysis in various commercial IMRT/VMAT QA systems. Radiother Oncol 2013; 109:370–376.
