Medical Physics. 2019 Apr 15;46(5):2006–2014. doi: 10.1002/mp.13515

Characterization of a Bayesian network‐based radiotherapy plan verification model

Samuel M H Luk 1, Juergen Meyer 1, Lori A Young 1, Ning Cao 1, Eric C Ford 1, Mark H Phillips 1,2, Alan M Kalet 1
PMCID: PMC9559708  PMID: 30927253

Abstract

Purpose

The current process for radiotherapy treatment plan quality assurance relies on human inspection of treatment plans, which is time-consuming, error-prone, and often reliant on inconsistently applied professional judgment. A previous proof-of-principle paper described the use of a Bayesian network (BN) to aid in this process. This work studied how such a BN could be expanded and trained to better represent clinical practice.

Methods

We obtained 51 540 unique radiotherapy cases, including diagnostic, prescription, plan/beam, and therapy setup factors, from a de-identified Elekta oncology information system at a single institution covering the years 2010–2017. Using a knowledge base derived from clinical experience, the factors were organized into a 29-node, 40-edge BN representing dependencies among the variables. Conditional probabilities were machine learned with an expectation maximization algorithm using all data except a subset of 500 patient cases withheld for testing. Different classes of errors, drawn from incident learning systems, were introduced into the withheld test cases. Networks were trained with datasets of different sizes, as well as with data from epochs of different lengths and from different eras. Performance under these different conditions was evaluated by means of the area under the receiver operating characteristic curve (AUC).

Results

Our performance analysis found AUCs of 0.82, 0.85, 0.89, and 0.88 in networks trained with 2-yr, 3-yr, 4-yr, and 5-yr windows, respectively. With a 4-yr sliding window, we found an AUC reduction of 3% per year when moving the window back in time in 1-yr steps. Comparing the 4-yr window moved back by 4 yr (2010–2013 vs 2014–2017), the largest component of the overall reduction in AUC over time was the loss of detection performance for plan/beam error types.

Conclusions

The expanded BN method demonstrates the ability to detect classes of errors commonly encountered in radiotherapy planning. The results suggest that a 4‐yr training dataset optimizes the performance of the network in this institutional dataset, and that yearly updates are sufficient to capture the evolution of clinical practice and maintain fidelity.

Keywords: artificial intelligence, Bayesian network, error detection, quality assurance

1. Introduction

Errors in the process of radiation therapy planning and delivery can impact outcomes, reduce the therapeutic effects, and increase normal tissue damage. Unfortunately, such errors do occur in the complex discipline of radiation oncology. For example, a reanalysis of the TROG 02.02 head-and-neck clinical trial data showed that overall survival was significantly impacted by plan quality1 and that such deviations affected approximately 10% of patients. Similar effects were seen in the RTOG 9704 pancreatic cancer trial2 and in other trials as well.3, 4 Data from the Radiation Oncology Incident Learning System (RO-ILS) also suggest that problematic treatment plans, such as those in which the dose per fraction in the treatment and in the plan differ, are one of the major error pathways in radiation oncology.5

Until a few decades ago, radiation therapy treatments could be described by a few dozen variables. Quality assurance (QA) procedures involved multiple people independently examining and verifying these parameters. Today, the situation is markedly more complex. Publications on risk and hazard assessment of the radiotherapy process using failure mode and effects analysis document the extreme complexity of the process and the numerous error pathways.6 Each treatment is a procedure that requires the successful completion of over 200 separate tasks performed by many people over the course of days, weeks, and, in some cases, months. Many of those tasks involve the use of computer systems to track patient information, to image the patient, to compute treatment-related parameters, and to coordinate a vast amount of data. It is clear that the number of variables and systems that must be monitored is beyond the scope of human attention and that continuing to use standard QA measures results in a plethora of data that cannot be adequately inspected for errors. Moreover, the trend toward hypofractionation increases the possibility that undetected errors could lead to significant deleterious effects.

To address the need for improved methods of error detection, we proposed a novel artificial intelligence (AI) approach using Bayesian networks (BNs). This approach was taken because, although automated rules-based approaches have been developed to assist the plan checking process7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21 and have led to reduced error rates and increased efficiency,22 they have significant limitations and do not provide the kind of domain knowledge and human reasoning that is essential to a comprehensive review of treatment plans.

Bayesian networks are an appropriate tool for this sort of problem for several reasons: (a) their structure represents the relationships between elements and variables that best mirrors our human understanding of them, (b) their calculation of conditional probabilities encapsulates our reasoning process in looking for suitability of variable values, (c) they tend to handle missing data better than many other algorithms, and (d) the incorporation of clinical data in the form of the conditional probability tables (CPT) assures that the results of the network are directly related to clinical reality. Our BN‐based approach of error detection aims to provide automated assistance in the judgment‐type components of plan checking by directing the clinical physicists, physicians and others to more closely inspect parts of the plan that are flagged as potential errors, thereby improving error detection and also reducing the ever‐increasing and unsustainable burden on physicists and others responsible for detecting those errors.

Our original proof-of-principle paper demonstrated that an appropriate BN could be constructed, that its performance was comparable to humans even with relatively few variables, and that it was feasible to mine an existing oncology information system and apply machine learning to the extracted data to determine the CPTs. Although the proof-of-principle studies were promising, important questions remain as to how much data are needed to train the network, how often the network should be updated to stay effective, and whether differential performance exists among different error types. This paper reports our work on answering these questions in an expanded BN that includes a wider range of clinically relevant variables. The results of this study inform our recommendations for training dataset size and updating frequency.

2. Materials and methods

2.A. Network topology

As probabilistic directed acyclic graphical models, BNs require both a topological structure of nodes and edges indicating the direction and relation of causal concepts, and data tables containing the concrete conditional probabilities between linked concepts.23 The method we used to expand our existing probabilistic network topology includes the use of structured surveys and semistructured interviews of professionals in the field.24, 25, 26 The survey results were combined with concept extraction from a knowledge‐based ontology of causal relationships27 to create a comprehensive network topology which encompasses a wide range of jointly dependent therapy plan parameters. The causal relation set introduces an independent set of knowledge containing real‐world–derived understanding of dependencies among variables in radiation oncology, backed by well‐established domain literature.27

In a previous work,28 we constructed a simple model (nine nodes) and described an ontology relating some core concepts with more detailed descriptions of therapy variables in a dependency-related model. In this work, we built upon these simpler models to construct a larger network designed to capture and describe more granular information while covering a wider scope. Our second-generation model, shown in Fig. 1, contains 29 nodes covering diagnostic parameters (T_Stage, N_Stage, M_Stage), prescription-level factors (Treatment_Intent, Number_of_Rxs, Dose_Per_Fraction, PTV_dose_Rx, Total_Fractions, Rx_Radiation_Type), plan/beam-level factors (Plan_Technique, Beam_Energy, Table_Angle, Gantry_Angle, Collimator_Angle, Wedge, SSD, Control_Points, Number_of_beams, Bolus), setup-level parameters (Couch_Vert, Couch_Lat, Couch_Long, Orientation, Tolerance_Table), and a series of Setup_Device nodes describing immobilization devices. Additionally, we have added a diagnostic node, Anatomic_tumor_loc, to allow for the selection of broad anatomic disease site categories (BRAIN, LIVER, ESOPHAGUS, etc.). The diagnostic parameters, prescription-level factors, plan/beam-level factors, and setup-level parameters are color coded as gray, blue, red, and green in Fig. 1, respectively. Compared to the simple model, the second-generation model covers more details of the prescription, plan/beam, and setup parameters. The terminology used in Fig. 1 corresponds to the classes in the ontology discussed in Kalet et al.,28 and detailed descriptions of the individual parameters are given in Table 1.
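To make the topology concrete, the sketch below encodes a small fragment of the Fig. 1 structure as a directed acyclic graph. This is a minimal illustration only: the study itself used the Bayes Server API in R, whereas the fragment here uses the open-source bnlearn package, and the particular edges shown are an assumed subset rather than the full 29-node, 40-edge network.

```r
# Minimal sketch (not the authors' Bayes Server implementation): a fragment of
# the Fig. 1 topology encoded as a DAG with bnlearn. Edges are an illustrative
# subset of the full 29-node/40-edge structure.
library(bnlearn)

frag <- model2network(paste0(
  "[Anatomic_tumor_loc]",
  "[Treatment_Intent|Anatomic_tumor_loc]",
  "[Plan_Technique|Anatomic_tumor_loc:Treatment_Intent]",
  "[Beam_Energy|Plan_Technique]",
  "[Dose_Per_Fraction|Treatment_Intent:Plan_Technique]"
))

nodes(frag)  # the five example variables
arcs(frag)   # the directed dependencies (edges)
```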

Figure 1.

Bayesian network representing 29 variables used in radiation therapy prescriptions and planning. Gray shaded nodes indicate diagnostic parameters, blue nodes are prescriptive, red nodes are classified as plan/beam nodes while green nodes represent setup parameters. [Color figure can be viewed at wileyonlinelibrary.com]

Table 1.

Description of nodes, node types, and examples of states found in our instance of the Mosaiq OIS database. Node names correspond to the nodes in the BN shown in Fig. 1. The number of available states is obtained from the 2014–2017 (4-yr) data

Node type Node name Description Available states Example states
Diagnostic Anatomic_tumor_loc Anatomic disease site 70 BLADDER, PELVIS…
T_Stage Primary tumor stage 52 T1b1, 1a…
N_Stage Regional node stage 38 N1a, 2…
M_Stage Distant metastasis stage 12 M0, M1a…
Prescription Treatment_Intent Treatment intent 12 Curative (adjuvant), Palliative…
Number_of_Rxs Number of prescriptions 14 2, 1…
Dose_Per_Fraction Dose per fraction 116 180, 1200…
PTV_dose_Rx Total dose 275 6660, 5040…
Total_Fractions Fractions 50 37, 5…
Rx_Radiation_Type Radiation type 19 x06, Electrons…
Plan/Beam Plan_Technique Planning technique 41 IMRT, AP/PA…
Table_Angle Table angle 155 0, 357…
Number_of_beams Number of beams 74 9, 2…
Wedge Wedge position 2 0, 1
Control_Points Control points 227 20, 239…
SSD Source to surface distance 96 92, 88…
Bolus Presence/type of bolus 91 1 cm neck, no bolus…
Gantry_Angle Gantry angle 350 350, 0…
Collimator_Angle Collimator angle 273 0, 30…
Beam_Energy Beam energy 10 6, 15…
Setup Orientation Patient scan orientation 9 Head In, Supine…
Couch_Lat Lateral couch position 60 1, 5…
Couch_Long Longitudinal couch position 102 74, 68…
Couch_Vert Vertical couch position 84 14, 0…
Tolerance_Table Setup tolerance table 21 UW IMRT, !Electron…
Setup_Device_1 Immobilization device 25 Shuttle, Breast Board…
Setup_Device_2 Immobilization device 26 Head Rest B, Custom Occipital…
Setup_Device_3 Immobilization device 21 Vacu‐Lock, Compression Plate…
Setup_Device_4 Immobilization device 20 NULL, Stent…

2.B. Network probabilities

From the set of concepts identified in the ontology, we built the node-edge framework of the BN in R with the Bayes Server API (Bayes Server Limited, Worthing, United Kingdom), computationally representing the topology in Fig. 1. This framework utilized relevant variables from an instance of our institutional Mosaiq (version 2.62) oncology information system (OIS) database (Elekta AB, Stockholm, Sweden). The Mosaiq OIS uses a relational database schema, which can be queried via structured query language to extract data. In this process, we generated a flat file (.csv) consisting of cases suitable for BN construction. Each case in the dataset was a single external beam and its associated parameters in the treatment plan. All individual beams in a plan, regardless of the purpose of the plan (e.g., a primary or boost plan), were treated separately and considered as unique cases in the dataset. We chose a single external beam, rather than the usual clinical patient case, as the basic unit of the model because we wanted to take beam orientation into account, which requires the data of each individual beam, such as gantry angle and beam energy, to be represented. The cases were obtained under an IRB-approved protocol at the University of Washington Medical Center, Seattle, WA, over the course of 8 yr (2010–2017).

Among the uniquely represented states in the data were 70 disease types (ICD-9/10 classified), 41 planning techniques, 19 radiation types, and 12 treatment intents. We processed the dataset by filtering out QA, warm-up, and testing beams. Because a beam with more than one control point is stored multiple times in the database, we kept only one of any set of beams that were identical except for the control point number. Missing entries were filled with the missing indicator "NULL". After processing, a total of 51 540 cases remained over the 8 yr (2010–2017).

The processed data were then divided according to the start date of treatment into different time frames traced back from 2017, and the networks trained on these datasets were used for evaluation. The list of parameters and the possible available states in the dataset are shown in Table 1. Example states are given in the right column of Table 1 to illustrate the types of mixed data values handled by the BN models.
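A minimal sketch of this preprocessing and windowing step is shown below, continuing in R. The column names (beam_type, plan_id, field_id, start_date) are hypothetical placeholders for fields exported from the OIS, not the actual Mosaiq schema.

```r
# Hypothetical preprocessing of the flat OIS export (column names are
# illustrative placeholders, not the Mosaiq schema).
library(dplyr)

beams <- read.csv("ois_export.csv", stringsAsFactors = FALSE)

cases <- beams %>%
  filter(!beam_type %in% c("QA", "WARMUP", "TEST")) %>%   # drop non-treatment beams
  group_by(plan_id, field_id) %>%                         # beams with >1 control point are
  slice(1) %>%                                            # stored once per control point,
  ungroup() %>%                                           # so keep a single row per beam
  mutate(across(everything(),
                ~ ifelse(is.na(.x) | .x == "", "NULL", .x)))  # explicit missing state

# Split into training windows by treatment start date, e.g. the 4-yr window 2014-2017
window_2014_2017 <- cases %>%
  filter(as.Date(start_date) >= as.Date("2014-01-01"),
         as.Date(start_date) <= as.Date("2017-12-31"))
```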

From these topologies and datasets, we used Bayes Server to machine learn the CPTs. The Bayes Server system employs an expectation maximization (EM) algorithm to learn the tables among all network concepts. The CPTs that power the underlying BN model scale with the product of the number of states in each set of connected nodes. The Dose_Per_Fraction node, for example, contains 97 440 unique probability entries. Although some node tables required over one million probability values, the time to learn the 29-parameter network was under 13 min in R using the roughly 24 000 cases in the 4-yr (2014–2017) dataset, which would be the approximate size of a dataset from a clinic treating 1200 patients per year for 4 yr. This work was performed with the goal of keeping the dataset size and processing at a level at which contemporary "big data" methods29 were not required, while also representing as closely as possible the conditions of practical clinical datasets.
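For readers without access to Bayes Server, the sketch below continues the bnlearn fragment and shows the corresponding parameter-learning step. Because missing entries were recoded to the explicit state "NULL", ordinary complete-data Bayesian estimation is used here as a stand-in for the EM learning performed in the study; this substitution is an assumption of the sketch, not the authors' implementation.

```r
# Sketch only: learn conditional probability tables (CPTs) for the example
# fragment. The study used Bayes Server's EM learner; here, with missing
# values recoded to the explicit state "NULL", complete-data Bayesian
# estimation is used instead.
train <- as.data.frame(window_2014_2017)[, nodes(frag)]
train[] <- lapply(train, as.factor)   # bnlearn expects factors for discrete nodes

fitted <- bn.fit(frag, data = train, method = "bayes", iss = 1)

# A node's CPT size scales as |states of node| x prod(|states of its parents|),
# which is why nodes such as Dose_Per_Fraction carry tens of thousands of entries.
dim(coef(fitted$Dose_Per_Fraction))
```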

2.C. Evaluation

The network was trained using the patient cases described above, except for a randomly selected subset of 500 patient cases that were withheld from training. A fraction (5.8%) of the test cases were altered by manually introducing a known set of potential errors designed for the most recent clinical practice, for example, wrong total dose, wrong fractional dose, and convolved errors such as a nonoptimal energy for a given site. The introduced errors were selected based on our departmental incident reporting system and the SAFRON system, an international voluntary incident reporting system implemented by the International Atomic Energy Agency, to ensure realistic types and rates of error.30, 31 These include major potential error pathways such as prescription errors.5 Five hundred cases in this context correspond to approximately 85 new patient plan checks at an average rate of six beams per plan.
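A minimal sketch of the error-injection step is given below, continuing the same R workflow. The specific columns, states, and the single error class shown are illustrative assumptions; in the study a broader set of error types drawn from incident learning systems was introduced.

```r
# Sketch: withhold 500 cases and corrupt ~5.8% of them with realistic,
# incident-report-inspired errors (values below are illustrative).
set.seed(42)
test_idx  <- sample(nrow(cases), 500)
test_set  <- cases[test_idx, ]
train_set <- cases[-test_idx, ]

n_err   <- round(0.058 * nrow(test_set))
err_idx <- sample(nrow(test_set), n_err)
test_set$has_error <- FALSE
test_set$has_error[err_idx] <- TRUE

# One example error class: wrong beam energy for a brain VMAT plan.
# (Other error classes would be applied to the remaining flagged rows.)
brain_vmat <- err_idx[test_set$Anatomic_tumor_loc[err_idx] == "BRAIN" &
                      test_set$Plan_Technique[err_idx]    == "VMAT"]
test_set$Beam_Energy[brain_vmat] <- "18"
```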

Table 2 outlines example tested error classes at the prescription, plan/beam, and setup levels along with the specific parameters. Compared to the previous study,28 the error set contained similar error classes, with an expansion of plan/beam-level and setup-level errors as we capitalized on the expanded scope of the second-generation model. Among the new error types, we included wrong technique for a given disease, as well as setup and orientation errors such as a wrong orientation and/or an improper immobilization device.

Table 2.

Classification and examples of error types tested for using the BN model

Prescription‐level errors

540 cGy total dose prescribed for 28‐fraction curative esophagus

18 MV modality prescribed for brain VMAT

Only 2 fractions prescribed for curative prostate boost

Plan/Beam‐level errors

180 gantry angle for VMAT prostate (treats through couch rails)

VMAT technique for pancreas planned with 0 collimator angle

Wrong SSD for 4-field box bladder plan (SSD set to 105)

Prescription for 6 MeV electron with beam energy selected at 15 MeV

18 MV energy selected for brain VMAT

Setup‐level errors

Electron tolerance table selected for SBRT Liver

Breast board setup device used for T‐Spine plan

Headrest setup device selected for prostate case

To measure the performance of the network, we instantiated the diagnostic parameter values (T_Stage, N_Stage, M_Stage, Anatomic_tumor_loc) and the prescription parameter value (Treatment_Intent) for each test case, then propagated the network probabilities to the remaining downstream nodes. This mimics the plan and chart review process, which usually rests on the assumption that the staging and intent are correct. We then evaluated each remaining case parameter's probability value P against a threshold value T that designated whether that parameter should be flagged as correct or in error. The error threshold was incremented in steps of 0.001 from 0 to 1 for each iteration of network testing to identify true positives/negatives as well as false positives/negatives. A receiver operating characteristic (ROC) curve was generated for each network model.32
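The evaluation loop can be sketched as follows, again continuing the bnlearn/gRain example rather than the Bayes Server implementation used in the study. The evidence nodes, queried node, and threshold sweep mirror the procedure described above; the helper function and object names are illustrative.

```r
# Sketch: instantiate diagnostic/intent nodes as evidence, query the posterior
# probability of each planned downstream value, and flag values whose
# probability falls below a threshold. Sweeping the threshold yields an ROC curve.
library(gRain)

jt <- compile(as.grain(fitted))   # junction tree for exact inference
evidence_nodes <- c("Anatomic_tumor_loc", "Treatment_Intent")

posterior_of_observed <- function(row, node) {
  jt_ev <- setEvidence(jt, nodes = evidence_nodes,
                       states = unlist(row[evidence_nodes]))
  post  <- querygrain(jt_ev, nodes = node)[[node]]
  unname(post[row[[node]]])       # probability of the value actually planned
}

# Posterior probability of the planned beam energy for every test case
p_energy <- vapply(seq_len(nrow(test_set)),
                   function(i) posterior_of_observed(test_set[i, ], "Beam_Energy"),
                   numeric(1))

# Sweep the flagging threshold to build an ROC curve against the known labels
thresholds <- seq(0, 1, by = 0.001)
roc <- t(sapply(thresholds, function(thr) {
  flagged <- p_energy < thr
  c(tpr = mean(flagged[test_set$has_error]),    # true positive rate
    fpr = mean(flagged[!test_set$has_error]))   # false positive rate
}))
```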

To understand the effect of practice changes over time on the network's error detection performance, we built the network repeatedly with multiple sets of data, each containing data from a different time frame within 2010–2017. We then ran the performance tests described above with the set of error cases developed for the 2014–2017 time frame. The performance of each of the separate networks was used to determine the frequency with which the network should be updated as more data accumulate. Similarly, we measured the networks' performance relative to the category of error.

3. Results

3.A. Network performance on different time frames

We measured the performance of networks trained on datasets of 2-yr, 3-yr, 4-yr, 5-yr, 6-yr, and 7-yr time frames traced back from 2017. The numbers of cases extracted from Mosaiq were 11 891, 18 365, 24 483, 31 276, 38 577, and 45 585, respectively, after processing. The resulting ROC curves for the error sets are shown in Fig. 2. The AUC values for the 2-yr, 3-yr, 4-yr, 5-yr, 6-yr, and 7-yr time frames were 0.82, 0.85, 0.89, 0.88, 0.87, and 0.87, respectively. The discerning ability of the 29-node BN model was well above random guessing (AUC of 0.5) regardless of the time frame. The network performance was similar for the networks trained on the 4-yr (2014–2017) and 5-yr (2013–2017) datasets, showing that the extra data from 2013 did not improve the discerning ability of the network. Moreover, we observed a drop in performance for the networks trained on the 2-yr (2016–2017) and 3-yr (2015–2017) datasets.

Figure 2.

ROC curves for the BN model trained by 2 yrs, 3 yrs, 4 yrs, 5 yrs, 6 yrs, and 7 yrs of data traced back from 2017. [Color figure can be viewed at wileyonlinelibrary.com]

Other than the time frame, the amount of data used to train the probabilistic tables of a Bayesian network is also crucial to its performance. As the size of the available dataset is clinic dependent, we studied the impact on network performance of training datasets with fewer data. We reduced the 2014–2017 data (24 483 cases) via random sampling into three smaller datasets with 12 242, 8161, and 4897 cases, corresponding to approximately 50, 33, and 20 patients a month, to mimic the amount of data collected in smaller clinics. We rebuilt and evaluated the network with the undersampled datasets, and the resulting ROC curves are shown in Fig. 3 along with that of the original 2014–2017 network. The ROC curves of the first two cases mostly overlapped, showing that a decrease in the number of cases from 24 483 to 12 242 did not worsen the performance of the network. However, when the number of training cases decreased further to 8161, the AUC dropped by around 4% to 0.85. This showed that a dataset of around 10 000 cases over 4 yr would be sufficient to create a reliable network from our institutional data, and that, although a smaller dataset worsened the network performance, the discerning ability of the network was still acceptable. Note that although the 2015–2017 dataset had more cases (18 365) than the 12 242-case sample, it performed worse than the trimmed 2014–2017 data, which shows that it is more important to include a sufficient span of historical clinical data capturing a wider range of clinical practices than simply to include more recent cases.
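The undersampling used to mimic smaller clinics amounts to a simple random draw from the 4-yr training data, as sketched below (case counts taken from the text; refitting and evaluation then proceed exactly as before).

```r
# Sketch: randomly subsample the 4-yr (2014-2017) training data to mimic
# clinics treating roughly 50, 33, and 20 patients per month.
set.seed(1)
subsets <- lapply(c(12242, 8161, 4897), function(n) {
  window_2014_2017[sample(nrow(window_2014_2017), n), ]
})
# Each subset is then used to refit the CPTs (bn.fit) and rerun the ROC evaluation.
```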

Figure 3.

ROC curves for the BNs trained by randomly sampled 2014–2017 datasets with different numbers of cases. [Color figure can be viewed at wileyonlinelibrary.com]

3.B. Network update

As clinical practice evolves over time, the BN model can adapt through updated datasets and CPT retraining. However, frequent updating is only justified if network performance is improved. To understand the effect of an outdated network on error detection performance, we evaluated the performance of BNs trained with data obtained in different time frames. Figure 4 shows the ROC curves of five 4-yr windows within 2010–2017, and Table 3 shows the AUCs of 4-yr windows with half-year updates. Small differences were observed between the ROCs, and the AUC values are 0.89, 0.86, 0.83, 0.79, and 0.78 for 2014–2017 (black), 2013–2016 (red), 2012–2015 (blue), 2011–2014 (green), and 2010–2013 (purple), respectively. As mentioned, the 4-yr data showed a progressive decrease in AUC at a rate of around 3% per year. This shows that the network's discerning ability for errors designed for the most recent clinical practice was reduced when the CPTs were not relearned using data more closely aligned with the current time frame, and that a yearly update of the dataset to retrain the network was sufficient to adapt to the latest practice and retain network performance. The remaining AUCs for the 2-yr, 3-yr, and 5-yr data are shown in Table 4. Most of the data showed consistently decreasing AUCs as older epochs were considered, implying a decrease in discerning ability for errors over time. The number of cases was similar between different time frames of the same length, suggesting that the changes in AUC were caused by the inability of the old data to reflect evolved clinical practice rather than by changes in the number of training cases.

Figure 4.

ROC curves for the BN model trained by datasets from different 4-yr windows. [Color figure can be viewed at wileyonlinelibrary.com]

Table 3.

Area under curve (AUC) of receiver operating characteristic (ROC) curve for the BN model trained by datasets from different 4‐yr windows, with a half year step. Periods in bold are the AUC of the ROC curves presented in Fig. 4

Time frame AUC
01/2014–12/2017 0.89
07/2013–06/2017 0.88
01/2013–12/2016 0.86
07/2012–06/2016 0.84
01/2012–12/2015 0.83
07/2011–06/2015 0.81
01/2011–12/2014 0.79
07/2010–06/2014 0.79
01/2010–12/2013 0.78

Table 4.

Area under curve (AUC) of receiver operating characteristic (ROC) curve of each network model trained on datasets with different time frames. We also show the number of cases in each time frame after filtering as described in Section 2.B

Time frame Years Number of cases AUC
2 yrs 2016–2017 11 891 0.82
2015–2016 12 360 0.81
2014–2015 12 592 0.81
2013–2014 12 911 0.79
2012–2013 14 094 0.79
2011–2012 14 309 0.79
2010–2011 12 963 0.78
3 yrs 2015–2017 18 365 0.85
2014–2016 18 478 0.81
2013–2015 19 385 0.85
2012–2014 20 212 0.79
2011–2013 21 102 0.78
2010–2012 20 264 0.79
4 yrs 2014–2017 24 483 0.89
2013–2016 25 271 0.86
2012–2015 26 686 0.83
2011–2014 27 220 0.79
2010–2013 27 057 0.78
5 yrs 2013–2017 31 276 0.88
2012–2016 32 572 0.84
2011–2015 33 694 0.81
2010–2014 33 175 0.79

On average, the AUCs dropped by 0.8%, 1.38%, 3.2%, and 3.6% per year going to older epochs for the networks trained with 2-yr, 3-yr, 4-yr, and 5-yr datasets, respectively. The results showed that the error detection ability of a network trained on a longer time frame degrades faster, because longer time frames contain older data that need to be removed or updated to maintain network performance.

3.C. Network performance on different classes of errors

To present details about the networks' discerning ability for particular classes of errors, we re‐evaluated the network using subsets of three specific error classes (prescription errors, plan/beam errors, and setup errors) to examine the performance of the model under various conditions. A list of examples of the three error types is shown in Table 2. Separation between the ROC curves (and associated AUC values) would demonstrate differential performance among types of errors examined.

The performance of networks trained on the 2014–2017 (new) and 2010–2013 (old) datasets, tested on erroneous cases designed for the most recent clinical practice with the full error set, prescription errors only, plan/beam errors only, and setup errors only, is shown in Table 5. The AUC values were 0.89, 0.92, 0.86, and 0.95 for the 2014–2017 network, and 0.78, 0.84, 0.73, and 0.93 for the 2010–2013 network, on the overall, prescription, plan/beam, and setup error sets, respectively. Only minor differences were observed in AUC between the plan/beam error class, the prescription error class, and the overall error detection performance of the model for the 2014–2017 time frame, suggesting that the error detection performance of the network was similarly efficient among all error types examined. The 2010–2013 network showed a larger difference between performances for different error types. Its performance was lower in all categories compared to the new network, and the discerning ability of the model degraded at a faster rate for plan/beam errors than for prescription errors, suggesting that the plan/beam parameters have varied the most in the last few years.

Table 5.

Comparison of AUCs for BN models trained with the 2014–2017 (new) and 2010–2013 (old) datasets, tested on erroneous cases designed for the most recent clinical practice. The numbers in the table are the AUCs of ROC curves evaluated with all error types, prescription errors, beam/plan errors, and setup errors

Time‐frame All error types Prescription Beam/Plan Setup
2014–2017 0.89 0.92 0.86 0.95
2010–2013 0.78 0.84 0.73 0.93

4. Discussion

In this work, we have presented a study of an expanded BN for automating the task of error detection and evaluated the network's performance using a set of test cases with manually introduced errors, with the purpose of understanding the impact of training dataset period and update frequency on network performance in order to provide recommendations for clinical implementation. Evaluations showed that the model's performance compares well to both previous models and human subjects while significantly expanding the scope (to 29 parameters) beyond the earlier proof-of-principle (9-parameter) network.28 A smaller amount of data (~50% of the cases in the 4-yr data) was shown to be sufficient to train a reliable network when it spans an extensive period. Comparison of the networks trained with data from different epochs suggested that a 4-yr training dataset combined with a yearly update is sufficient to build a reliable error detection network and to capture the changes in local clinical practice over time. The model performance presented here is especially robust considering that the BN in this manuscript was not trained, weighted, or otherwise optimized specifically for the task of detecting any particular class of error or anomaly; our results showed that the error detection ability of the BN for prescription, plan/beam, and setup errors was similarly efficient.

The BN topology was expanded with concept extraction from a knowledge-based ontology of causal relationships. Using a knowledge-based ontology to create the BN removes the possibility of inferring causality from correlation, which ML can introduce when topologies are learned strictly algorithmically. For example, although Stojadinovic et al. successfully produced a set of networks to improve survival prediction in colon cancer patients by applying ML algorithms to a large dataset,33 the resulting topologies have node–arc–node relations with causal directions opposite to one another. The ontology-based approach avoids competing topological structures by canonizing dependency relations and acts as our standard. Another distinct advantage of reusing an established structure is the removal of any need for significant computing power to learn topology. As more and more data are generated and made available by OISs, the computing requirements for ML structure building can become prohibitive.

Evaluation of the performance of the BN showed that the network can identify both atypical and erroneous plan parameters, such as those that might result from either a planning error or simply the introduction of new methods through a change in practice or technology. Although some errors in the test set can be considered always wrong, many do not represent a strictly incorrect parameter; rather, they indicate suboptimal or special cases which may be acceptable but, due to rarity or other conditional factors, should be flagged for further inspection by experts. Clinically, plan parameters that are wrong, suboptimal, or special are identified by calculating the conditional probability of each parameter given the others. A parameter is flagged when its probability falls outside some range, and a report listing all flags is generated and displayed to physicists. Manual verification and approval of all flagged items have to be performed before the plan review process is complete. For instance, we recreated a case in which a patient was simulated for treatment planning using a slant board as a setup device for a thoracic spine lesion. The use of a slant board for patient comfort and tissue extension is not rare in our clinic, as it is quite a common device for breast immobilization, and its use is not itself an error; however, its unexpected use for the thoracic spine was unusual and led to incomplete contouring and density overrides of the slant board during planning, resulting in a posterior beam with a potential dose difference of up to 40%. The network showed the ability to flag this kind of "inappropriateness" and other rare, conditionally related erroneous values in these types of cases as a warning to plan checkers.

The performance of BNs trained on datasets from different epochs showed that the 4-yr and 5-yr networks outperform the 2-yr and 3-yr networks. This suggests that although a network performs better with more historical data, the performance improvement saturates and adding extra historical data provides no further benefit, as indicated by the lower AUCs of the 6-yr and 7-yr networks. Older clinical data may contain outdated clinical practices that are no longer in common use, and the inclusion of these data reduced the network performance. Better and more complete documentation in Mosaiq could also contribute to the performance improvement of networks trained on newer clinical data. From our observation, older clinical datasets contained more missing values in TNM staging and treatment intent, while the newer datasets contained more missing tumor location data. Other data are mostly complete in both old and new datasets because most of these variables are necessary for the completion of treatment. Moreover, increased standardization could potentially lead to better network performance; however, no difference in input standards was observed between the old and new datasets, as no new standard was established at our institution in recent years. Another factor that could contribute to the degradation of older clinical data is the introduction of an institutional SBRT program in 2010–2013 and an increase in 800 cGy palliative treatments between 2010 and 2014,34 both of which became common clinical practice afterward.

The BN model also demonstrated the ability to adapt to clinical practice changes easily by updating the training dataset and retraining the network. The 3% per year drop in AUC performance of the BN models built from 4-yr data (Fig. 4 and Table 3) shows that, for our institutional data, the model loses considerable detection ability every year and that this loss is recoverable with a yearly updated training dataset. Table 3 shows that updating the BN model more frequently, say every half year, could be beneficial, but the difference in performance would be insignificant and may not be worth the effort in clinical practice. The decrease in the detection ability of the BN is the result of practice changes, mainly in plan/beam parameters, as the comparison of detection performance on different error classes in different 4-yr periods (2010–2013 vs 2014–2017) showed that the largest component of the overall reduction in AUC was from the plan/beam error type. The decrease in the discerning ability of the BN model over time and the ability to maintain performance by updating the data suggest that dynamic clinical practices in radiation oncology can be handled with ML models and continuously updated data.35 Note that as our institution has a full range of treatment modalities, including photon, electron, neutron, and proton, a variety of treatment techniques including IMRT, VMAT, TBI, SRS, and SBRT, and treats almost every site, we consider our results to approximate a worst-case scenario. For clinics with less variety of treatment sites and techniques and fewer changes in clinical practice, a shorter epoch and a less frequent update could build an equally effective BN model.

One of the main aims of this work is to better understand how this method could be generalized and used in clinics other than the one in which it was developed. Thus, our studies of the amount of data needed, the time frames that gave the best performance, and how the system performed as clinical practice evolved were geared to this issue, and they have built a foundation for future testing and implementation of the BN in other clinics. Although there are different possible scenarios, one that seems feasible is that a clinic's system could (a) be trained from internal data only, if the dataset is large enough, or (b) be initialized with CPTs from a generalized pool of data. Particularly in the latter case, updates to the system would be calculated as soon as practical. In both cases, we envision an annual update to the CPTs to incorporate any changes in clinical practice.

While this work describes progress toward clinical implementation by suggesting a training dataset period and update frequency, there are several areas that would benefit from additional research. First, tests need to be performed to quantify the extent to which the system developed at this single institution translates to other clinical practices and the ease with which our tools can be used to tailor it to local practices and preferences. Interinstitutional testing could also give insight into the scope of data quality limitations, and a community-wide set of probabilities could be commissioned, thereby providing a reliable standard against which to judge one's practice. Moreover, we note that when we mimicked small clinics in our study, we were working with a worst-case scenario in which the training dataset contained a small amount of data describing a large variety of treatment techniques and disease sites. A real small clinic has a small amount of data and a narrower range of disease sites, which should lead to better network performance due to the more homogeneous CPTs. This hypothesis will be further tested in interinstitutional testing with data from smaller clinics.

Secondly, further investigation of handling missing data using prior probability tables would be beneficial. To a probabilistic model, a chart with missing data elements does not automatically imply an error, nor does it necessitate obtaining the element before treatment planning happens or QA/validation proceeds. BN prior probability tables can account for missing information by making assumptions and continuing to operate. Thirdly, one ought to consider that in the current effort the models were tested with manually introduced errors based on our institutional expertise and experience with real-world incidents and incident learning systems. One challenge in extracting cases of "real" errors is that by the time the patient case is established in the database, the error has been found and fixed and no longer exists in the data; where it has not been fixed, it is not possible to know without retrospectively analyzing the entire database. Curation and development of realistic error cases and near-miss incidents will be an ongoing effort as new technologies and modes of failure appear clinically. Future work should thus also consider whether the BN model used in conjunction with a simple rule-based approach could provide a double-filter effect, broadening the detection space, to improve patient safety.36 Further study of the complementary effect of the BN and simple rule-based algorithms could reduce the complexity of the network and improve error detection performance.

Finally, it is important to continue working on the radiation oncology ontology to inform the construction, modification, and growth of our network. We feel that a robust ontology, along with associated software37 and adoption of data standards,38, 39 can greatly reduce the work needed to update our BN while at the same time ensuring that the resulting models match our understanding of the fundamental processes of cancer and radiation therapy.

5. Conclusion

The expanded probabilistic (Bayesian) network presented in this manuscript demonstrates the ability to detect a variety of error classes common in radiotherapy planning data and suggests the amount of data needed for training and the required frequency of network retraining. The performance of the 29-node network shows that a role exists for probabilistic algorithms in the kinds of QA tasks that help maintain the high quality of care and patient safety expected in rapidly evolving clinical environments.

Conflict of interest

The authors have no conflicts of interest to disclose.

Acknowledgments

Research reported in this publication was supported by the National Institutes of Health under award number 1R41CA217452. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

1. Lester JP, Brian O, Jordi G, et al. Critical impact of radiotherapy protocol compliance and quality in the treatment of advanced head and neck cancer: results from TROG 02.02. J Clin Oncol. 2010;28:2996–3001.
2. Abrams RA, Winter KA, Regine WF, et al. Failure to adhere to protocol specified radiation therapy guidelines was associated with decreased survival in RTOG 9704, a phase III trial of adjuvant chemotherapy and chemoradiotherapy for patients with resected adenocarcinoma of the pancreas. Int J Radiat Oncol Biol Phys. 2012;82:809–816.
3. Ohri N, Shen X, Dicker AP, Doyle LA, Harrison AS, Showalter TN. Radiotherapy protocol deviations and clinical outcomes: a meta-analysis of cooperative group clinical trials. J Natl Cancer Inst. 2013;105:387–393.
4. Fairchild A, Straube W, Laurie F, Followill D. Does quality of radiation therapy predict outcomes of multicenter cooperative group trials? A literature review. Int J Radiat Oncol Biol Phys. 2013;87:246–260.
5. Ezzell G, Chera B, Dicker A, et al. Common error pathways seen in the RO-ILS data that demonstrate opportunities for improving treatment safety. Pract Radiat Oncol. 2018;8:123–132.
6. Ford EC, Gaudette R, Myers L, et al. Evaluation of safety in a radiation oncology setting using failure mode and effects analysis. Int J Radiat Oncol Biol Phys. 2009;74:852–858.
7. Azmandian F, Kaeli D, Dy JG, et al. Towards the development of an error checker for radiotherapy treatment plans: a preliminary study. Phys Med Biol. 2007;52:6511.
8. Furhang E, Dolan J, Sillanpaa J, Harrison L. Automating the initial physics chart checking process. J Appl Clin Med Phys. 2009;10:129.
9. Alfredo SR, Pennington E, Waldron T, Bayouth J. Radiation therapy plan checks in a paperless clinic. J Appl Clin Med Phys. 2009;10:43–62.
10. Yang D, Moore KL. Automated radiotherapy treatment plan integrity verification. Med Phys. 2012;39:1542–1551.
11. Sun B, Rangaraj D, Palaniswaamy G, et al. Initial experience with TrueBeam trajectory log files for radiation therapy delivery verification. Pract Radiat Oncol. 2013;3:e199–e208.
12. Moore KL, Kagadis GC, McNutt TR, Moiseenko V, Mutic S. Vision 20/20: automation and advanced computing in clinical radiation oncology. Med Phys. 2014;41:010901.
13. Xia J, Mart C, Bayouth J. A computer aided treatment event recognition system in radiation therapy. Med Phys. 2014;41:011713.
14. Dewhurst JM, Lowe M, Hardy MJ, Boylan CJ, Whitehurst P, Rowbottom CG. AutoLock: a semiautomated system for radiotherapy treatment plan quality control. J Appl Clin Med Phys. 2015;16:339–350.
15. Hadley SW, Kessler ML, Litzenberg DW, et al. SafetyNet: streamlining and automating QA in radiotherapy. J Appl Clin Med Phys. 2016;17:387–395.
16. Holdsworth C, Kukluk J, Molodowitch C, et al. Computerized system for safety verification of external beam radiation therapy planning. Int J Radiat Oncol Biol Phys. 2017;98:691–698.
17. Lack D, Liang J, Benedetti L, Knill C, Yan D. Early detection of potential errors during patient treatment planning. J Appl Clin Med Phys. 2018;19:724–732.
18. Halabi T, Hsiao-Ming L. Automating checks of plan check automation. J Appl Clin Med Phys. 2014;15:392–398.
19. Damato AL, Devlin PM, Bhagwat MS, et al. Independent brachytherapy plan verification software: improving efficacy and efficiency. Radiother Oncol. 2014;113:420–424.
20. Li HH, Wu Y, Yang D, Mutic S. Software tool for physics chart checks. Pract Radiat Oncol. 2014;4:e217–e225.
21. Brown WE, Sung K, Aleman DM, Moreno-Centeno E, Purdie TG, McIntosh CJ. Guided undersampling classification for automated radiation therapy quality assurance of prostate cancer treatment. Med Phys. 2018;45:1306–1316.
22. Covington EL, Chen X, Younge KC, et al. Improving treatment plan evaluation with automation. J Appl Clin Med Phys. 2016;17:16–31.
23. Jensen FV. Bayesian Networks and Decision Graphs. Berlin, Germany: Springer-Verlag; 2001.
24. Wengraf T. Qualitative Research Interviewing: Biographic Narrative and Semi-Structured Methods. Newcastle upon Tyne, UK: Sage; 2001.
25. Laskey KB, Mahoney SM. Network engineering for agile belief network models. IEEE Trans Knowl Data Eng. 2000;12:487–498.
26. Kalet AM. Bayesian Networks From Ontological Formalisms in Radiation Oncology [PhD thesis]. University of Washington; 2015.
27. Kalet AM, Doctor JN, Gennari JH, Phillips MH. Developing Bayesian networks from a dependency-layered ontology: a proof-of-concept in radiation oncology. Med Phys. 2017;44:4350.
28. Kalet AM, Gennari JH, Ford EC, Phillips MH. Bayesian network models for error detection in radiotherapy plans. Phys Med Biol. 2015;60:2735.
29. Jeffrey C, Brian D, Mark D, Joseph MH, Caleb W. MAD skills: new analysis practices for big data. Proceedings of the VLDB Endowment. 2009;2:1481–1492.
30. Gopan O, Zeng J, Novak A, Nyflot M, Ford E. The effectiveness of pretreatment physics plan review for detecting errors in radiation therapy. Med Phys. 2016;43:5181–5187.
31. Novak A, Nyflot M, Sponseller P, et al. Improving patient safety through identification of origination points of serious errors in a near-miss incident learning system. Int J Radiat Oncol Biol Phys. 2014;90:S130.
32. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27:861–874.
33. Stojadinovic A, Bilchik A, Smith D, et al. Clinical decision support and individualized prediction of survival in colon cancer: Bayesian belief network model. Ann Surg Oncol. 2013;20:161–174.
34. Kalet AM, Luk SMH, Phillips MH. Quality assurance tasks and tools: the many roles of machine learning. Med Phys. 2019. doi: 10.1002/mp.13445.
35. Nakatsugawa M, Cheng Z, Kiess A, et al. The needs and benefits of continuous model updates on the accuracy of RT-induced toxicity prediction models within a learning health system. Int J Radiat Oncol Biol Phys. 2019;103:460–467.
36. Bojechko C, Phillips M, Kalet A, Ford EC. A quantification of the effectiveness of EPID dosimetry and software-based plan verification systems in detecting incidents in radiotherapy. Med Phys. 2015;42:5363–5369.
37. Dean M, Schreiber G, Bechhofer S, et al. OWL Web Ontology Language Reference. W3C Recommendation; February 10, 2004.
38. Evans SB, Fraass BA, Berner P, et al. Standardizing dose prescriptions: an ASTRO white paper. Pract Radiat Oncol. 2016;6:e369–e381.
39. Phillips M, Halasz L. Radiation oncology needs to adopt a comprehensive standard for data transfer: the case for HL7 FHIR. Int J Radiat Oncol Biol Phys. 2017;99:1073–1075.
