Advances in Radiation Oncology. 2024 Nov 13;10(2):101675. doi: 10.1016/j.adro.2024.101675

Performance Comparison of 10 State-of-the-Art Machine Learning Algorithms for Outcome Prediction Modeling of Radiation-Induced Toxicity

Ramon M Salazar a, Saurabh S Nair a, Alexandra O Leone a, Ting Xu b, Raymond P Mumme a, Jack D Duryea a, Brian De b, Kelsey L Corrigan b, Michael K Rooney b, Matthew S Ning b, Prajnan Das b, Emma B Holliday b, Zhongxing Liao b, Laurence E Court a, Joshua S Niedzielski a
PMCID: PMC11665468  PMID: 39717195

Abstract

Purpose

To evaluate the efficacy of prominent machine learning algorithms in predicting normal tissue complication probability using clinical data obtained from 2 distinct disease sites and to create a software tool that facilitates the automatic determination of the optimal algorithm to model any given labeled data set.

Methods and Materials

We obtained 3 sets of radiation toxicity data (478 patients) from our clinic: gastrointestinal toxicity, radiation pneumonitis, and radiation esophagitis. These data comprised clinicopathological and dosimetric information for patients diagnosed with non-small cell lung cancer and anal squamous cell carcinoma. Each data set was modeled using 11 commonly employed machine learning algorithms (elastic net, least absolute shrinkage and selection operator [LASSO], random forest, random forest regression, support vector machine, extreme gradient boosting, light gradient boosting machine, k-nearest neighbors, neural network, Bayesian-LASSO, and Bayesian neural network) by randomly dividing the data set into a training and test set. The training set was used to create and tune the model, and the test set served to assess it by calculating performance metrics. This process was repeated 100 times by each algorithm for each data set. Figures were generated to visually compare the performance of the algorithms. A graphical user interface was developed to automate this whole process.

Results

LASSO achieved the highest area under the precision-recall curve (0.807 ± 0.067) for radiation esophagitis, random forest for gastrointestinal toxicity (0.726 ± 0.096), and the neural network for radiation pneumonitis (0.878 ± 0.060). The area under the curve was 0.754 ± 0.069, 0.889 ± 0.043, and 0.905 ± 0.045, respectively. The graphical user interface was used to compare all algorithms for each data set automatically. When averaging the area under the precision-recall curve across all toxicities, Bayesian-LASSO was the best model.

Conclusions

Our results show that there is no best algorithm for all data sets. Therefore, it is important to compare multiple algorithms when training an outcome prediction model on a new data set. The graphical user interface created for this study automatically compares the performance of these 11 algorithms for any data set.

Introduction

In radiation oncology, the development of outcome prediction models (OPMs) offers an objective approach to personalized cancer treatment.1 These models, based on machine learning (ML), provide a data-driven understanding of patient-specific responses to radiation therapy (RT), thereby enabling clinicians to anticipate and mitigate potential toxicities more accurately.2, 3, 4 This enhances the precision of therapeutic interventions and may significantly improve patient outcomes by preemptively addressing the likelihood of specific toxicities.

OPMs of radiation pneumonitis (RP) from RT of non-small cell lung cancer (NSCLC) can identify patients with a high risk of developing this toxicity based on their treatment plan, which provides information on the likelihood of lung toxicity when adjusting radiation dose.5 Similar models may provide predictions for the likelihood of acute radiation esophagitis (RE), a dose-limiting toxicity that can greatly reduce patient quality of life.6 This may inform potential early plan adaptation strategies to minimize the probability of RE.7 For pelvic cancers, OPMs are valuable tools in assessing the risk of gastrointestinal toxicities (GITs), which may help guide the choice of radiation techniques to minimize damage to the gastrointestinal tract.8 These are some examples of how OPMs can form part of a clinical decision support system, enhancing patient care by creating opportunities to reduce the severity of toxicities.

The advancement of OPMs, however, faces some challenges. One of them is the lack of comparative analyses of different ML algorithms tailored for radiation-induced toxicity prediction. The landscape of predictive models is greatly varied, with new types of algorithms constantly being introduced, and there exists no universally acknowledged best algorithm that suits all data sets.9 Given this diversity, researchers must rely on their personal experience, ease of implementation, or prevalence in the literature to select an algorithm. Yet, most existing studies focus on a single algorithm for a specific disease site, limiting our understanding of how different models perform across various types of toxicities.

Another critical issue is the interpretability of OPMs. Many advanced algorithms lack transparency in how they generate their predictions. In a field where clinical decision-making is highly nuanced and patient-specific, the inability to interpret and understand the rationale behind model-based predictions is a significant drawback. This lack of transparency can impede the trust and adoption of OPMs by radiation oncology professionals.10

Previous work by Deist et al11 has shown that there is no algorithm type that is superior in performance across all data sets. This highlights the need for a comparative multialgorithm approach to outcome prediction. A free, open-source package that automatically conducts these analyses can address this need while also enabling a wider range of health care institutions to benefit from cutting-edge ML technologies. Such a package could foster collaborative improvement and validation of the algorithms, as the global medical physics community could contribute to its refinement and customization.12

While some free, open-source packages offering automatic multialgorithm comparisons do exist, they are designed as general-purpose tools for virtually any ML task.13 However, it has been shown that while these tools may show superior performance in some domains, they may underperform in others.14 Hence, the development of a specialized package for multialgorithm analysis tailored for RT is essential.

In our study, we develop a graphical user interface (GUI) using the open-source R package caret (version 6.0.90) to facilitate and automate the comparison of multiple models for any RT data set.15,16 We use this GUI to compare the efficacy of different OPMs in predicting RT outcomes, and this serves as a test of its capabilities and functionality as an automatic ML tool. We seek to evaluate a variety of state-of-the-art algorithms across 3 radiation-induced toxicities to investigate if their performance is data set-dependent. The GUI provides mathematically robust interpretations for each model and a systematic way to select the best model for each application. This analysis is essential for researchers seeking a rigorous method of model selection. The code for this GUI will be made available on publication of this study.

Methods and Materials

Patient characteristics

We obtained a waiver of consent and approval from the institutional review board for this retrospective study. Three toxicity data sets (478 patients in total) were collected from our institution. One data set involved 246 squamous cell carcinoma of the anus (SCCA) patients with different grades of acute GIT (Table E1).17 The second data set consisted of 232 NSCLC patients, where the outcome of interest was RP (Table E2). The third data set contained the same 232 NSCLC patients, but the outcome of interest was RE (Table E3).

The 246 patients with SCCA were part of a retrospective study.18 They were treated from 2003 to 2019 with definitive intensity modulated RT (IMRT) or volumetric modulated arc therapy (VMAT) based chemoRT at our institution. All of them were treated in the head-first to gantry, supine position, and had at least 12 months of clinical follow-up. They all received conventionally fractionated radiation using a simultaneous integrated boost technique; most patients received 2 Gy per fraction to the primary tumor to a total dose of 50 to 58 Gy and 1.6 to 1.7 Gy per fraction to the elective nodal volume to a total dose of 43 to 47 Gy. Additionally, 236 patients underwent concurrent chemotherapy, 6 received sequential chemotherapy, and 4 received no chemotherapy. Their inclusion criteria were no previous history of pelvic RT, and their computed tomography image was used for treatment planning to encompass the entire bowel bag.

The 232 patients with NSCLC were part of a clinical trial.19 They were treated between 2006 and 2019 at our institution with either IMRT/VMAT (n = 116), passive-scatter proton therapy (n = 71), or intensity modulated proton therapy (n = 45). The IMRT/VMAT treatment plans were designed using Pinnacle (Philips Healthcare), and proton treatments were planned using Eclipse (Varian Medical Systems). All patients received fractionated radiation consisting of doses between 1.8 and 3.0 Gy per fraction. The total dose ranged from 60 to 74 Gy, and patients were treated 5 times per week. All patients were treated with either induction, concurrent, or adjuvant chemotherapy in addition to RT. Their inclusion criteria were an absence of acute lung infection, no prior thoracic surgery, at least 12 months of clinical follow-up, and no previous history of thoracic RT.

Toxicity evaluation

GITs were graded by the treating physician during weekly on-treatment visits and follow-up visits according to the National Cancer Institute Common Terminology Criteria for Adverse Events (CTCAE) version 5.0 and included abdominal pain, colitis, colonic disorders, constipation, diarrhea, enterocolitis, fecal incontinence, lower GI hemorrhage, nausea, malabsorption, small intestine disorders, and vomiting.20 All these toxicities were reported to be grade 0 for all patients at the start of treatment. The highest grade on any of the listed disorders within 3 to 6 months of treatment was recorded as the toxicity outcome for a patient. The outcomes were then divided into 2 classes, with one class being a toxicity less than grade 3 and the other being a toxicity greater than or equal to grade 3. The percentage of patients who developed a grade 3 or higher acute GIT was 9.5% (n = 28).

RP was graded from 0 to 5 according to the CTCAE v5.0.20 The approximate time for RP to develop in a patient after RT is between 1 and 6 months.21 The highest grade within 2 years of treatment was recorded as the toxicity outcome for that patient for this study. A patient with an RP grade ≥ 2 was considered symptomatic for RP, and hence, the outcomes were separated into 2 classes, with one being RP grade ≥ 2 and the other being an RP grade < 2. The percentage of patients who developed a grade 2 or higher for RP was 29.3% (n = 68). RP grade ≥ 2 was selected as a significant clinical endpoint to facilitate direct comparisons with existing studies.

Similarly, RE was also graded from 0 to 5 according to CTCAE v5.0.20 Although RE typically peaked during treatment, the highest grade within 2 years of treatment was recorded as the toxicity outcome for that patient for this study. A patient with an RE grade ≥ 2 was considered symptomatic for RE, and hence, patients with an RE grade ≥ 2 were classed separately from those with an RE grade < 2. The percentage of patients who developed a grade 2 or higher for RE was 60% (n = 139). RE grade ≥ 2 was selected as a significant clinical endpoint to facilitate direct comparisons with existing studies.

Note that the incidence of adverse events for the endpoints of interest was 9.5%, 29.3%, and 60.0% for GIT, RP, and RE, respectively. These ratios were not artificially altered to achieve balanced data sets. They follow the natural incidence obtained from the raw data as they were recorded in the clinical trials.

The definitions for GIT, RE, and RP toxicity grades have remained consistent between the current CTCAE version and the version at the time of treatment. Furthermore, the clinical records of each patient were reviewed to determine if a reassessment of their toxicity grade was necessary.

Dose-volume histogram metrics

Dose-volume histogram (DVH) metrics were extracted from the Raystation treatment planning system (RaySearch Laboratories). The metrics extracted were the mean, maximum, and minimum dose, the relative volume receiving at least a given dose (ie, the value obtained for V30Gy[%] refers to the percent volume of a structure receiving at least 30 Gy) from 5 to 50 Gy in 5 Gy increments, and the total volume of the region of interest (the full bowel bag for GIT, the lung volume minus the tumor volume for RP, and the esophagus for RE).
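
As a rough illustration of how these metrics are defined, the sketch below computes the mean, maximum, and minimum dose and the V5Gy to V50Gy values from a list of per-voxel doses. It assumes equal-volume voxels, and the function names are hypothetical; the study itself extracted these values from the treatment planning system rather than computing them this way.

```python
def v_dose_percent(voxel_doses_gy, threshold_gy):
    """Percent of voxels (equal volume assumed) receiving at least threshold_gy."""
    if not voxel_doses_gy:
        raise ValueError("structure has no voxels")
    n_at_or_above = sum(1 for d in voxel_doses_gy if d >= threshold_gy)
    return 100.0 * n_at_or_above / len(voxel_doses_gy)


def dvh_metrics(voxel_doses_gy):
    """Mean/max/min dose plus V5Gy..V50Gy in 5 Gy increments."""
    metrics = {
        "mean_gy": sum(voxel_doses_gy) / len(voxel_doses_gy),
        "max_gy": max(voxel_doses_gy),
        "min_gy": min(voxel_doses_gy),
    }
    for level in range(5, 55, 5):
        metrics[f"V{level}Gy_pct"] = v_dose_percent(voxel_doses_gy, level)
    return metrics


# Toy structure with four voxels (made-up doses, not patient data).
toy_doses = [10.0, 20.0, 30.0, 40.0]
toy_metrics = dvh_metrics(toy_doses)
print(toy_metrics["V30Gy_pct"])
```

In this toy structure, two of the four voxels receive at least 30 Gy, so V30Gy[%] is 50.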

Algorithms

Eleven different algorithms were chosen, based on their frequent usage in medical data analysis, to model the 3 toxicities (Table 1). They are briefly described below, but more information can be found about them in the literature22:

  • Least absolute shrinkage and selection operator (LASSO): a linear regression analysis method that performs variable selection with L1 norm regularization.

  • Elastic net: a regularized regression method that linearly combines L1 and L2 penalties.

  • Bayes LASSO: a Bayesian approach to LASSO regression that incorporates a prior distribution on the regression coefficients.

  • K-nearest neighbors: to predict the label of a new instance, the algorithm identifies the “K” training examples that are nearest to the instance and returns their most common label.

  • Support vector machine: a learning algorithm that finds the hyperplane that best separates the classes in the input feature space.

  • Extreme gradient boosting: builds an ensemble of decision trees in a sequential manner, where each tree corrects the errors made by the previous ones.

  • Light gradient boosting machine: this is a gradient-boosting framework based on decision tree algorithms. It is designed for speed and performance.23,24

  • Neural networks: they consist of interconnected nodes (neurons) arranged in layers, with a weight for each connection that is adjusted during training to minimize the error in predictions.

  • Bayes neural network: unlike traditional neural networks, Bayes neural networks estimate a probability distribution for each weight in the network, providing a measure of uncertainty in predictions.

  • Random forest: an ensemble learning method that constructs multiple decision trees during training and outputs the most common class predicted by individual trees.

  • Random forest regression: an extension of random forest to regression problems. It outputs the mean prediction of the individual trees for new data points.
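
For reference, the penalized regression methods in the list above can be written compactly. The form below follows the parameterization commonly used by the glmnet package, in which the tuned hyperparameters lambda and alpha appear directly. The loss term \ell(\beta) and the notation are assumptions added for illustration and are not taken from the study.

```latex
\hat{\beta} = \arg\min_{\beta} \; \ell(\beta)
  + \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]
```

Setting alpha = 1 leaves only the L1 term (LASSO), while values between 0 and 1 mix the L1 and L2 penalties (elastic net).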

Table 1.

Basic characteristics of the algorithms

Model type | Caret label | R package | Requires dummy variables? | Tuned hyperparameters
LASSO | glmnet | glmnet | Yes | lambda
Elastic net | glmnet | glmnet | Yes | alpha, lambda
Bayes LASSO | blasso | monomvn | Yes | sparsity
K-nearest neighbors | knn | | Yes | k
Support vector machine | svmRadial | kernlab | Yes | sigma, C
XGBoost | xgbTree | xgboost | No | nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample
Neural network | nnet | nnet | No | size, decay
Bayes neural network | brnn | brnn | No | neurons
Random forest | rf | randomForest | No | mtry, maxnodes
Regression random forest | rf | randomForest | No | mtry, maxnodes
LightGBM | NA | lightgbm | No | num_leaves, max_bin, n_estimators, min_child_samples, max_depth, learning_rate, bagging_fraction

Abbreviations: LASSO = least absolute shrinkage and selection operator; LightGBM = light gradient boosting machine; NA = not available; XGBoost = extreme gradient boosting.

Model construction

The schematic of model construction is illustrated in Fig. 1. Each data set was partitioned into a test set and a training set. Training/test set splits of 0.5, 0.6, 0.7, 0.8, and 0.9 were analyzed. An 8-times repeated 2-fold cross-validation was used on the SCCA training data set for tuning the hyperparameters of each individual model type and each data set split. For the NSCLC data sets, a 6-times repeated 3-fold cross-validation was implemented instead. All model types were run in classification mode except for random forest regression, which was run in regression mode. Hyperparameter tuning was achieved by minimizing the root mean square error for the regression model and maximizing kappa for the classifiers. Kappa is defined as the agreement between our classifier and the observed outcomes, taking the hypothetical probability of chance agreement into account. In terms of confusion matrix elements, it can be calculated as

$$\kappa = \frac{2 \times (TP \times TN - FN \times FP)}{(TP + FP) \times (FP + TN) + (TP + FN) \times (FN + TN)}$$
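
A minimal sketch of this calculation is given below. It computes kappa from the four confusion matrix counts and cross-checks the result against the textbook definition based on observed and chance agreement. The function names and the zero-denominator convention are assumptions made for illustration.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a binary classifier from confusion matrix counts."""
    numerator = 2.0 * (tp * tn - fn * fp)
    denominator = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    if denominator == 0:
        # Degenerate case (only one class observed and predicted); returning 0
        # is a convention assumed here, not something stated in the text.
        return 0.0
    return numerator / denominator


def kappa_from_agreement(tp, tn, fp, fn):
    """Same quantity via (observed - chance) / (1 - chance), as a cross-check."""
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (p_observed - p_chance) / (1.0 - p_chance)


# Made-up counts for illustration only.
print(cohen_kappa(tp=20, tn=50, fp=10, fn=20))
print(kappa_from_agreement(tp=20, tn=50, fp=10, fn=20))
```

Both functions should agree up to floating-point error for any non-degenerate confusion matrix.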

Figure 1.

Diagram of the model building and assessment process. Preprocessing, depending on the model being trained, may involve dummy coding, deleting zero variance features, or rescaling. The randomization and splitting of the initial data set correspond to a Monte Carlo cross-validation approach for the outer loop. The inner loop undergoes repeated k-fold cross-validation or out-of-bag error minimization for model construction.

Abbreviations: AUC = area under the receiver operating characteristic curve; AUPRC = area under the precision-recall curve.

Models were trained on the training set and assessed on the test set to compute their performance metrics. The data set split with the best final performance based on the area under the precision-recall curve (AUPRC) was selected as the final model for each algorithm type.

This whole procedure was repeated 100 times with different randomization seeds. However, it is important to note that the same set of 100 randomization seeds was used for the 100 iterations of each algorithm type and data set split such that, for the same iteration number, the models were trained and tested on identical training and test sets for a specific data set. This allows for the performance metrics of any 2 algorithms to be analyzed through a pairwise comparison. The evaluation metrics calculated for each algorithm were accuracy, the area under the receiver operating characteristic curve (AUC), AUPRC, and the F1 score.
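
The seeding scheme described above can be sketched as follows. Because every algorithm reuses the same seed for a given iteration, it sees the same training and test patients, which is what makes a paired comparison valid. The seed values, the 0.8 split, and the helper names are hypothetical; the study does not list its seeds.

```python
import random


def monte_carlo_split(n_patients, train_fraction, seed):
    """Shuffle patient indices with a fixed seed and cut into train/test."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * n_patients))
    return indices[:n_train], indices[n_train:]


def win_percentage(scores_a, scores_b):
    """Percent of paired iterations in which algorithm A scores above B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired by iteration")
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return 100.0 * wins / len(scores_a)


seeds = list(range(100))  # hypothetical seed values
# Two algorithms reusing the same seeds get identical splits per iteration.
splits_first_algorithm = [monte_carlo_split(232, 0.8, s) for s in seeds]
splits_second_algorithm = [monte_carlo_split(232, 0.8, s) for s in seeds]
print(splits_first_algorithm == splits_second_algorithm)
```

With paired scores in hand, `win_percentage` gives the kind of head-to-head frequency that a pairwise comparison table reports.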

Note that for the algorithm trained in regression mode, the output it provides for each patient is a numerical score ranging from 0 to 1. This score becomes proportional to the probability of developing the toxicity in question after Platt scaling is applied. Accuracy is calculated by setting a threshold of 0.5, where patients with a score above the threshold are considered to be at risk of developing toxicity. AUC and AUPRC are evaluated by varying the risk threshold from 0 to 1, plotting its effect on sensitivity/specificity or precision/recall, respectively, and then computing the area under the resulting curve. The F1 score, the harmonic mean of precision and recall, is measured at the threshold that maximizes it, which assigns equal importance to both metrics. Calibration slope and intercept are also automatically measured and reported.
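
The threshold sweep described above can be sketched in a few lines. This is an illustrative implementation, not the code used in the study: it assumes binary labels coded 0/1 with both classes present, uses trapezoidal integration for the ROC curve, and uses the step-wise (average precision) estimate for the precision-recall curve, which is one of several common conventions.

```python
def _counts_at_thresholds(y_true, scores):
    """(tp, fp) after lowering the threshold past each distinct score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = []
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp, fp))
    return points


def roc_auc(y_true, scores):
    """Trapezoidal area under the ROC curve."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    area, prev_tpr, prev_fpr = 0.0, 0.0, 0.0
    for tp, fp in _counts_at_thresholds(y_true, scores):
        tpr, fpr = tp / pos, fp / neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve."""
    pos = sum(y_true)
    area, prev_recall = 0.0, 0.0
    for tp, fp in _counts_at_thresholds(y_true, scores):
        recall = tp / pos
        area += (recall - prev_recall) * tp / (tp + fp)
        prev_recall = recall
    return area


def best_f1(y_true, scores):
    """Largest F1 over all thresholds in the sweep."""
    pos = sum(y_true)
    return max(2 * tp / (2 * tp + fp + (pos - tp))
               for tp, fp in _counts_at_thresholds(y_true, scores))


# Made-up labels and scores for four patients.
labels = [0, 0, 1, 1]
risk_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, risk_scores), average_precision(labels, risk_scores))
```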

Algorithm rankings were computed for each data set and iteration by arranging the AUPRC or AUC for each model type in descending order. For simplicity, we limited the comparative analysis to AUC and AUPRC results.
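
A small sketch of this ranking step is shown below. The metric values are invented for illustration, the algorithm labels are only examples, and breaking ties by dictionary order is an assumption, since the text does not say how ties were handled.

```python
from statistics import median


def rank_algorithms(metric_by_algorithm):
    """Rank 1 goes to the highest metric; ties keep dictionary order."""
    ordered = sorted(metric_by_algorithm, key=metric_by_algorithm.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def median_ranks(iterations):
    """Median rank of each algorithm across a list of per-iteration metrics."""
    per_algorithm = {}
    for metrics in iterations:
        for name, rank in rank_algorithms(metrics).items():
            per_algorithm.setdefault(name, []).append(rank)
    return {name: median(ranks) for name, ranks in per_algorithm.items()}


# Made-up AUPRC values for three iterations.
toy_iterations = [
    {"LASSO": 0.81, "RF": 0.78, "NeurNet": 0.80},
    {"LASSO": 0.79, "RF": 0.82, "NeurNet": 0.77},
    {"LASSO": 0.83, "RF": 0.80, "NeurNet": 0.81},
]
print(median_ranks(toy_iterations))
```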

We compared the effectiveness of choosing an algorithm based on its performance during the training cross-validation phase (data set-specific selection) against random model selection (which mimics a choice without prior modeling knowledge or data set consideration). Changes in predictive performance between those selection criteria were measured.

The Bayesian-LASSO model uses a Laplace (double exponential) prior distribution for the regression coefficients. This prior distribution is characterized by its sharp peak at zero and heavy tails, which promotes both sparsity in the coefficients (by pushing them toward zero) and robustness (by allowing some coefficients to remain relatively large).

Using the random forest regression model as a classifier is unconventional, but we have incorporated it into the GUI as it may offer advantages in handling imbalanced data sets. This is done by exploiting a key characteristic of the algorithm. For a random forest trained in classification mode, each individual decision tree votes either 0 or 1 to predict the outcome of a given patient in the test set based on the terminal node where the patient falls within the tree. The terminal node is populated with the training set patients that were routed to it when the tree was constructed. The tree will vote 1 if more than 50% of those training set patients have a 1 label and will vote 0 otherwise. In contrast, for a random forest in regression mode, each tree outputs an average of those values (ie, the actual proportion of outcomes labeled as 1 in that terminal node). This feature preserves the weight information of each tree, providing information about the strength of its prediction.
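
The difference between the two modes can be illustrated with a toy calculation. The sketch below takes, for a single test patient, the proportion of positive training labels in the terminal node of each tree, then aggregates them either by majority vote or by averaging. The proportions are hypothetical and the functions only mimic the aggregation step, not a full random forest.

```python
def classification_forest_score(leaf_positive_fractions):
    """Fraction of trees voting 1; a tree votes 1 only if its leaf is >50% positive."""
    votes = [1 if frac > 0.5 else 0 for frac in leaf_positive_fractions]
    return sum(votes) / len(votes)


def regression_forest_score(leaf_positive_fractions):
    """Mean of the leaf proportions, keeping each tree's strength of evidence."""
    return sum(leaf_positive_fractions) / len(leaf_positive_fractions)


# Hypothetical leaf proportions for one test patient across five trees.
leaves = [0.40, 0.45, 0.30, 0.48, 0.90]
print(classification_forest_score(leaves))
print(regression_forest_score(leaves))
```

In this example, four of the five trees sit just below the 50% mark, so the classification forest reports a low score, while the regression forest retains the information that most leaves were close to evenly split.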

GUI

To facilitate the training and comparison of these algorithms, we created a GUI using the shiny library in R. This tool automates the process of model training, cross-validation, and performance analysis for a given data set. It eliminates the need to develop expertise in the 11 algorithms discussed here before being able to select one adequately. The analysis detailed in this work serves as a test of the functionality and usability of our GUI.

The GUI allows the user to input any data as a comma-separated values file. Once loaded, the user will see a list of all the features in the data file. Checkboxes will be available to select only the desired features. A drop-down menu is also available to designate the outcome variable. A report for exploratory data analysis may also be automatically generated (Fig. E1).

Next, one can select which algorithms will be trained, which training/test split ratios will be analyzed, how many iterations should be performed, what kind of cross-validation procedure should be used (Monte Carlo, k-fold, repeated k-fold, or leave-one-out), and whether an automatic feature selection (based on permutation importance) should be conducted. Recommended default values are provided for convenience (Fig. E2).

Any missing values in both training and test sets are imputed using medians for continuous attributes and modes for categorical attributes from the training set. Categorical attributes are subjected to dummy coding when required, which expresses categorical features through a set of binary attributes. Training and test set features are rescaled and normalized. Additionally, covariates with zero variance or near-zero variance are eliminated when necessary.
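
The key point of this preprocessing is that fill values are learned from the training set only and then applied to both sets. A minimal sketch under that assumption follows; the column names, the use of None for missing values, and the helper names are all hypothetical.

```python
from statistics import median, mode


def fit_imputer(train_rows, continuous, categorical):
    """Learn fill values (median or mode) from the training rows only."""
    fill = {}
    for col in continuous:
        fill[col] = median([r[col] for r in train_rows if r[col] is not None])
    for col in categorical:
        fill[col] = mode([r[col] for r in train_rows if r[col] is not None])
    return fill


def apply_imputer(rows, fill):
    """Replace missing (None) entries using the training-set fill values."""
    return [
        {col: (fill.get(col, val) if val is None else val) for col, val in row.items()}
        for row in rows
    ]


def dummy_code(rows, col, levels):
    """Replace a categorical column with one 0/1 column per non-reference level."""
    _reference, *others = levels
    coded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != col}
        for level in others:
            new_row[f"{col}_{level}"] = 1 if row[col] == level else 0
        coded.append(new_row)
    return coded


# Made-up rows, not patient data.
train_rows = [
    {"mean_dose_gy": 12.0, "sex": "F"},
    {"mean_dose_gy": None, "sex": "M"},
    {"mean_dose_gy": 18.0, "sex": "M"},
]
test_rows = [{"mean_dose_gy": None, "sex": None}]

fill_values = fit_imputer(train_rows, ["mean_dose_gy"], ["sex"])
test_ready = dummy_code(apply_imputer(test_rows, fill_values), "sex", ["F", "M"])
print(test_ready)
```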

Near-zero variance variables are handled using the nearZeroVar function of the caret package. By default, this function flags a covariate for removal when the ratio of its second most common value to its most common value is below 5/95. Based on some studies, implementing nearZeroVar with its default values improves model performance as much as techniques/functions that require tuning parameters.25
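
The frequency-ratio check can be sketched as below. This is a simplified Python re-implementation for illustration, not caret's code. It expresses the cutoff in the reciprocal form (most common to second most common above 95/5) and also applies a unique-value percentage check with a 10% default, mirroring the freqCut and uniqueCut defaults documented for caret's nearZeroVar.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a covariate has zero or near-zero variance."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a single distinct value: zero variance
    # Ratio of the most common value's count to the second most common's.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of the number of samples.
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut


# Made-up covariates: 99 zeros and a single one, versus a balanced split.
print(near_zero_variance([0] * 99 + [1]))
print(near_zero_variance([0] * 50 + [1] * 50))
```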

Hyperparameter tuning is also performed automatically. A random search is initially performed across the hyperparameter phase space. The optimal result of that search is used as a central point for a grid search around that neighborhood. After this grid search, another random search is performed. If the results of this random search are better than those of the grid search, the process is repeated. Otherwise, the process ends, and the hyperparameters found using the grid search are kept.
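
The alternating search described above can be sketched as follows. This is a hedged illustration rather than the GUI's implementation: the grid width, the number of random samples, the cap on rounds (added here to guarantee termination), and the toy scoring function are all assumptions. In practice the score would be a cross-validated metric such as kappa.

```python
import itertools
import random


def random_search(score, bounds, n_samples, rng):
    """Sample points uniformly inside the bounds and keep the best one."""
    best_point, best_score = None, float("-inf")
    for _ in range(n_samples):
        point = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        value = score(point)
        if value > best_score:
            best_point, best_score = point, value
    return best_point, best_score


def grid_search_around(score, center, bounds, rel_width=0.1, steps=5):
    """Evaluate a small regular grid centred on `center`, clipped to the bounds."""
    axes = []
    for c, (lo, hi) in zip(center, bounds):
        half = rel_width * (hi - lo)
        start, stop = max(lo, c - half), min(hi, c + half)
        axes.append([min(stop, start + k * (stop - start) / (steps - 1))
                     for k in range(steps)])
    # Keep the centre as a candidate so the grid never scores below its seed.
    candidates = list(itertools.product(*axes)) + [tuple(center)]
    best = max(candidates, key=score)
    return best, score(best)


def tune(score, bounds, n_random=20, max_rounds=10, seed=0):
    """Random search, then grid search, repeated while random search wins."""
    rng = random.Random(seed)
    center, _ = random_search(score, bounds, n_random, rng)
    grid_point, grid_score = grid_search_around(score, center, bounds)
    for _ in range(max_rounds):
        rand_point, rand_score = random_search(score, bounds, n_random, rng)
        if rand_score <= grid_score:
            break  # the grid result stands and is kept
        grid_point, grid_score = grid_search_around(score, rand_point, bounds)
    return grid_point, grid_score


def toy_score(params):
    """Made-up objective with its maximum at (0.3, 0.7)."""
    return -((params[0] - 0.3) ** 2) - ((params[1] - 0.7) ** 2)


best_params, best_value = tune(toy_score, [(0.0, 1.0), (0.0, 1.0)])
print(best_params, best_value)
```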

After training the models, their individual accuracy, AUC, AUPRC, and F1 scores are calculated. Plots and tables are generated to help visualize the difference between the performances of these algorithms (Figs. 2-6). The user should then be able to select the best algorithm for their application.

Figure 2.

A box plot of the ranks for each algorithm across 300 separate iterations. Obtaining a rank of 1 means the algorithm had the highest area under the receiver operating characteristic curve (AUC, top) or area under the precision-recall curve (AUPRC, bottom) for a particular iteration. The line on each box marks the median value of the ranks. The yellow diamond locates the mean of the ranks.

Abbreviations: BayesNN = Bayes neural network; KNNeighbors = k-nearest neighbors; LASSO = least absolute shrinkage and selection operator; LightGBM = light gradient boosting machine; NeurNet = neural network; RF = random forest; SVM = support vector machine; XGBTr = extreme gradient boosting.

Figure 3.

A plotted table showing pairwise comparisons among the algorithms. The numbers inside the cells represent the frequency in percentage (out of 300 comparisons) that the models on the vertical axis yielded a superior area under the receiver operating characteristic curve (AUC, left) or area under the precision-recall curve (AUPRC, right) than the models on the horizontal axis. The fill color corresponds to the P value for the separation of the AUC (left) or AUPRC (right) distributions of the pairs. The gray cells are redundant.

Abbreviations: BayesNN = Bayes neural network; KNNeighbors = k-nearest neighbors; LASSO = least absolute shrinkage and selection operator; LightGBM = light gradient boosting machine; NeurNet = neural network; RF = random forest; SVM = support vector machine; XGBTr = extreme gradient boosting.

Figure 4.

A plotted table showing the average area under the receiver operating characteristic curve (AUC, top left), area under the precision-recall curve (AUPRC, top right), calibration slope (bottom left), and calibration intercept (bottom right) of the models for each data set and the SD of each distribution (corresponding to the 100 iterations per data set). The optimal value for the AUC, AUPRC, and calibration slope is 1. The optimal value for the calibration intercept is 0. The yellow diamonds highlight the models that were superior for each individual cohort. These results provide a comprehensive overview of the model calibration in relation to their predictive performance, emphasizing their suitability for clinical application.

Abbreviations: BayesNN = Bayes neural network; GIT = gastrointestinal toxicity; KNNeighbors = k-nearest neighbors; LASSO = least absolute shrinkage and selection operator; LightGBM = light gradient boosting machine; NeurNet = neural network; RE = radiation esophagitis; RF = random forest; RP = radiation pneumonitis; SVM = support vector machine; XGBTr = extreme gradient boosting.

Figure 5.

The 5 best features for the best 3 models for each toxicity. The top, middle, and bottom rows contain information on gastrointestinal toxicity (GIT), radiation pneumonitis (RP), and radiation esophagitis (RE), respectively.

Abbreviations: LASSO = least absolute shrinkage and selection operator; Neural Net = neural network.

Figure 6.

Feature value plotted against phi (the approximate change in the predicted probability of toxicity because of the given feature value) for the most important features of the 3 toxicities. Black dots represent patients who did not develop toxicity, while red dots represent those who did. Note that “Anal V30” refers to the relative volume of the bowel bag receiving at least 30 Gy.

Interpretability

Shapley values have emerged as a powerful tool for interpreting ML models. These values offer a mathematically rigorous method to assign a proportional contribution to each feature in a predictive model. In toxicity prediction, Shapley values enable us to understand the influence of each feature on the predictions of each model.26

Shapley values quantify the contribution of each feature by considering all possible feature combinations and their corresponding impact on the prediction. This involves calculating the average marginal contribution of a feature across all possible feature combinations, which ensures an unbiased assessment of feature importance. Shapley values are the only attribution method that satisfies the efficiency, symmetry, null effect, and additivity properties, making them unique in the landscape of interpretability techniques.27

In toxicity prediction, where decisions can significantly impact patient outcomes, understanding why a model makes a prediction is critical. In our application, Shapley values convert the feature values of an individual patient into values referred to as “phi.” The phi values are model-specific and serve as quantitative indicators of the extent to which each feature alters the toxicity risk of a patient.
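
To make the idea of phi concrete, the sketch below computes exact Shapley values for one patient by enumerating every feature coalition. It is only practical for a handful of features and is not the routine used by the GUI. Replacing features outside a coalition with a baseline value is one common convention and is an assumption here, as are the toy risk model and its coefficients.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values (phi) for one patient's feature vector x.

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_risk(features):
    """Hypothetical linear risk score: a dose-volume metric and a binary flag."""
    return 0.1 + 0.02 * features[0] + 0.5 * features[1]


patient = [30.0, 1.0]
reference = [10.0, 0.0]
phi = shapley_values(toy_risk, patient, reference)
print(phi)
```

The phi values sum to the difference between the patient's prediction and the baseline prediction, so each one can be read as that feature's share of the change in predicted risk.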

In our GUI, we have incorporated the computation of Shapley values for all 11 algorithms. This component enhances the transparency of the models and aids in comparative analysis, helping researchers and practitioners identify the most influential factors in toxicity prediction. By integrating Shapley values, our tool ensures that the predictions are interpretable and clinically actionable.

Statistical analysis

A P value < .05 was considered statistically significant. The R programming language (version 4.1.1, 2021-08-10, R Core Team, R Foundation for Statistical Computing) was used to conduct all the analyses in this study.15 To construct and assess all the different ML algorithms, we used the open-source R package caret (version 6.0.90), a tool that has demonstrated its ability to deliver competitive results.16 The paired Wilcoxon signed-rank test was used to compare the performance of the different model types by comparing each model in paired ranks for each iteration of model training and analysis.

Results

The model comparison process involves training toxicity models based on 11 distinct algorithms using the same training set for a single toxicity data set. Specific metrics are then calculated for each model using the same test set for comparison. This procedure is repeated 100 times for the same data set, randomly changing the training and test set for each iteration. In our study, the methodology was applied to 3 separate cases of radiation-induced toxicities, resulting in 300 comparative evaluations. The total computation time was approximately 3 days on a 4-core Intel i7-8665U processor with 32 GB of RAM. The full analysis was performed using the GUI, providing a robust test of its capabilities. All figures shown in the results are part of the output generated by this interface.

All data sets

Figure 2 displays the distribution of algorithm rankings based on the average AUC and AUPRC (3 toxicities × 100 repetitions = 300 data points per algorithm). This figure highlights the variability in performance across models, with Bayesian-LASSO, LASSO, and elastic net showing better median ranks (closer to 1) in AUC and AUPRC.

Figure 3 offers a detailed pairwise comparison among all models based on their performance metrics. The Bayesian-LASSO algorithm stands out, significantly outperforming other algorithms in most comparisons, as indicated by the Wilcoxon signed-rank tests. This suggests that Bayesian-LASSO might offer a robust choice for clinical applications, given its consistent performance across diverse data sets.

Data set dependence

As illustrated in Fig. 4, model performance varied significantly with the type of toxicity, highlighting the data set-dependent nature of algorithm efficacy. Random forest, neural network, and LASSO were the top performers for GIT, RP, and RE, respectively, as shown by their superior AUC and AUPRC values. The mean calibration slope and intercept also showed variability in model performance across the different toxicities. They also demonstrate that the models that performed the best in terms of predictability (AUC and AUPRC) may not be the most calibrated models. These results highlight the importance of multiple model comparisons based on the specific characteristics of a given data set.

Algorithm selection criteria

In Table 2, the mean AUPRCs, averaged over all 100 repetitions, are shown for random model selection and data set-specific model selection for each data set according to the type of toxicity. The data set-specific model selection results demonstrate an improvement of 0.066 for the mean AUPRC. Comparing this method to random selection using the Wilcoxon signed-rank test shows that the improvement in AUPRC is statistically significant.

Table 2.

In this table, 2 different model selection criteria are compared

Data set      Random algorithm    Data set-specific algorithm    Increase
              (mean AUPRC)        (mean AUPRC)
SCCA          0.602 ± 0.146       0.726 ± 0.096                  0.124
NSCLC (RE)    0.788 ± 0.059       0.807 ± 0.067                  0.019
NSCLC (RP)    0.822 ± 0.074       0.878 ± 0.060                  0.056
Mean                                                             0.066

The random algorithm method picks an algorithm at random for each iteration and calculates its AUPRC. For the data set-specific algorithm method, the best algorithm is selected based on the performance of the models during the training phase of all the iterations. An average AUPRC increase of 0.066 is obtained by using the latter method.

Abbreviations: AUPRC = area under the precision-recall curve; NSCLC = non-small cell lung cancer; RE = radiation esophagitis; RP = radiation pneumonitis; SCCA = squamous cell carcinoma of the anus.
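The increases in Table 2 follow directly from the reported means, as the short check below shows. Only the mean AUPRC values are taken from the table; the dictionary layout and variable names are illustrative.

```python
# Mean AUPRC per data set: (random algorithm, data set-specific algorithm), from Table 2.
table2 = {
    "SCCA": (0.602, 0.726),
    "NSCLC (RE)": (0.788, 0.807),
    "NSCLC (RP)": (0.822, 0.878),
}

increases = {
    name: round(specific - rand, 3)
    for name, (rand, specific) in table2.items()
}
mean_increase = round(sum(increases.values()) / len(increases), 3)

print(increases)
print(mean_increase)
```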

Shapley values

The models were interpreted using Shapley values. The most important features are shown in Fig. 5. Note that for GIT and RP, the best feature was not always the same one. For all 3 toxicities, the combination of best features varied by model type, indicating complex interactions between model parameters and data set characteristics.

By isolating the most important features, it is possible to calculate their individual importance as they vary across patients (Fig. 6). We can then identify regions for these features where small modifications may have an appreciable impact on the probability of toxicity. Such detailed insights would be helpful for clinicians aiming to customize treatment plans to minimize patient-specific risks.
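As a small worked illustration of what a Shapley value measures, the sketch below computes exact Shapley values for one patient by enumerating all feature subsets, with absent features replaced by a baseline value. The logistic toy model, its coefficients, the feature values, and the baseline are hypothetical and are not taken from the study, whose models and Shapley estimation procedure are not reproduced here.

```python
import math
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features outside the subset are set to their baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical toxicity model: logistic function of three made-up features."""
    dose, volume, age = z
    return 1 / (1 + math.exp(-(-6.0 + 0.15 * dose + 0.04 * volume + 0.02 * age)))

patient = [22.0, 30.0, 70.0]    # hypothetical feature values
reference = [15.0, 20.0, 60.0]  # hypothetical baseline (for example, cohort means)

phi = shapley_values(toy_model, patient, reference)
print([round(p, 4) for p in phi])
```

The values sum to the difference between the patient's predicted probability and the baseline prediction, which is the property that makes per-feature contributions comparable across patients. Exact enumeration scales exponentially with the number of features, so practical tools rely on approximations.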

As a final note, 10% (n = 23) of NSCLC patients progressed to RP grade ≥ 3. These cases typically involved larger relative volumes of lung receiving over 40 Gy. Similarly, 9% (n = 20) of NSCLC patients progressed to RE grade ≥ 3. These cases typically involved a larger mean dose to the esophagus.

Discussion

In our study, we tested our newly developed GUI by training and assessing toxicity models using 11 distinct algorithms, conducting 100 comparisons of their performance for each of our 3 radiation-induced toxicities. We found that no algorithm was superior across all the different toxicities, emphasizing the importance of data set-dependent algorithm selection. However, our findings underscore the superior predictions of the Bayesian-LASSO algorithm, which outperforms other models when averaged across the analyzed toxicities. This result is noteworthy, as it highlights the robustness of the Bayesian-LASSO in diverse scenarios, suggesting that it should be a preferred choice in clinical applications for toxicity prediction. This is of particular interest as the formulation of this algorithm is relatively simple to interpret compared with other more complex, black box-type algorithms.28 The successful application of the GUI in these tests confirms its effectiveness in facilitating such complex analytical tasks and underscores its potential utility in clinical decision-making environments.

While Bayesian-LASSO generally excelled, random forest, the neural network, and LASSO showed superior performance for individual data sets (GIT, RP, and RE, respectively). This underlines the importance of comparing multiple algorithms when training models for toxicity prediction with RT data sets.

When data set-specific model selection was compared with opting for a random algorithm, the mean AUPRC improved by 0.066. Compared instead with the worst-case scenario (selecting the lowest-performing model), the average AUPRC improvement could be as high as 0.125. Superior predictive capabilities mean clinicians can more reliably identify patients at higher risk for developing toxicities. This enables preemptive interventions, such as dose adjustments or the implementation of supportive care strategies, and personalization of treatment plans to better meet individual patient needs.29,30 Together with the interpretability offered by Shapley values, these models can show a physician that the clinicopathological factors of a particular NSCLC patient increase their risk of toxicity, suggesting that they could benefit from a decrease in mean lung dose. Moreover, the plot relating phi to the mean lung dose illustrates the impact of this change on the toxicity risk, which helps guide clinical decision-making. This aligns with the broader goals of precision medicine in oncology.31
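The three selection strategies compared here (data set-specific, random, and worst case) can be sketched as follows. The per-iteration AUPRC values and algorithm names are made up, and the function is a hypothetical illustration of the selection logic rather than the GUI's actual code.

```python
import random

def compare_selection(train_perf, test_perf, seed=0):
    """train_perf and test_perf map algorithm name -> list of per-iteration AUPRC."""
    rng = random.Random(seed)
    names = list(test_perf)
    n_iter = len(test_perf[names[0]])

    def mean(values):
        return sum(values) / len(values)

    # Data set-specific: pick the algorithm with the best training-phase
    # performance over all iterations, then report its test performance.
    chosen = max(names, key=lambda name: mean(train_perf[name]))
    specific = mean(test_perf[chosen])

    # Random: pick an algorithm at random for each iteration.
    random_choice = mean([test_perf[rng.choice(names)][i] for i in range(n_iter)])

    # Worst case: the algorithm with the lowest mean test performance.
    worst = min(mean(test_perf[name]) for name in names)
    return chosen, specific, random_choice, worst

# Made-up AUPRC values for 4 iterations of 3 hypothetical algorithms.
train_perf = {
    "lasso": [0.74, 0.76, 0.75, 0.77],
    "random_forest": [0.70, 0.69, 0.72, 0.71],
    "knn": [0.60, 0.62, 0.58, 0.61],
}
test_perf = {
    "lasso": [0.72, 0.75, 0.73, 0.74],
    "random_forest": [0.69, 0.70, 0.68, 0.71],
    "knn": [0.57, 0.60, 0.59, 0.58],
}

chosen, specific, random_choice, worst = compare_selection(train_perf, test_perf)
print(chosen, round(specific, 3), round(random_choice, 3), round(worst, 3))
```

The key design point is that the algorithm is chosen from training-phase performance only, so the test-set AUPRC of the chosen algorithm remains an unbiased check of the selection.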

Our top result for RP has an AUC of 0.905. Comparing this with other high-performing models in the literature, we see AUCs of 0.83 and 0.814.32,33 Part of the improvement may be explained by our data coming from a single institution (which means our models are trained based on the specific techniques of this clinic), but this also highlights the advantage of training a large variety of models to select the best one. This serves as an important motivation for our study. For the grade 3 GITs, the literature shows an AUC of 0.633,34 which is much lower than our top 0.922 AUC result. However, note that the incidence of GIT in that reference is 44.7%, while the incidence in our case is 9%. The data imbalance causes the AUC to appear deceptively high in relation to the performance of the model, as shown by the lower AUPRC value of 0.679 for our case. Nevertheless, our method results in an improved model when correcting for the data imbalance, which supports the benefits of this approach. For RE, our best result has an AUC of 0.754. In the literature, we find better and comparable results with AUCs of 0.84 and 0.73, respectively, for similar cohorts.35,36 This demonstrates that, while model selection may yield improvements, better results may be achieved with different data sets.
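The effect of class imbalance on the two metrics can be illustrated with a small simulation. In the sketch below, the same score distributions are evaluated at two event rates chosen to echo the 44.7% and 9% incidences mentioned above. The Gaussian scores, the sample size, and the use of average precision as the AUPRC estimate are assumptions made for the example.

```python
import random

def roc_auc(labels, scores):
    """ROC AUC by pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive case."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for k, (_, t) in enumerate(ranked, start=1):
        if t == 1:
            hits += 1
            total += hits / k
    return total / hits

def simulate(prevalence, n=2000, seed=0):
    """Scores are N(1, 1) for events and N(0, 1) for non-events at any prevalence."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    scores = [rng.gauss(1.0 if t == 1 else 0.0, 1.0) for t in labels]
    return labels, scores

balanced = simulate(prevalence=0.45)
imbalanced = simulate(prevalence=0.09)

auc_balanced, ap_balanced = roc_auc(*balanced), average_precision(*balanced)
auc_imbalanced, ap_imbalanced = roc_auc(*imbalanced), average_precision(*imbalanced)
print(round(auc_balanced, 3), round(ap_balanced, 3))
print(round(auc_imbalanced, 3), round(ap_imbalanced, 3))
```

In this setup the AUC is nearly unchanged by prevalence, whereas the average precision drops sharply when events are rare, which is why the AUPRC is the more informative metric for comparing models across data sets with different incidences.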

The results showing that data-specific model selection yields an improvement over random model selection may be explained by considering the differing strengths of these algorithms. Some of these are better at modeling nonlinear effects, while others are better at extrapolating and interpolating data.37 It is then beneficial to have a systematic method for modeling any data using all the algorithms mentioned in this study and comparing their predictive performance. This is the purpose of our GUI.

Each separate algorithm comes with its own complexities. Comparing the performance of different algorithms on a particular data set can become considerably time-consuming for a researcher.9,38 The GUI developed here allows for an automatic, interactive, end-to-end multialgorithm analysis of any RT data set while providing mathematically robust model interpretability through the use of Shapley values.

Our analysis and the developed GUI have significant potential for integration into heuristic clinical decision support systems.39 By providing interpretable and accurate toxicity predictions, our models could enhance the decision-making process in RT. The use of Shapley values adds a layer of interpretability that is vital for clinical acceptance and application.40 Clinicians can leverage these insights to understand the key factors influencing toxicity risk, allowing for more informed and personalized patient care strategies.41 This ability to interpret the predictions given by an algorithm is critical for its credibility and clinical acceptability, so selecting a model based on the most transparent interpretation may be desirable.42,43

There were some limitations to our study. The analysis for all data sets was performed retrospectively, which could result in selection bias. Our work includes 3 toxicities, 2 of which are from the same treatment site (lung) and all of which are from the same institution. While this does not represent a comprehensive sample of treatment outcome data sets studied in the field of RT, the results do serve as evidence that the performance of ML algorithms depends on the data set. Additionally, we did not take the expertise of the investigator into account. Considerable expertise on a specific algorithm could justify its selection, as it has been shown that the impact of hyperparameter tuning can surpass that of algorithm selection.44 Nonetheless, comparing those results with the performance of the top algorithms included in this study is still advisable. It should also be noted that the full comparative analysis of the different algorithms takes a considerable amount of time (approximately 1 day per toxicity for a data set of about 300 patients). While this may not be an issue for a big institution, smaller clinics may not have the resources to allow this much computational time.

Given the size and variety of the data studied in this work, expanding the range of toxicities and incorporating a variety of future clinical data sets from different institutions could enhance the utility and applicability of these models. The patterns that the data in Fig. 6 follow offer another approach to build on this work. Fitting the points with carefully selected functions for all the feature plots of a data set may allow for the creation of nomograms or other heuristic tools for different toxicities.45 The performance of these functions can then be assessed and compared with the underlying model. The development of this GUI is meant to serve as a starting point for an evolving tool with the goal of slowly accumulating RT data to improve the overall reliability of OPMs.

Finally, we acknowledge that there exist alternative modeling techniques that incorporate spatial information from dose distributions. Studies by Dean et al46 and McWilliam et al47 have highlighted the utility of spatial dose metrics in enhancing predictions of radiation-induced toxicity, suggesting potential areas for future integration into our models. Moreover, the highly correlated nature of DVH data presents unique challenges. In our study, each algorithm handled this issue through its own distinct feature selection and handling methods, which likely contributed to the differences observed in their performance. Alternatively, this challenge can be addressed through statistical approaches, as demonstrated in several recent studies that treat DVH curves as functional data.48,49

Conclusions

We have modeled treatment outcomes across 3 radiation-induced toxicities with 11 diverse algorithms implemented in the open-source software R, interfaced through the package caret. Our findings provide evidence that the Bayesian-LASSO logistic regression algorithm yields superior discriminative performance on average across the data sets. However, when comparing performance within individual data sets, random forest, the neural network, and LASSO achieved the best ranks. These results demonstrate that model performance can be data set-specific, and an informed algorithm selection based on training phase performance can improve the final predictive ability. The study successfully tested a GUI that provides a systematic method for automating multialgorithm comparisons of OPMs, highlighting its utility in generating visual results and offering mathematically sound model interpretations.

Disclosures

Ramon M. Salazar, Alexandra O. Leone, Saurabh S. Nair, and Joshua S. Niedzielski report support through a grant from Varian Medical Systems. Joshua S. Niedzielski also reports a research grant from the Fund for Innovations in Cancer Informatics. Brian De reports grant funding from RSNA (RR2111) and honoraria from Sermo, Inc. Prajnan Das reports honoraria from ASTRO, ASCO, Beyer, Imedex, Physicians Education Resource, and Conveners. Laurence E. Court reports grants from Varian Medical Systems, NCI, CPRIT, Wellcome Trust, and the Fund for Cancer Informatics.

Acknowledgments

We would like to acknowledge the members of the Toxicity Research Workgroup as well as the Court Lab for their support and emphasize the value of the University of Texas MD Anderson High Performance Computing Center, which made this research possible. Ramon M. Salazar was responsible for statistical analysis.

Footnotes

Sources of support: This research was funded by Varian Medical Systems.

Anonymized data will be made available upon reasonable request after publication to researchers with a methodologically sound proposal to achieve the aims in the approved proposal. A data transfer agreement must be reached between the requestor's institution and MD Anderson Cancer Center.

Supplementary material associated with this article can be found in the online version at doi:10.1016/j.adro.2024.101675.

Appendix. Supplementary materials

Supplementary_Document_Unmarked
mmc1.docx (578.1KB, docx)

References

1. Rancati T, Fiorino C, Sanguineti G, Valdagni R, Orlandi E. Editorial: Modeling for prediction of radiation-induced toxicity to improve therapeutic ratio in the modern radiation therapy era. Front Oncol. 2021;11. doi: 10.3389/fonc.2021.690649.
2. Wang Z, Li VR, Chu FI, et al. Predicting overall survival for patients with malignant mesothelioma following radiotherapy via interpretable machine learning. Cancers (Basel). 2023;15:3916. doi: 10.3390/cancers15153916.
3. Chamseddine I, Kim Y, De B, et al. Predictive modeling of survival and toxicity in patients with hepatocellular carcinoma after radiotherapy. JCO Clin Cancer Inform. 2022;6. doi: 10.1200/CCI.21.00169.
4. Núñez-Benjumea FJ, González-García S, Moreno-Conde A, Riquelme-Santos JC, López-Guerra JL. Benchmarking machine learning approaches to predict radiation-induced toxicities in lung cancer patients. Clin Transl Radiat Oncol. 2023;41. doi: 10.1016/j.ctro.2023.100640.
5. Yakar M, Etiz D, Metintas M, Ak G, Celik O. Prediction of radiation pneumonitis with machine learning in stage iii lung cancer: A pilot study. Technol Cancer Res Treat. 2021;20. doi: 10.1177/15330338211016373.
6. Niedzielski JS, Yang J, Stingo F, et al. A novel methodology using CT imaging biomarkers to quantify radiation sensitivity in the esophagus with application to clinical trials. Sci Rep. 2017;7(1):6034. doi: 10.1038/s41598-017-05003-x. Published 2017 Jul 20.
7. Alam SR, Zhang P, Zhang SY, et al. Early prediction of acute esophagitis for adaptive radiation therapy. Int J Radiat Oncol Biol Phys. 2021;110:883–892. doi: 10.1016/j.ijrobp.2021.01.007.
8. Raturi VP, Tochinai T, Hojo H, et al. Dose-volume and radiobiological model-based comparative evaluation of the gastrointestinal toxicity risk of photon and proton irradiation plans in localized pancreatic cancer without distant metastasis. Front Oncol. 2020;10. doi: 10.3389/fonc.2020.517061.
9. Isaksson LJ, Pepa M, Zaffaroni M, et al. Machine learning-based models for prediction of toxicity outcomes in radiotherapy. Front Oncol. 2020;10:790. doi: 10.3389/fonc.2020.00790.
10. Luo Y, Tseng HH, Cui S, Wei L, Ten Haken RK, El Naqa I. Balancing accuracy and interpretability of machine learning approaches for radiation treatment outcomes modeling. BJR Open. 2019;1. doi: 10.1259/bjro.20190021.
11. Deist TM, Dankers FJWM, Valdes G, et al. Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers. Med Phys. 2018;45:3449–3459. doi: 10.1002/mp.12967. Published correction appears in Med Phys. 2019;46:1080-1087.
12. Luo G, Stone BL, Johnson MD, et al. Automating construction of machine learning models with clinical big data: Proposal rationale and methods. JMIR Res Protoc. 2017;6:e175. doi: 10.2196/resprot.7757.
13. Rashidi HH, Tran N, Albahra S, Dang LT. Machine learning in health care and laboratory medicine: General overview of supervised learning and Auto-ML. Int J Lab Hematol. 2021;43(suppl 1):15–22. doi: 10.1111/ijlh.13537.
14. Musigmann M, Akkurt BH, Krähling H, et al. Testing the applicability and performance of auto ML for potential applications in diagnostic neuroradiology. Sci Rep. 2022;12:13648. doi: 10.1038/s41598-022-18028-8.
15. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.
16. Kuhn M. caret: Classification and regression training. R package version 6.0-90; 2021. https://CRAN.R-project.org/package=caret
17. Salazar RM, Duryea JD, Leone AO, et al. Random Forest Modeling of Acute Toxicity in Anal Cancer: Effects of Peritoneal Cavity Contouring Approaches on Model Performance. Int J Radiat Oncol Biol Phys. 2023:0360–3016. doi: 10.1016/j.ijrobp.2023.08.042.
18. Holliday EB, Morris VK, Johnson B, et al. Definitive Intensity-Modulated Chemoradiation for Anal Squamous Cell Carcinoma: Outcomes and Toxicity of 428 Patients Treated at a Single Institution. Oncologist. 2022;27(1):40–47. doi: 10.1093/oncolo/oyab006.
19. University of Texas MD Anderson Cancer Center. Trial of image-guided adaptive conformal photon vs proton therapy, with concurrent chemotherapy, for locally advanced non-small cell lung carcinoma: treatment related pneumonitis and locoregional recurrence. ClinicalTrials.gov. Published June 2009. Updated May 2020. Identifier: NCT00915005. Study Chair: Zhongxing Liao, MD. Accessed December 4, 2024. https://clinicaltrials.gov/study/NCT00915005
20. MedDRA. Common Terminology Criteria for Adverse Events (CTCAE) v5.0. Accessed October 13, 2023. https://www.meddra.org/.
21. Arroyo-Hernández M, Maldonado F, Lozano-Ruiz F, Muñoz-Montaño W, Nuñez-Baez M, Arrieta O. Radiation-induced lung injury: Current evidence. BMC Pulm Med. 2021;21:9. doi: 10.1186/s12890-020-01376-4.
22. Korstanje J. Advanced forecasting with Python: With state-of-the-art-models including LSTMs, Facebook's prophet, and Amazon's DeepAR. 1st ed. Apress Berkeley; 2021.
23. Zhang J, Mucs D, Norinder U, Svensson F. LightGBM: An effective and scalable algorithm for prediction of chemical toxicity-application to the Tox21 and mutagenicity data sets. J Chem Inf Model. 2019;59:4150–4158. doi: 10.1021/acs.jcim.9b00633.
24. Adachi T, Nakamura M, Shintani T, et al. Multi-institutional dose-segmented dosiomic analysis for predicting radiation pneumonitis after lung stereotactic body radiation therapy. Med Phys. 2021;48:1781–1791. doi: 10.1002/mp.14769.
25. Kumar C S, RamaSree RJ. Dimensionality reduction in automated evaluation of descriptive answers through zero variance, near zero variance and non frequent words techniques - A comparison. Paper presented at: 2015 IEEE 9th International Conference on Intelligent Systems and Control (ISCO); January 9-10, 2015; Coimbatore, India.
26. Ning Y, Ong MEH, Chakraborty B, et al. Shapley variable importance cloud for interpretable machine learning. Patterns (N Y). 2022;3. doi: 10.1016/j.patter.2022.100452.
27. Rodríguez-Pérez R, Bajorath J. Interpretation of machine learning models using shapley values: Application to compound potency and multi-target activity predictions. J Comput Aided Mol Des. 2020;34:1013–1026. doi: 10.1007/s10822-020-00314-0.
28. Mutshinda CM, Sillanpää MJ. Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics. 2010;186:1067–1075. doi: 10.1534/genetics.110.119586.
29. Van den Bosch L, van der Schaaf A, van der Laan HP, et al. Comprehensive toxicity risk profiling in radiation therapy for head and neck cancer: A new concept for individually optimised treatment. Radiother Oncol. 2021;157:147–154. doi: 10.1016/j.radonc.2021.01.024.
30. Desideri I, Loi M, Francolini G, Becherini C, Livi L, Bonomo P. Application of radiomics for the prediction of radiation-induced toxicity in the IMRT era: Current state-of-the-art. Front Oncol. 2020;10:1708. doi: 10.3389/fonc.2020.01708.
31. Arimura H, Soufi M, Kamezawa H, Ninomiya K, Yamada M. Radiomics with artificial intelligence for precision medicine in radiation therapy. J Radiat Res. 2019;60:150–157. doi: 10.1093/jrr/rry077.
32. Lee S, Ybarra N, Jeyaseelan K, et al. Bayesian network ensemble as a multivariate strategy to predict radiation pneumonitis risk. Med Phys. 2015;42:2421–2430. doi: 10.1118/1.4915284.
33. Katsuta Y, Kadoya N, Kajikawa T, et al. Radiation pneumonitis prediction model with integrating multiple dose-function features on 4DCT ventilation images. Phys Med. 2023;105. doi: 10.1016/j.ejmp.2022.11.009.
34. Nilsson MP, Gunnlaugsson A, Johnsson A, Scherman J. Dosimetric and clinical predictors for acute and late gastrointestinal toxicity following chemoradiotherapy of locally advanced anal cancer. Clin Oncol (R Coll Radiol). 2022;34:e35–e44. doi: 10.1016/j.clon.2021.09.011.
35. von Reibnitz D, Yorke ED, Oh JH, et al. Predictive modeling of thoracic radiotherapy toxicity and the potential role of serum alpha-2-macroglobulin. Front Oncol. 2020;10:1395. doi: 10.3389/fonc.2020.01395.
36. Monti S, Xu T, Mohan R, Liao Z, Palma G, Cella L. Radiation-induced esophagitis in non-small-cell lung cancer patients: Voxel-based analysis and NTCP modeling. Cancers (Basel). 2022;14:1833. doi: 10.3390/cancers14071833.
37. den Boer AV, Sierag DD. Decision-based model selection. Eur J Oper Res. 2021;290:671–686.
38. Neagu DC, Guo G, Trundle PR, Cronin MT. A comparative study of machine learning algorithms applied to predictive toxicology data mining. Altern Lab Anim. 2007;35:25–32. doi: 10.1177/026119290703500119.
39. Draguet C, Barragán-Montero AM, Vera MC, et al. Automated clinical decision support system with deep learning dose prediction and NTCP models to evaluate treatment complications in patients with esophageal cancer. Radiother Oncol. 2022;176:101–107. doi: 10.1016/j.radonc.2022.08.031.
40. Ladbury C, Li R, Danesharasteh A, et al. Explainable artificial intelligence to identify dosimetric predictors of toxicity in patients with locally advanced non-small cell lung cancer: A secondary analysis of RTOG 0617. Int J Radiat Oncol Biol Phys. 2023;117:1287–1296. doi: 10.1016/j.ijrobp.2023.06.019.
41. Lapierre A, Bourillon L, Larroque M, et al. Improving patients' life quality after radiotherapy treatment by predicting late toxicities. Cancers (Basel). 2022;14:2097. doi: 10.3390/cancers14092097.
42. Hrinivich WT, Wang T, Wang C. Editorial: Interpretable and explainable machine learning models in oncology. Front Oncol. 2023;13. doi: 10.3389/fonc.2023.1184428.
43. Barragán-Montero A, Bibal A, Dastarac MH, et al. Towards a safe and efficient clinical implementation of machine learning in radiation oncology by exploring model interpretability, explainability and data-model dependency. Phys Med Biol. 2022;67. doi: 10.1088/1361-6560/ac678a.
44. Probst P, Boulesteix AL, Bischl B. Tunability: Importance of hyperparameters of machine learning algorithms. J Mach Learn Res. 2019;20:1934–1965.
45. Meng X, Wang N, Yu M, et al. Development of a nomogram for predicting grade 2 or higher acute hematologic toxicity of cervical cancer after the pelvic bone marrow sparing radiotherapy. Front Public Health. 2022;10. doi: 10.3389/fpubh.2022.993443.
46. Dean JA, Wong KH, Welsh LC, et al. Normal tissue complication probability (NTCP) modelling using spatial dose metrics and machine learning methods for severe acute oral mucositis resulting from head and neck radiotherapy. Radiother Oncol. 2016;120:21–27. doi: 10.1016/j.radonc.2016.05.015.
47. McWilliam A, Kennedy J, Hodgson C, Vasquez Osorio E, Faivre-Finn C, van Herk M. Radiation dose to heart base linked with poorer survival in lung cancer patients. Eur J Cancer. 2017;85:106–113. doi: 10.1016/j.ejca.2017.07.053.
48. Benadjaoud MA, Blanchard P, Schwartz B, et al. Functional data analysis in NTCP modeling: A new method to explore the radiation dose-volume effects. Int J Radiat Oncol Biol Phys. 2014;90:654–663. doi: 10.1016/j.ijrobp.2014.07.008.
49. Dean JA, Wong KH, Gay H, et al. Functional data analysis applied to modeling of severe acute mucositis and dysphagia resulting from head and neck radiation therapy. Int J Radiat Oncol Biol Phys. 2016;96:820–831. doi: 10.1016/j.ijrobp.2016.08.013.
