Abstract
Background
Personalizing medical treatments based on patient-specific risks and preferences can improve patient health. However, models to support personalized treatment decisions are often complex and difficult to interpret, limiting their clinical application.
Methods
We present a new method, using machine learning to create metamodels, for simplifying complex models for personalizing medical treatment decisions. We consider simple interpretable models, interpretable ensemble models, and non-interpretable ensemble models. We use variable selection with a penalty for patient-specific risks and/or preferences that are difficult, risky, or costly to obtain. We interpret the metamodels to the extent permitted by their model architectures. We illustrate our method by applying it to simplify a previously developed model for personalized selection of antipsychotic drugs for patients with schizophrenia.
Results
The best simplified interpretable, interpretable ensemble, and non-interpretable ensemble models contained at most half the number of patient-specific risks and preferences compared to the original model. The simplified models achieved 60.5% (95% credible interval [crI]: 55.2-65.4), 60.8% (95% crI: 55.5-65.7), and 83.8% (95% crI: 80.8-86.6), respectively, of the net health benefit of the original model (quality-adjusted life years gained). Important variables in all models were similar and made intuitive sense. Computation time for the metamodels was orders of magnitude less than for the original model.
Limitations
The simplified models share the limitations of the original model (e.g., potential biases).
Conclusions
Our metamodeling method is disease- and model-agnostic and can be used to simplify complex models for personalization, allowing for variable selection in addition to improved model interpretability and computational performance. Simplified models may be more likely to be adopted in clinical settings and can help improve equity in patient outcomes.
INTRODUCTION
Personalizing medical treatment decisions based on patient-specific risks and preferences can significantly improve patient health. By accounting for patient-specific risks—for example, demographic factors, biological factors, and medical history—patients can be allocated to treatments that they respond to favorably. By accounting for patient-specific preferences—for example, preferences regarding treatment efficacy, side effects, and cost—patients can be allocated to treatments that they prefer.
A variety of approaches have been proposed for integrating evidence on treatment efficacy with patient-specific risks and preferences. These approaches variously account for population-level variations in risks and preferences,1–5 patient-specific variations in risks,6,7 patient-specific variations in preferences,8–10 and patient-specific variations in both risks and preferences.11
Such methods have not been widely adopted in clinical practice, for a variety of reasons. First, existing methods for treatment personalization often produce complex, black box models. It can be difficult for physicians and patients to understand and trust such models.12,13 While empirical studies have not fully examined the extent of this problem in specific settings, it is clear that both physicians and patients must be educated to appreciate algorithmically determined diagnosis and treatment options, particularly those arising from complex models, before such models can attain broad use. As one article notes, such education is crucial – but also complicated.14
Second, these methods can require running large simulation models to evaluate treatment alternatives for each patient, which can be too computationally costly for practical use. This may occur, for example, with complex microsimulation models and/or when deep learning models are used to predict patient-specific risks and develop treatment plans using high-dimensional, patient-specific data. Lengthy computation times may preclude real-time use of such models; this, along with complex computer code and sophisticated computation platforms, may hinder the adoption of such models.
Third, treatment personalization is often based on all risks and/or all preferences. However, some variables can be difficult, risky, or costly to obtain for each patient. In some cases, evaluating patient risks could even potentially harm patients if, for example, invasive medical procedures are required. The potential benefits of excluding such variables must be balanced with the potential downsides of less informed treatment selection.11
The Second Panel on Cost-effectiveness in Health and Medicine suggests that efficient model emulators, or metamodels, could be developed to reduce the computational cost of complex models for health decision making.15 A metamodel is a statistical approximation of an original, typically more complex, model. Besides directly substituting for the original model when predicting its outputs, metamodels can also be used to improve model interpretability16 and aid in model calibration.17
Metamodels have been most commonly used in the physical sciences and engineering.18–20 In health policy, metamodels have primarily been used to lower computation time of microsimulation and other complex models, to streamline value-of-information analysis, and to assess decision uncertainty due to uncertainty in model inputs.16,21–31 Recent studies have described the use of metamodels to simplify microsimulation models of strategies for infant HIV testing and screening,32 colorectal cancer screening,33 and hepatitis C testing and treatment in correctional settings.34
In this paper we propose a general, disease- and model-agnostic method that uses machine learning to develop metamodels that simplify complex models for personalization of medical treatment decisions. Our method allows for determination of the optimal degree of personalization considering that some variables may be more difficult, risky, or costly to obtain than others, and enables creation of models that are more interpretable and less computationally expensive than complex black box models. We consider three model categories: simple interpretable models, interpretable ensemble models, and non-interpretable ensemble models. We demonstrate our method by applying it to simplify an existing, complex model for personalizing the selection of antipsychotic drugs for patients with schizophrenia. We assess the net health benefit, in units of quality-adjusted life years (QALYs), of using the metamodels and the original model to personalize treatment compared to no personalization, among other measures of model performance. We interpret the metamodels to the degree allowed by their model architectures.
METHODS
Metamodeling Framework
Our metamodeling method (Figure 1) works as follows. We begin by running the original, complex model to personalize treatment for a demographically representative population of patients. We create a dataset with, for each patient, features (demographics, risks, and preferences), projected mean net health benefit for each treatment, and the recommended treatment. We split this dataset into training and test datasets.
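For illustration, a minimal sketch of this split in R (the language used for our analyses), where dat is a hypothetical data frame of simulated patients, one row per patient:

```r
# Hold out a random test set (a 90:10 split, as in our application);
# dat is a hypothetical data frame of simulated patients.
set.seed(1)
train_idx <- sample(seq_len(nrow(dat)), size = floor(0.9 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```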
Figure 1. Overview of metamodeling framework.
We preprocess the data by standardizing numeric patient features. For each numeric variable, we subtract the sample mean and divide by the sample standard deviation. During prediction, we apply this same variable standardization to patient features before passing them to the model. This standardization has a variety of benefits: for example, it may allow for faster gradient descent during model training.
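A minimal sketch of this preprocessing step, with hypothetical data frames and feature names; the key point is that the training-set means and standard deviations are stored and reused for new patients:

```r
# Standardize numeric features using training-set statistics only.
num_vars <- c("age", "weight_kg", "qtc", "panss")      # hypothetical feature names

train_scaled <- scale(as.matrix(train[num_vars]))      # subtract mean, divide by SD
feat_means <- attr(train_scaled, "scaled:center")
feat_sds   <- attr(train_scaled, "scaled:scale")

# Apply the identical transformation to new patients before prediction
test_scaled <- scale(as.matrix(test[num_vars]),
                     center = feat_means, scale = feat_sds)
```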
We then train machine learning metamodels on the training dataset using various model architectures. We treat the treatment selection problem as a classification problem: that of identifying the best treatment for a patient. We include steps for hyperparameter tuning and variable selection. We consider interpretable and non-interpretable models. We define interpretable models as those that a physician could easily understand.
For the interpretable models, we consider commonly used decision tree models including classification and regression trees, conditional inference trees, and C5.0 decision trees.35–37 We consider decision tree models as they are easy for physicians to understand since the models visually represent key questions to ask patients and the recommended treatment in each case. We constrain the trees to have a maximum depth of four; other maximum tree depths could be used.
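As one example, a depth-limited CART metamodel could be fit with the rpart package roughly as follows; the data frame train_std (standardized features plus the recommended drug from the original model) and the feature names are hypothetical:

```r
# Fit a depth-limited classification tree (CART) metamodel with rpart.
library(rpart)

tree_fit <- rpart(
  recommended_drug ~ qtc + weight_kg + prolactin + panss + age + sex,
  data    = train_std,
  method  = "class",                        # treatment selection as classification
  control = rpart.control(maxdepth = 4)     # limit depth for interpretability
)

# Recommend a drug for held-out patients
pred_drug <- predict(tree_fit, newdata = test_std, type = "class")
```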
For the non-interpretable models, we consider supervised learning models that have previously been identified as potentially useful for health services research including generalized linear models, deep learning models, extremely randomized trees, distributed random forests, and gradient boosting machines.38
Models that combine predictions from multiple individual models can often yield improved performance. Thus, we consider both non-interpretable and interpretable ensemble models created using the SuperLearner algorithm, which has been shown to be asymptotically optimal.39–42 This algorithm allows for combination of predictions from multiple individual models and uses cross-validation to prevent overfitting. We classify the ensemble of interpretable models as interpretable since physicians could still understand the component models. However, such a model is less interpretable than its component models.
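The sketch below illustrates the stacking idea underlying the SuperLearner algorithm, using base R rather than the SuperLearner package itself: out-of-fold predictions from component models become the inputs to a combining model. All object and column names are hypothetical.

```r
# Simplified stacking: cross-validated (out-of-fold) predictions from two
# component models are used to train a combining (meta) model.
library(rpart)
library(nnet)    # multinom() serves as the combining model here

k <- 5
fold <- sample(rep(1:k, length.out = nrow(train_std)))
oof_tree <- character(nrow(train_std))
oof_mlr  <- character(nrow(train_std))

for (i in 1:k) {
  held_out <- fold == i
  fit_tree <- rpart(recommended_drug ~ ., data = train_std[!held_out, ], method = "class")
  fit_mlr  <- multinom(recommended_drug ~ ., data = train_std[!held_out, ], trace = FALSE)
  oof_tree[held_out] <- as.character(predict(fit_tree, train_std[held_out, ], type = "class"))
  oof_mlr[held_out]  <- as.character(predict(fit_mlr,  train_std[held_out, ]))
}

# The combining model learns how much weight to give each component model
meta_data <- data.frame(tree = factor(oof_tree), mlr = factor(oof_mlr),
                        recommended_drug = train_std$recommended_drug)
ensemble_fit <- multinom(recommended_drug ~ tree + mlr, data = meta_data, trace = FALSE)
```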
We evaluate model performance as the percent of the net health benefit of full personalization, in units of QALYs, that is retained when the metamodel is used in place of the original model. Specifically, using the original model's projections, we compute the mean net health benefit of the metamodel's patient-specific treatment recommendations relative to the single best treatment for the population (no personalization). We compute the analogous quantity for the original model's patient-specific treatment recommendations. We then divide the first value by the second and multiply by 100%; this is the percent of net health benefit retained by personalization using the metamodel compared to the original model.
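A sketch of this calculation, where nhb is a hypothetical matrix of projected net health benefit (QALYs) from the original model with one row per patient and one column per treatment, and meta_choice holds the metamodel's recommended treatment (column index) for each patient:

```r
# Percent of the net health benefit of personalization retained by a metamodel.
idx <- seq_len(nrow(nhb))
pop_best    <- which.max(colMeans(nhb))   # best single treatment, no personalization
orig_choice <- max.col(nhb)               # original model's patient-specific choices

nhb_meta <- mean(nhb[cbind(idx, meta_choice)]) - mean(nhb[, pop_best])
nhb_orig <- mean(nhb[cbind(idx, orig_choice)]) - mean(nhb[, pop_best])

pct_retained <- 100 * nhb_meta / nhb_orig
```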
During the process of training the metamodels, we use hyperparameter tuning to improve model performance. Hyperparameter tuning is a way of testing various parameters used to control a machine learning process (e.g., maximum tree depth) to determine the best values for those parameters. The number of hyperparameters and potential values depends on the machine learning model architecture. We use grid search to identify the best hyperparameter values. We perform hyperparameter tuning on the training dataset using cross-validation.
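For example, tuning the maximum depth of a classification tree by grid search with five-fold cross-validation might look roughly as follows (classification accuracy is used as the tuning criterion here for brevity; object names are hypothetical):

```r
# Grid search over a single hyperparameter (maximum tree depth) with
# five-fold cross-validation on the training set.
library(rpart)

depths <- 2:6
k <- 5
fold <- sample(rep(1:k, length.out = nrow(train_std)))

cv_accuracy <- sapply(depths, function(d) {
  mean(sapply(1:k, function(i) {
    fit  <- rpart(recommended_drug ~ ., data = train_std[fold != i, ],
                  method = "class", control = rpart.control(maxdepth = d))
    pred <- predict(fit, train_std[fold == i, ], type = "class")
    mean(pred == train_std$recommended_drug[fold == i])
  }))
})

best_depth <- depths[which.max(cv_accuracy)]
```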
We also perform variable selection; that is, we determine which variables from the original model to include in a metamodel. Fitting models with all possible combinations of variables is frequently too computationally expensive, so we use a greedy backward-elimination approach. For a given model, we exclude each candidate variable in turn, refit the model without it, and calculate the projected net health benefit based on the new treatment recommendations.
Importantly, we allow penalties to be applied to variables that are difficult, risky, or costly to obtain. A variable may be difficult to obtain if it requires time and resources to evaluate: for example, eliciting patient preferences may require both patient and staff time, and possibly an electronic interface.43 A variable may be risky to obtain if collecting the data could adversely affect patient health, for example, a risk that requires an invasive medical procedure to evaluate. A variable may be costly to obtain if it requires use of medical equipment, such as imaging devices. If a variable is difficult or risky to obtain, we credit the model with QALYs equal to that variable's penalty when the variable is removed. If a variable is costly to obtain, a willingness-to-pay threshold must be specified so that the cost can be converted to QALYs, as is typical in calculating net health benefit; in the US, a willingness-to-pay threshold of $50,000-$100,000/QALY may be appropriate.44 The resulting cost penalty, in units of QALYs, is then applied in the same way. These categories are not mutually exclusive: for example, a variable may be both risky and costly to obtain, in which case the penalties, in units of QALYs, are summed before they are applied.
We then select the best performing reduced model, retune the hyperparameters, and refit the model. We repeat this process until there are no variables left to exclude. We compare the performance of these models on the training dataset using cross-validation to determine the optimal degree of personalization for each model architecture. We select the best performing model in each category—interpretable, interpretable ensemble, and non-interpretable ensemble—and project their performance using the test dataset.
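A rough sketch of one elimination round with such penalties, where fit_and_score() is a hypothetical stand-in for refitting the metamodel on a reduced variable set and returning its cross-validated projected net health benefit, and the penalty values are illustrative:

```r
# One round of greedy backward elimination with variable penalties.
wtp <- 50000                                  # willingness-to-pay threshold, $/QALY
penalty <- c(pref_efficacy = 0.05,            # QALY penalty per elicited preference
             pref_weight   = 0.05,
             risk_qtc      = 0,               # risks assumed easy to obtain here
             risk_panss    = 0)
# A costly variable could be converted at the threshold: a hypothetical
# $2,500 test would add 2500 / wtp = 0.05 QALYs to its penalty.

current_vars <- names(penalty)
scores <- sapply(current_vars, function(v) {
  reduced <- setdiff(current_vars, v)
  fit_and_score(reduced) + penalty[[v]]       # NHB of reduced model plus recovered penalty
})
drop_var     <- current_vars[which.max(scores)]
current_vars <- setdiff(current_vars, drop_var)
# Repeat, retuning hyperparameters after each elimination, until no variables remain.
```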
Finally, we interpret the models. We compute measures such as variable importance. For interpretable trees, we plot them.
Personalizing Schizophrenia Treatment
We illustrate our method by applying it to simplify an existing model for personalized selection of antipsychotic drugs for patients with schizophrenia.11 Briefly, the schizophrenia treatment selection model considers 15 antipsychotic drugs and placebo. It accounts for patient-specific risks and preferences for nine different possible outcomes (efficacy, drug cost, years of life lost, and six potential side effects—weight gain, extrapyramidal side effects, prolactin increase, QTc prolongation, sedation, and agranulocytosis). It also accounts for patient demographics including age, sex, and race (black, white, other). It personalizes treatment by simulating 10,000 identical versions of each patient on each treatment alternative over a period of five years and selecting the treatment that maximizes expected net health benefit. Analysis using this model found that, compared to treating all patients with the drug that yields the greatest net health benefit on average (amisulpride), personalizing the selection of antipsychotic drugs for schizophrenia patients would be expected to yield 0.33 QALYs (95% credible interval [crI]: 0.30–0.37) per patient over a 5-year model period at an incremental cost of $4,849/QALY gained (95% crI: dominant–$12,357).
Using the full model, patient-specific risks and preferences both influenced treatment selection. Clozapine, which is the most effective drug but has a relatively severe side effect profile, was generally selected for patients with high baseline Positive and Negative Syndrome Scale (PANSS) scores and preferences for drug efficacy. Amisulpride, which is the second most effective drug and has a better side effect profile, was generally selected for patients with lower baseline PANSS scores and preferences for drug efficacy. Amisulpride was also generally selected for patients with higher risks of or preferences for avoiding weight gain. Other drugs, including aripiprazole, haloperidol, and lurasidone, were selected for patients with less common sets of risks and preferences. Placebo tended to be selected for patients who were more susceptible to side effects and had stronger preferences for avoiding them.
The full model may be difficult to implement in a clinical setting. It requires evaluation of risks and preferences for nine different outcomes for each patient. It is not readily interpretable, so physicians and others may be less likely to trust the recommendations of the model. Additionally, it requires running a relatively complex and computationally expensive simulation model for each patient.
We applied our method to simplify this model. We first used the original model to simulate 10,000 patients that are representative of the adult US population with schizophrenia, as previously described.11 For each patient, we recorded patient features, drug scores, and the recommended drug. We split this dataset into training and test datasets using a 90:10 split.
We trained machine learning models using the training dataset and all patient features. We first followed the procedure described above but omitted the steps for variable selection and variable penalties. We computed model performance on the test dataset in terms of percent of net health benefit of personalization retained compared to using the original model.
We then repeated this process, now including variable selection and variable penalties as described above. We did not penalize inclusion of risks as they are all relatively easy to obtain for this treatment selection problem. We penalized inclusion of each preference by 0.05 QALYs because preferences require time and effort to elicit in clinical settings. These penalties are illustrative; further work could be directed to more accurately determine their values. We considered other penalties in sensitivity analysis. We computed performance on the test dataset as the percentage of net health benefit of personalization retained compared to using the original model.
Finally, we interpreted the machine learning models to the extent permitted by their model architectures. For all models we computed standardized variable importance. For ensemble models, for each variable, we computed variable importance as the mean of standardized variable importance for the component models. We plotted the optimal interpretable tree.
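A minimal sketch of this step, using rpart's split-based importance as an example; the fitted component models are hypothetical:

```r
# Standardized variable importance for individual models and their ensemble.
std_importance <- function(fit) {
  imp <- fit$variable.importance       # rpart stores split-based importance
  imp / max(imp)                       # standardize so the largest value is 1
}

component_fits <- list(fit_tree1, fit_tree2)   # hypothetical fitted trees
imp_list <- lapply(component_fits, std_importance)

# Ensemble importance: mean standardized importance across component models
vars <- unique(unlist(lapply(imp_list, names)))
ens_importance <- sapply(vars, function(v)
  mean(sapply(imp_list, function(imp) if (v %in% names(imp)) imp[[v]] else 0)))
```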
All analyses were programmed in R version 3.5.1,45 with input data and statistical code for replication and extension of our analysis published at https://sdr.stanford.edu concurrent with publication.
RESULTS
Figure 2 shows the performance of the metamodels fit without variable selection on the test dataset. Table S1 shows selected hyperparameter values. Personalization using the models was able, in varying degrees, to achieve much of the net health benefit of personalization using the original model as compared to no personalization. Among the interpretable models, the C5.0 decision tree model gained 62.0% as many QALYs as the original model, and the ensemble model gained 62.5% as many QALYs as the original model. As expected, the non-interpretable models generally performed better than the interpretable models. The best-performing non-interpretable models, the gradient boosting machine model and the non-interpretable ensemble model, yielded 96.5% and 96.9% as many QALYs as the original model, respectively.
Figure 2. Metamodel performance without variable selection.

We fit machine learning metamodels on the training dataset without variable selection or penalization of difficult, risky, or costly-to-obtain variables. We defined interpretable models as those that can be easily understood by physicians. We computed model performance on the test dataset as percent of net health benefit, in units of QALYs, compared to full personalization based on all risks and preferences using the original model. Abbreviations include CART: classification and regression tree, CIT: conditional inference tree, C5.0: C5.0 decision tree, ENS-I: interpretable ensemble, GLM: generalized linear model, DL: deep learning model, XRT: extremely randomized trees, DRF: distributed random forest, GBM: gradient boosting machine, and ENS-NI: non-interpretable ensemble.
Table S2 shows the accuracy of the models, computed as the percent of metamodel treatment recommendations that match those from the original model. The best interpretable model was 62.7% accurate in treatment selection, the best interpretable ensemble model was 63.7% accurate, and the best non-interpretable ensemble model was 88.9% accurate. When we compute accuracy assuming that all misclassifications that result in a net health benefit loss of less than 0.05 QALYs (approximately half a month) are in fact correct, these accuracy values increase to 68.8%, 69.6%, and 94.5%, respectively.
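This tolerance-adjusted accuracy can be sketched as follows, reusing the hypothetical nhb matrix and treatment choices from the Methods sketches:

```r
# Accuracy with a 0.05-QALY tolerance: misclassifications that cost less than
# 0.05 projected QALYs are counted as correct.
idx  <- seq_len(nrow(nhb))
loss <- nhb[cbind(idx, orig_choice)] - nhb[cbind(idx, meta_choice)]

accuracy          <- mean(meta_choice == orig_choice)
accuracy_tolerant <- mean(meta_choice == orig_choice | loss < 0.05)
```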
Figure 3 shows the results when we fit the metamodels additionally using steps for variable selection and penalties for difficult, risky, or costly-to-obtain variables. We selected the best performing model in each category (interpretable, interpretable ensemble, and non-interpretable ensemble models). Table S3 shows selected hyperparameter values. The best interpretable model gained 60.5% (95% credible interval [crI]: 55.2-65.4) as many QALYs as the original model, the best interpretable ensemble model gained 60.8% (95% crI: 55.5-65.7) as many QALYs, and the best non-interpretable ensemble model gained 83.8% (95% crI: 80.8-86.6) as many QALYs.
Figure 3. Metamodel performance with variable selection: optimal interpretable, interpretable ensemble, and non-interpretable ensemble models.

We fit machine learning metamodels on the training dataset with variable selection and penalization of difficult, risky, or costly-to-obtain variables. We computed model performance on the test dataset as percent of net health benefit, in units of QALYs, compared to full personalization based on all risks and preferences using the original model. The optimal non-interpretable ensemble model included all risks and no preferences. The optimal interpretable ensemble model included five risks. The optimal interpretable model, a C5.0 decision tree, included four risks.
Table S4 shows the accuracy of the models, computed as the percent of metamodel treatment recommendations that match those from the original model. The best interpretable model was 62.2% accurate in treatment selection, the best interpretable ensemble model was 62.3% accurate, and the best non-interpretable ensemble model was 74.9% accurate. When we compute accuracy assuming that all misclassifications that result in a net health benefit loss of less than 0.05 QALYs (approximately half a month) are in fact correct, these accuracy values increase to 68.2%, 68.4%, and 80.8%, respectively.
The models were generally robust to changes in the QALY penalty per preference. In the base case analysis, the penalty per preference was 0.05 QALYs: for each additional preference included, we decremented the per-person QALYs by 0.05. The metamodels were unchanged when the penalty was higher and when it was up to approximately a factor of 5 lower. For example, the best non-interpretable model did not change until the penalty was 0.014 QALYs per preference or less, at which point the preference regarding weight gain entered the model. With no penalty, all of the preferences entered the model.
Computation time for the metamodels was orders of magnitude shorter than for the original model (Figure S1). The original model took approximately 15 seconds per patient to compute; the metamodels took less than 0.4 seconds per patient.
The performance of the metamodels with variable selection (Figure 3, Table S4) was worse than the performance of the models when variable selection was not used (Figure 2, Table S2) because our method penalizes the use of difficult, risky, or costly-to-obtain variables in the variable selection process. For the non-interpretable ensemble model, the optimal set of variables included all risks but no preferences. The interpretable models selected a subset of these risks, likely due in part to the constraint on maximum tree depth. The non-interpretable ensemble model performed better than the interpretable ensemble model, which in turn performed better than the interpretable model. These results make intuitive sense, as the more complex model architectures can capture additional structure within the data.
We interpreted our best performing model in each category to the degree allowed by the model architecture. Figure 4 shows variable importance for the optimal interpretable, interpretable ensemble, and non-interpretable ensemble models. Variable importance was similar across the different models, with the most important variables being the risks of three side effects—QTc prolongation, weight gain, and prolactin increase—as well as the patient’s PANSS score (a measure of disease severity). In the interpretable ensemble model, patient age was also important; the non-interpretable ensemble models additionally considered the patient’s sex and race. These results make intuitive sense: the original model allocated the majority of patients to amisulpride and clozapine, and the most important variables for the metamodels were those risks for which the treatment effects for amisulpride and clozapine were most different in our base case analysis.
Figure 4. Variable importance.

We computed standardized variable importance for the optimal interpretable, interpretable ensemble, and non-interpretable ensemble models fit with variable selection. For the ensemble models, for each variable, importance is shown as the mean of standardized variable importance for the component models. Abbreviations include QTc: rate-corrected QT interval and PANSS: Positive and Negative Syndrome Scale score.
Figure 5 shows the best interpretable model, a C5.0 decision tree. Patients are variously allocated to clozapine, amisulpride, aripiprazole, or placebo depending on risk of QTc prolongation, weight gain, and prolactin increase as well as PANSS score. We note that, using the original model, 96.4% of patients were assigned to one of these four treatment options (out of 16 treatment options). Physicians can use the decision tree to select a drug for a patient by following the nodes from top to bottom. Thus, the physician would first determine whether the patient has a QTc interval less than or equal to 425 milliseconds. If so, the physician would follow the branch to the left; if not, the physician would follow the branch to the right. Assuming the patient has a QTc interval less than or equal to 425 milliseconds, the physician would next assess patient weight. If the patient weighs more than 67 kilograms, the physician would follow the branch to the right and see that Drug 2, amisulpride, is recommended for that patient. If the patient weighs less than 50 kilograms, then Drug 1, clozapine, is recommended. If the patient weighs between 50 and 67 kilograms, the appropriate drug depends on the patient’s PANSS score: if the score is below 73 (less severe schizophrenia symptoms), then amisulpride is recommended, and if the score is above 73 (more severe symptoms), then clozapine is recommended. Similar decisions can be determined for patients with a QTc interval greater than 425 milliseconds.
Figure 5. Optimal interpretable model.

We plotted the optimal interpretable model, a C5.0 decision tree, fit with variable selection. Abbreviations include QTc: rate-corrected QT interval and PANSS: Positive and Negative Syndrome Scale score. QTc is in units of milliseconds, weight is in units of kilograms, and prolactin is in units of nanograms per milliliter. The numbers at the terminal nodes refer to the recommended drug in each case. Drug 1 is clozapine, drug 2 is amisulpride, drug 9 is aripiprazole, and drug 16 is placebo.
The personalized treatment decisions as determined by this decision tree gained approximately 60.5% as many QALYs as the original model and allocated similar percentages of patients to the different drugs. The simple model allocated 28.5% to clozapine versus 35.6% in the original model, 57.0% to amisulpride versus 43.0%, 0.0% to haloperidol versus 1.2%, 9.4% to aripiprazole versus 11.4%, 0.0% to lurasidone versus 2.4%, and 5.1% to placebo versus 6.4%.
DISCUSSION
Our method using machine learning to generate metamodels can simplify models for personalization of medical treatment decisions. Our method has three primary benefits: it allows for determination of the optimal degree of personalization, which is important since evaluating all risks and preferences for all patients is often infeasible in a clinical setting; it allows for otherwise complex, black box models to be interpreted; and it reduces the computation time needed to determine a personalized treatment.
In applying our modeling framework to simplify a model for personalized selection of antipsychotic drugs for patients with schizophrenia, we found that much simpler models, with fewer variables and shorter computation time, can be used while still preserving much of the health benefits of full personalization based on all risks and preferences using the original model. As expected, non-interpretable model architectures generally performed better than the interpretable model architectures.
The optimal model likely depends on the health care setting. An interpretable or non-interpretable ensemble machine learning model could be run on a computer in the physician’s office to project the optimal treatment for each patient. An interpretable ensemble machine learning model would likely result in worse health outcomes than a non-interpretable model, but would be easier for physicians to understand and interpret, which might lead to more widespread clinical adoption. Alternatively, a simple, interpretable model such as the decision tree we developed for schizophrenia treatment selection could be used by the physician without a computer to project the optimal treatment for each patient. This approach would likely result in the worst health outcomes of the three approaches but could easily be used and understood by physicians without requiring the use of a computer. Such a model could be implemented in the form of a clinical guideline for personalized treatment, similar to those used for treatment of other conditions such as hypertension.46
Methods are currently being developed to address some of the disadvantages of less interpretable metamodels. For example, methods are being developed to allow such models to be better understood.47 Techniques such as partial dependence plots, local interpretable model-agnostic explanations (LIME), and others can help convince physicians that the models are behaving appropriately. While these advances are helpful and important, such models are generally still more difficult to understand than interpretable models and will likely remain that way. Additionally, user-friendly interfaces are being developed to allow less interpretable metamodels to be more easily used in a clinical setting. For example, such an interface was recently developed to support decisions about infant HIV testing and screening.32 These advances will likely help mitigate this particular limitation of less interpretable models.
There is debate over when metamodel performance is sufficient to replace an original, complex model.32 This question might be best addressed through future empirical research to measure or project real-world consequences of treatment personalization using a metamodel. Outcomes of interest might include model adoption rates, health outcomes, and disparities. From this information, cost-effectiveness analysis and related methods might be used to compare use of a metamodel versus the original model.
Our approach has several limitations. A metamodel shares the same limitations as the original model. For example, if the original model was developed with biased data, then the metamodels will likely learn the same biases. Moreover, metamodels simply learn from the underlying model, so metamodel performance will likely be no better than the performance of the original model. However, it is possible that use of the metamodels may result in better patient health outcomes if they recommend exclusion of variables that are risky to obtain (i.e., could cause adverse events). Importantly, metamodels may also result in better patient health outcomes if they are more widely adopted than the more complex original models. Models for personalizing medical treatment decisions, many of which are complex black boxes, have not been widely adopted to date.
A variety of promising areas for further research remain. Research is needed to evaluate different personalized medicine methods in clinical practice, in particular to compare them and to determine the degree to which they increase patient health outcomes in practice as compared to their projections. Lack of such evidence has been cited as a barrier to the use of personalized treatment models.12 It would be useful to continue to develop methods for evaluating patient-specific risks and preferences that can readily be incorporated into clinical workflows. It would also be useful to apply personalization methods to additional medical treatment decisions, particularly for common chronic conditions where a variety of treatment choices are available.
We have considered the creation of a metamodel to emulate a single complex simulation model. As an alternative approach, different complexities of simulation models could be developed to personalize treatment, and the resulting treatment outcomes (e.g., costs, QALYs) compared. This approach might be viable for simple medical decisions but might be challenging for more complex medical decisions as it would require developing, running, and evaluating a separate simulation model for every possible combination of risks and preferences, and potentially other factors such as structural assumptions. In our illustrative example, which has nine patient-specific risks and nine preferences, there would be at least 2^18 = 262,144 models to develop, run, and evaluate if one wanted to comprehensively assess which risks and preferences should be included when making a treatment decision. Moreover, the resulting simulation models would likely be less interpretable than the models we develop and more computationally expensive.
Our method for simplifying models for personalization of medical treatment decisions can generate more readily interpretable and less computationally expensive models that achieve much of the health benefits of full personalization. Simplified models of this type may be more likely to be adopted in a clinical setting than complex models, thereby improving patient outcomes compared to no personalization. Moreover, such simplified models can improve equity in patient health outcomes, as all patients can receive the same personalized, high-quality, evidence-based care regardless of setting. Without use of this framework, only those patients with experienced doctors and in higher-resource settings might receive personalized care.
Supplementary Material
Acknowledgments
Financial support for this study was provided in part by a National Science Foundation Graduate Fellowship (DGE-114747) and in part by a Stanford University Kaseberg Doolan Graduate Fellowship. It was also provided in part by grant R37-DA15612 from the National Institute on Drug Abuse. The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.
References
1. Tervonen T, van Valkenhoef G, Buskens E, Hillege HL, Postmus D. A stochastic multicriteria model for evidence-based decision making in drug benefit-risk analysis. Stat Med. 2011 May 30;30(12):1419–1428.
2. van Valkenhoef G, Tervonen T, Zhao J, de Brock B, Hillege HL, Postmus D. Multicriteria benefit-risk assessment using network meta-analysis. J Clin Epidemiol. 2012 Apr;65(4):394–403.
3. Tervonen T, Naci H, van Valkenhoef G, Ades AE, Angelis A, Hillege HL, Postmus D. Applying multiple criteria decision analysis to comparative benefit-risk assessment: choosing among statins in primary prevention. Med Decis Making. 2015 Oct 1;35(7):859–871.
4. Wen S, Zhang L, Yang B. Two approaches to incorporate clinical data uncertainty into multiple criteria decision analysis for benefit-risk assessment of medicinal products. Value Health. 2014 Jul;17(5):619–628.
5. Broekhuizen H, Groothuis-Oudshoorn CGM, Hauber AB, Jansen JP, IJzerman MJ. Estimating the value of medical treatments to patients using probabilistic multi criteria decision analysis. BMC Med Inform Decis Mak. 2015 Dec 2;15.
6. Lynd LD, Najafzadeh M, Colley L, Byrne MF, Willan AR, Sculpher MJ, Johnson FR, Hauber AB. Using the incremental net benefit framework for quantitative benefit-risk analysis in regulatory decision-making: a case study of alosetron in irritable bowel syndrome. Value Health. 2010 Jul;13(4):411–417.
7. Choi SE, Brandeau ML, Basu S. Dynamic treatment selection and modification for personalised blood pressure therapy using a Markov decision process model: a cost-effectiveness analysis. BMJ Open. 2017 Nov 1;7(11):e018374.
8. Broekhuizen H, IJzerman MJ, Hauber AB, Groothuis-Oudshoorn CGM. Weighing clinical evidence using patient preferences: an application of probabilistic multi-criteria decision analysis. PharmacoEconomics. 2017 Mar;35(3):259–269.
9. Chim L, Salkeld G, Stockler MR, Mileshkin L. Weighing up the benefits and harms of a new anti-cancer drug: a survey of Australian oncologists. Intern Med J. 2015 Aug;45(8):834–842.
10. Kaltoft MK, Turner R, Cunich M, Salkeld G, Nielsen JB, Dowie J. Addressing preference heterogeneity in public health policy by combining cluster analysis and multi-criteria decision analysis: proof of method. Health Econ Rev. 2015;5.
11. Weyant C, Brandeau ML, Basu S. Personalizing medical treatment decisions: integrating meta-analytic treatment comparisons with patient-specific risks and preferences. Med Decis Making. 2019 Nov 9;39(8):998–1009.
12. Fröhlich H, Balling R, Beerenwinkel N, Kohlbacher O, Kumar S, Lengauer T, Maathuis M, Moreau Y, Murphy S, Przytycka T, Rebhan M, Röst H, Schuppert A, Schwab M, Spang R, Stekhoven D, Sun J, Weber A, Ziemek D, Zupan B. From hype to reality: data science enabling personalized medicine. BMC Med. 2018 Aug 27;16(1):150.
13. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019 Jan;25(1):30–36.
14. Watson DS, Krutzinna J, Bruce IN, Griffiths CE, McInnes IB, Barnes MR, Floridi L. Clinical applications of machine learning algorithms: beyond the black box. BMJ. 2019 Mar 12;364:l886.
15. Neumann P, Kim D, Trikalinos T, Sculpher M, Salomon J, Prosser L, Owens D, Meltzer D, Kuntz K, Krahn M, Feeny D, Basu A, Russell L, Siegel J, Ganiats T, Sanders G. Future directions for cost-effectiveness analyses in health and medicine. Med Decis Making. 2018 Oct;38(7):767–777.
16. Jalal H, Dowd B, Sainfort F, Kuntz K. Linear regression metamodeling as a tool to summarize and present simulation model results. Med Decis Making. 2013 Oct;33(7):880–890.
17. Yuan J, Nian V, Su B, Meng Q. A simultaneous calibration and parameter ranking method for building energy models. Appl Energy. 2017 Nov 15;206:657–666.
18. O'Hagan A. Bayesian analysis of computer code outputs: a tutorial. Reliab Eng Syst Saf. 2006 Oct 1;91(10):1290–1300.
19. Wang GG, Shan S. Review of metamodeling techniques in support of engineering design optimization. J Mech Des. 2006 May 4;129(4):370–380.
20. Kleijnen JPC, Sargent RG. A methodology for fitting and validating metamodels in simulation. Eur J Oper Res. 2000 Jan 1;120(1):14–29.
21. Strong M, Oakley JE, Brennan A. Estimating multiparameter partial expected value of perfect information from a probabilistic sensitivity analysis sample: a nonparametric regression approach. Med Decis Making. 2014 Apr;34(3):311–326.
22. Strong M, Oakley JE, Brennan A, Breeze P. Estimating the expected value of sample information using the probabilistic sensitivity analysis sample: a fast, nonparametric regression-based method. Med Decis Making. 2015 Jul;35(5):570–583.
23. Jalal H, Goldhaber-Fiebert JD, Kuntz KM. Computing expected value of partial sample information from probabilistic sensitivity analysis using linear regression metamodeling. Med Decis Making. 2015 Jul 1;35(5):584–595.
24. Brennan A, Chilcott J, Kharroubi S, O'Hagan A. A two level Monte Carlo approach to calculating expected value of perfect information: resolution of the uncertainty in methods. 24th Annual Meeting of the Society for Medical Decision Making; 2002 Oct 23; Baltimore, MD.
25. Degeling K, IJzerman MJ, Lavieri MS, Strong M, Koffijberg H. Introduction to metamodeling for reducing computational burden of advanced analyses with health economic models: a structured overview of metamodeling methods in a 6-step application process. Med Decis Making. 2020 Apr;40(3):348–363.
26. Rojnik K, Naveršnik K. Gaussian process metamodeling in Bayesian value of information analysis: a case of the complex health economic model for breast cancer screening. Value Health. 2008;11(2):240–250.
27. Andrianakis I, Vernon IR, McCreesh N, McKinley TJ, Oakley JE, Nsubuga RN, Goldstein M, White RG. Bayesian history matching of complex infectious disease models using emulation: a tutorial and a case study on HIV in Uganda. PLOS Comput Biol. 2015 Jan 8;11(1):e1003968.
28. Tappenden P, Chilcott J, Eggington S, Oakley J, McCabe C. Methods for expected value of information analysis in complex health economic models: developments on the health economics of beta interferon and glatiramer acetate for multiple sclerosis. Health Technol Assess. 2004 Jun 25;8(27).
29. Merz JF, Small MJ, Fischbeck PS. Measuring decision sensitivity: a combined Monte Carlo–logistic regression approach. Med Decis Making. 1992 Aug 1;12(3):189–196.
30. Stevenson MD, Oakley J, Chilcott JB. Gaussian process modeling in conjunction with individual patient simulation modeling: a case study describing the calculation of cost-effectiveness ratios for the treatment of established osteoporosis. Med Decis Making. 2004 Jan 1;24(1):89–100.
31. Alam MF, Briggs A. Artificial neural network metamodel for sensitivity analysis in a total hip replacement health economic model. Expert Rev Pharmacoecon Outcomes Res. 2020 Nov 1;20(6):629–640.
32. Soeteman DI, Resch SC, Jalal H, Dugdale CM, Penazzato M, Weinstein MC, Phillips A, Hou T, Abrams EJ, Dunning L, Newell M-L, Pei PP, Freedberg KA, Walensky RP, Ciaranello AL. Developing and validating metamodels of a microsimulation model of infant HIV testing and screening strategies used in a decision support tool for health policy makers. MDM Policy Pract. 2020 Jun;5(1):2381468320932894.
33. Koffijberg H, Degeling K, IJzerman MJ, Coupé VMH, Greuter MJE. Using metamodeling to identify the optimal strategy for colorectal cancer screening. Value Health. 2021 Feb 1;24(2):206–215.
34. Zhong H, Brandeau M, Yazdi G, Wang J, Nolen S, Hagen L, Thompson W, Assoumou S, Linas B, Salomon J. Metamodeling for policy simulations with multivariate outcomes. Working paper, Stanford University; 2021.
35. Timofeev R. Classification and regression trees (CART) theory and applications. Berlin: Humboldt University; 2004.
36. Hothorn T, Hornik K, Zeileis A. ctree: conditional inference trees. Compr R Arch Netw. 2015;1–34.
37. Kuhn M, Johnson K. Classification models. In: Applied Predictive Modeling. New York, NY: Springer; 2013. p. 245–460.
38. Doupe P, Faghmous J, Basu S. Machine learning for health services researchers. Value Health. 2019 Jul 1;22(7):808–815.
39. Wolpert D. Stacked generalization. Neural Netw. 1991;5(2):241–259.
40. Breiman L. Stacked regressions. Mach Learn. 1996;24:49–64.
41. van der Laan M, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol. 2007;6.
42. Polley EC, Rose S, van der Laan MJ. Super learning. In: Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer; 2011. p. 43–66.
43. Saloniki E-C, Malley J, Burge P, Lu H, Batchelder L, Linnosmaa I, Trukeschitz B, Forder J. Comparing internet and face-to-face surveys as methods for eliciting preferences for social care-related quality of life: evidence from England using the ASCOT service user measure. Qual Life Res. 2019 Aug 1;28(8):2207–2220.
44. Neumann PJ, Cohen JT, Weinstein MC. Updating cost-effectiveness: the curious resilience of the $50,000-per-QALY threshold. N Engl J Med. 2014 Aug 28;371(9):796–797.
45. R Core Team. R: a language and environment for statistical computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2018. Available from: https://www.R-project.org/
46. James PA, Oparil S, Carter BL, Cushman WC, Dennison-Himmelfarb C, Handler J, Lackland DT, LeFevre ML, MacKenzie TD, Ogedegbe O, Smith SC, Svetkey LP, Taler SJ, Townsend RR, Wright JT, Narva AS, Ortiz E. 2014 evidence-based guideline for the management of high blood pressure in adults: report from the panel members appointed to the Eighth Joint National Committee (JNC 8). JAMA. 2014 Feb 5;311(5):507–520.
47. Hall P, Ambati S, Phan W. Ideas on interpreting machine learning [Internet]. O'Reilly Media; 2017 [cited 2021 Mar 20]. Available from: https://www.oreilly.com/radar/ideas-on-interpreting-machine-learning/