Empirical Treatment Effectiveness Models for Binary Outcomes

Jarrod E Dalton; Neal V Dawson; Daniel I Sessler; Jesse D Schold; Thomas E Love; Michael W Kattan

doi:10.1177/0272989X15578835

Author manuscript; available in PMC: 2017 Jan 1.

Published in final edited form as: Med Decis Making. 2015 Apr 7;36(1):101–114. doi: 10.1177/0272989X15578835

PMCID: PMC4596743 NIHMSID: NIHMS669769 PMID: 25852080

Abstract

Randomized trials provide strong evidence regarding efficacy of interventions but are limited in their capacity to address potential heterogeneity in effectiveness within broad clinical populations. For example, a treatment that on average is superior may be distinctly worse in certain patients. We propose a technique for using large electronic health registries to develop and validate decision models which measure – for distinct combinations of covariate values – the difference in predicted outcomes between two alternative treatments. We demonstrate the methodology in a prototype analysis of in-hospital mortality under alternative revascularization treatments.

First, we developed prediction models for a binary outcome of interest for each treatment. Decision criteria were then defined based on the treatment-specific model predictions. Patients were then classified as receiving concordant or discordant care (in relation to the model recommendation) and the association between discordance and outcomes was evaluated. We then present alternative decision criteria and validation methodologies, as well as sensitivity analyses which investigate a) imbalance between treatments on observed covariates and b) the aggregate impact of unobserved covariates.

Our methodology supplements population-average clinical trial results by modeling heterogeneity in outcomes according to specific covariate values. It thus allows for assessment of current practice, from which cogent hypotheses for improved care can be derived. Newly emerging large population registries will allow for accurate predictions of outcome risk under competing treatments, as complex functions of predictor variables. Whether or not the models might be used to inform decision making depends on the extent to which important predictors are available. Further work is needed to understand the strengths and limitations of this approach, particularly in relation to those based on randomized trials.

Randomized trials provide excellent evidence for the expected (or average) response to treatment in a given tested population. However, even within tightly selected populations, there can be considerable individual response variability. Specific individuals may have worsened outcomes as a result of treatment even if their group, on average, did better. This phenomenon, known as “heterogeneity in treatment effect,” is especially problematic in typical clinical populations which are usually less controlled and homogeneous than those in most clinical trials. Clinicians making decisions on the basis of the best available trial evidence may assign some patients to treatments that actually worsen their outcomes — even when those exact treatments, on average, improve health.

Translation of mechanistic knowledge acquired from randomized trials into effective treatment strategies for individual patients is critical for achieving improvements in health. Comparative effectiveness research [1] aimed at identifying the clinical and environmental conditions where interventions are most likely to work can help to ensure successful translation. Methodological development for performing comparative effectiveness research is ongoing. A common approach is to stratify the sample and estimate group-level treatment effects within various subpopulations, but in practice it is often not possible to study even a small fraction of the potentially relevant sub-groups due to reduced power. Analyses that are specific to distinct combinations of disease-related characteristics and patient comorbidities can potentially provide more accurate representations of treatment benefit. However, it remains unclear which statistical techniques are most appropriate for patient-centered outcomes research [2].

Our primary goal was to propose a practical and empirical technique for building and validating clinical decision models that: a) recommend one of two (or more) competing treatment alternatives that will optimize the predicted outcome; b) estimate the relative odds of outcome under competing treatments, given a specific combination of input variables; and, c) evaluate existing clinical decision making practices against the recommendations set forth by the model. We believe our methodology may help in identifying potential pathways and barriers to treatment effectiveness, stimulating further research.

We chose the decision between percutaneous coronary intervention (PCI) and coronary artery bypass grafting (CABG) with respect to minimizing risk of postoperative in-hospital mortality as a motivating setting. The choice between these two procedures for a given patient is nuanced and involves consideration of both risk factors and various aspects of their cardiovascular disease. Current guidelines set forth by joint task forces involving the American College of Cardiology Foundation and the American Heart Association [3, 4] as well as the European Society of Cardiology and the European Association for Cardio-Thoracic Surgery [5] reflect this complexity and may therefore be difficult to incorporate into individual practice. Furthermore, the guidelines are based mostly on population-level comparisons without consideration of individual treatment effects (though guidelines for certain subpopulations such as those with end-stage renal disease or Type II diabetes mellitus have been incorporated).

The purpose of our decision modeling presentation is to better describe the underlying methodology, not to inform actual health care decisions. Though in-hospital mortality is an important consideration for patients undergoing revascularization, it is but one component of a complex risk profile, and the decision should incorporate all these risks.

Modeling Methodology

We consider the decision between two competing treatments; extensions to three or more competing treatments are straightforward. While we consider only binary outcomes in this article, the methodology could be modified to accommodate continuous and time-to-event outcomes.

Our modeling approach incorporates two phases: development (Phase I) and validation (Phase II). In the development phase, separate, treatment-specific prediction models for an adverse outcome of interest such as mortality are established and calibrated. We then define decision criteria and related measures, proposing three possible approaches, namely: a) binary recommendation of the treatment for which predicted probability of the undesired outcome is minimized; b) definition of a decision model odds ratio, equal to the predicted odds of the outcome under one treatment divided by the predicted odds of the outcome under the competing treatment (or, alternatively, a risk ratio equal to the ratio of predicted probabilities); and, c) definition of a small number of categories based on the models’ predicted probabilities and the decision model odds ratio. In the validation phase, a new data set is used to compare recommendations from the models against observed decisions made in the standard process of care. As we detail later, one proposed validation technique is to evaluate the association between discordance (patients given the opposite treatment of that recommended by the model) and outcome. We also propose a calibration analysis for the decision model odds ratio and sensitivity analyses investigating the potential effects of imbalance on covariates between treatments and the potential (aggregate) effects of unobserved covariates on decision model odds ratio estimates.

Data Considerations

We analyzed data on 586,754 hospital inpatient discharges in which either PCI or CABG was the primary procedure. These data were extracted from the following 2009–2011 U.S. Agency for Healthcare Research and Quality State Inpatient Databases: Arizona, California, Florida, Iowa, Maryland, Michigan, and New Jersey. For the purposes of inclusion in the study, discharges with an International Classification of Diseases and Injuries, Version 9, Clinical Modification (ICD-9-CM) primary procedure code of 36.1x (bypass anastomosis for heart revascularization) were considered as CABG discharges, while discharges with an ICD-9-CM primary procedure code of 36.09 (coronary angioplasty not otherwise specified) or 00.66 (percutaneous transluminal coronary angioplasty or coronary atherectomy) were considered as PCI discharges.

As our proposed models are intended for application in large and diverse population registries, we felt that random data partitioning provided the most straightforward way to implement the proposed decision modeling procedure; however, cross validation or bootstrap resampling with out-of-sample test error estimation [6–8] could in principle be used. For our prototype analysis regarding the decision between PCI and CABG, the discharge records under each treatment were randomly divided into three datasets prior to modeling. The first two datasets were used for development (50% of that treatment’s discharge records) and calibration (25% of that treatment’s discharge records) of that treatment’s probability model for in-hospital mortality, respectively (Phase I). The remaining 25% of the discharge records were allocated to a (combined) validation dataset to evaluate outcomes as they relate to model recommendations and actual treatments administered (Phase II).
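The partitioning scheme described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis; the record layout, the `treatment` key, and the function name are assumptions made for the example.

```python
import random

def partition_by_treatment(records, treatment_key="treatment", seed=0):
    """Randomly split each treatment's records 50/25/25 into
    development, calibration, and validation sets."""
    rng = random.Random(seed)
    by_treatment = {}
    for rec in records:
        by_treatment.setdefault(rec[treatment_key], []).append(rec)

    development, calibration, validation = {}, {}, []
    for treatment, recs in by_treatment.items():
        shuffled = recs[:]
        rng.shuffle(shuffled)
        n_dev = len(shuffled) // 2
        n_cal = len(shuffled) // 4
        # Phase I: treatment-specific development and calibration sets.
        development[treatment] = shuffled[:n_dev]
        calibration[treatment] = shuffled[n_dev:n_dev + n_cal]
        # Phase II: validation records from both treatments are pooled.
        validation.extend(shuffled[n_dev + n_cal:])
    return development, calibration, validation

# Hypothetical toy data: 100 CABG and 300 PCI records.
records = [{"id": i, "treatment": "PCI" if i % 4 else "CABG"} for i in range(400)]
dev, cal, val = partition_by_treatment(records)
```

Keeping the development and calibration sets separate by treatment, while pooling the validation records, mirrors the way the two phases use the data.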

Phase I: Model Development

For each given treatment, we developed a probability model for the outcome using the treatment’s model development dataset. Specifically, we obtained the following functions expressing the estimated probability of outcome given a set of covariates:

$\hat{p}^{(1)}(X) = \widehat{\Pr}(Y = 1 \mid X = x, T = 1)$ and

$\hat{p}^{(2)}(X) = \widehat{\Pr}(Y = 1 \mid X = x, T = 2)$.

In these equations, Y is an indicator variable for the outcome, X is a vector of input variables (or covariates), and T is a treatment indicator (e.g., T = 1 for PCI and T = 2 for CABG). A standard approach for building these probability models is multivariable logistic regression, although other predictive modeling algorithms such as random forests [9] or elastic net [10] could be used.

In our PCI vs. CABG example, we used elastic net logistic regression to estimate each of the two probability functions.¹ Elastic net works by purposely biasing (“shrinking”) regression coefficients towards zero; coefficients for variables not independently associated with the outcome are shrunken all the way to a value of zero (in order to implement variable selection), while coefficients for correlated predictors are averaged together (in order to prevent the adverse effects of multicollinearity on precision of coefficient estimates). Considered as predictors for these models were age, sex, and the set of present-on-admission ICD-9-CM diagnosis codes for each patient. As ICD-9-CM diagnosis codes are very detailed and involve over 14,000 separate categories, we hierarchically aggregated them by truncating trailing digits from sparsely-represented codes. Truncation of the fifth and subsequently the fourth digit from a given code occurred when that code was represented by fewer than 500 patients in the overall dataset (prior to partitioning). This method, which is described in detail elsewhere [11], resulted in 747 distinct diagnosis-related predictors that were considered for each of the two treatment-specific elastic net logistic regression models. Thirty predictors were selected for the CABG model and 68 were selected for the PCI model. Odds ratio parameter estimates from the two models are presented in Figure 1.
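The hierarchical aggregation of sparse diagnosis codes might look roughly like the sketch below. The exact procedure is described in reference [11]; this version is a simplification that assumes codes are stored as digit strings without the decimal point and ignores special cases such as E and V codes. The function name and data layout are hypothetical.

```python
from collections import Counter

def aggregate_sparse_codes(patient_codes, min_patients=500):
    """Truncate the fifth, then the fourth, digit of any code carried by
    fewer than `min_patients` patients.

    `patient_codes` maps a patient id to a set of code strings."""
    current = {pid: set(codes) for pid, codes in patient_codes.items()}
    for length in (5, 4):
        # Each patient holds a set, so this counts patients per code.
        counts = Counter(code for codes in current.values() for code in codes)
        current = {
            pid: {
                code[:length - 1]
                if len(code) == length and counts[code] < min_patients
                else code
                for code in codes
            }
            for pid, codes in current.items()
        }
    return current

# Toy example with a threshold of 2 patients instead of 500.
toy = {1: {"41071"}, 2: {"41071"}, 3: {"41072"}, 4: {"4280"}, 5: {"25000"}}
aggregated = aggregate_sparse_codes(toy, min_patients=2)
```

In the toy data, the code shared by two patients is kept at full detail, while each code carried by a single patient is shortened to three digits.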

Odds ratio estimates for in-hospital mortality for both the CABG and PCI prediction models, based on elastic net logistic regression. Results sorted by ascending odds ratio estimate under the PCI model, which included 68 variables; the CABG model included 30 variables. The plot excludes estimates relating to palliative care due to space issues; this predictor was selected for both prediction models and the estimated odds ratio was 102 for the CABG model and 32 for the PCI model. MI = myocardial infarction.

A common graphical method for evaluating the calibration of probability models is to plot the observed incidence of the outcome as a function of the model-predicted probability [12]. However, these methods only indicate lack of fit, as opposed to remedying miscalibration. Recently, we proposed a flexible modeling technique for bias-correcting (or recalibrating) clinical prediction models [13]. We used this method to recalibrate both the CABG and PCI probability models. Restricted cubic splines (with two knots) were incorporated in order to model nonlinearities in the calibration curves. Details of the calibration procedure for the individual probability models are given in Appendix 1.

Calibration for each of the two models was generally good, and improved upon model recalibration (Figure 2). Discrimination of the CABG model was moderate (concordance index, or C-statistic, of 0.77) and discrimination of the PCI model was good (C=0.91).
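For readers unfamiliar with the concordance index, the following sketch computes it directly from its definition: the proportion of event/non-event pairs in which the patient with the event has the higher predicted probability. It is a quadratic-time illustration with made-up inputs, not the routine used in our analysis.

```python
def c_statistic(predicted, outcomes):
    """Proportion of (event, non-event) pairs in which the event has the
    higher predicted probability; tied predictions count as one half."""
    events = [p for p, y in zip(predicted, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                score += 1.0
            elif p_event == p_non_event:
                score += 0.5
    return score / (len(events) * len(non_events))

# Hypothetical predictions for four patients, two of whom had the outcome.
c = c_statistic([0.9, 0.4, 0.35, 0.1], [1, 0, 1, 0])
print(c)  # 0.75
```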

Calibration of probability models p̂⁽¹⁾(X) and p̂⁽²⁾(X). The left panel represents calibration of the original models within the calibration dataset. This calibration curve was used to adjust estimates and the calibration curve was re-estimated within the validation dataset (right panel). The figure indicates that calibration was generally good for both models after the recalibration step.

Decision Criteria and Related Measures

Next, we detail the three possible approaches mentioned above for describing the multiple predicted outcomes among competing treatments.

1) Binary Recommended Treatment

In the first approach, the recommended treatment for a patient with given values of the predictor variables is simply the one for which the predicted probability of the outcome is lower.² While this approach ignores quantifications of absolute and relative risk derived from each treatment-specific model, a binary recommendation for one of the two competing treatments is the easiest to interpret and allows for straightforward analyses of observed versus model-recommended decisions.

With this approach, we can define concordant and discordant treatments as follows: Let R_i and O_i represent the model-recommended and observed treatment for patient i in the validation dataset, respectively. A concordant treatment is one for which O_i = R_i; likewise, a discordant treatment is one for which O_i ≠ R_i. We use the symbol D_i to represent discordance; that is, D_i = 1 if O_i ≠ R_i and D_i = 0 otherwise.
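The binary recommendation and the discordance indicator can be written as two small functions. This is an illustrative sketch; the function names are hypothetical, and assigning ties to PCI is an assumption made only so that the example is well defined.

```python
def recommend(p_pci, p_cabg):
    """Recommend the treatment with the lower predicted probability.
    Ties go to PCI here (an arbitrary choice for this sketch)."""
    return "PCI" if p_pci <= p_cabg else "CABG"

def discordance(observed, p_pci, p_cabg):
    """Return 1 if the observed treatment differs from the recommendation,
    and 0 otherwise."""
    return int(observed != recommend(p_pci, p_cabg))

# Hypothetical patient: predicted mortality 2% under PCI, 5% under CABG.
d_if_cabg = discordance("CABG", 0.02, 0.05)  # 1 (discordant)
d_if_pci = discordance("PCI", 0.02, 0.05)    # 0 (concordant)
```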

For our PCI vs. CABG analysis, we found that, among the 146,753 discharges in the validation dataset, 9,304 (6.3%) were recommended for CABG and 137,449 (93.7%) were recommended for PCI. Overall, there were 39,039 discordant discharges (26.6% of those in the validation dataset). Discordance was observed for 7,838/9,304 (84.2%) of those recommended for CABG and 31,201/137,449 (22.7%) of those recommended for PCI.

The incidence [Bonferroni-adjusted 95% confidence interval] of in-hospital mortality was 0.68% [0.62% to 0.74%] among concordant discharges and 2.8% [2.6% to 3.0%] among discordant discharges, corresponding to an odds ratio [95% confidence interval] of 4.2 [3.8 to 4.6]. That is, the odds of mortality were approximately four times greater among discordant discharges than among concordant discharges.
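As a consistency check, the reported odds ratio can be approximately recovered from the two rounded incidences. The published estimate was presumably computed from the underlying counts, so the sketch below only confirms that the figures agree to one decimal place.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def odds_ratio(p_exposed, p_reference):
    """Odds ratio comparing two incidences."""
    return odds(p_exposed) / odds(p_reference)

# Discordant (2.8%) vs. concordant (0.68%) in-hospital mortality.
or_discordant = odds_ratio(0.028, 0.0068)
print(round(or_discordant, 1))  # 4.2
```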

2) Decision Model Odds Ratio

In the second approach, a covariate-specific decision model odds ratio θ̂_i (where the subscript i indexes observations in the validation dataset) is defined as follows:

$\hat{\theta}_i = \frac{\hat{p}^{(2)}(X_i) / (1 - \hat{p}^{(2)}(X_i))}{\hat{p}^{(1)}(X_i) / (1 - \hat{p}^{(1)}(X_i))}$.

Alternatively, this may be expressed as a risk ratio (ratio of probabilities); however, an advantage of using the odds ratio formulation is that it more readily allows assessment of calibration (see “Phase II - Model Validation” below). When absolute risks are low, as in the case of many clinical endpoints, the odds ratio and risk ratio are close to one another.
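The sketch below computes the decision model odds ratio from a pair of predicted probabilities and compares it with the corresponding risk ratio. The input probabilities are invented for illustration; they show that the two measures nearly coincide at low absolute risk and diverge at high absolute risk.

```python
def decision_odds_ratio(p_pci, p_cabg):
    """Predicted odds under CABG (T = 2) divided by predicted odds
    under PCI (T = 1). Both inputs must lie strictly between 0 and 1."""
    return (p_cabg / (1.0 - p_cabg)) / (p_pci / (1.0 - p_pci))

def risk_ratio(p_pci, p_cabg):
    """Ratio of predicted probabilities, CABG over PCI."""
    return p_cabg / p_pci

# Low absolute risk: the odds ratio is close to the risk ratio of 2.
theta_low = decision_odds_ratio(0.005, 0.010)
rr_low = risk_ratio(0.005, 0.010)

# High absolute risk: the odds ratio (3) departs from the risk ratio (2).
theta_high = decision_odds_ratio(0.25, 0.50)
rr_high = risk_ratio(0.25, 0.50)
```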

3) Other Approaches

The decision model odds ratio is a relative quantity; when absolute risks are clinically low, a ratio very different from 1.0 might not represent meaningful absolute differences in likelihood of outcome. In light of this, we describe a third approach whereby a set of discrete strata are defined according to the values of p̂⁽¹⁾(X_i), p̂⁽²⁾(X_i), and θ̂_i. For example, the following strata were pre-specified for our PCI vs. CABG example (see Figure 3):

1. low risk, defined as both p̂⁽¹⁾(X_i) < 0.005 and p̂⁽²⁾(X_i) < 0.005;
2. large predicted benefit under CABG, defined as not low risk with $\hat{\theta}_i^{-1} > 2.0$;
3. small predicted benefit under CABG, defined as not low risk with $1.3 < \hat{\theta}_i^{-1} \leq 2.0$;
4. equivocal risk, defined as not low risk with the two models' predicted odds within ±30% of one another, i.e., 1.3⁻¹ ≤ θ̂_i < 1.3;
5. small predicted benefit under PCI, defined as not low risk with 1.3 ≤ θ̂_i < 2.0; and
6. large predicted benefit under PCI, defined as not low risk with θ̂_i ≥ 2.0.

Choice of the thresholds that govern assignment to these strata is subject to interpretation and should represent clinically meaningful values.
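The stratum definitions for the PCI vs. CABG example translate into a short classification rule. The sketch below uses the thresholds given above as default arguments; the function name and argument names are hypothetical, and both probabilities are assumed to lie strictly between 0 and 1.

```python
def risk_stratum(p_pci, p_cabg, low=0.005, small=1.3, large=2.0):
    """Assign a patient to one of the six pre-specified strata."""
    if p_pci < low and p_cabg < low:
        return "low risk"
    # Decision model odds ratio: odds under CABG over odds under PCI.
    theta = (p_cabg / (1.0 - p_cabg)) / (p_pci / (1.0 - p_pci))
    if 1.0 / theta > large:
        return "large predicted benefit under CABG"
    if 1.0 / theta > small:
        return "small predicted benefit under CABG"
    if theta < small:
        return "equivocal risk"
    if theta < large:
        return "small predicted benefit under PCI"
    return "large predicted benefit under PCI"

# Hypothetical patient: 5% predicted mortality under PCI, 7% under CABG.
stratum = risk_stratum(0.05, 0.07)  # "small predicted benefit under PCI"
```

Checking the low-risk condition first means that the ratio-based strata apply only to patients who are not low risk.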

Schematic example of discrete risk strata for the PCI vs. CABG decision model. Pre-specified values of the decision model odds ratio θ̂_i characterize thresholds which, together with the pair of model predictions under each treatment, p̂⁽¹⁾(X) (predicted probability under the PCI model) and p̂⁽²⁾(X) (predicted probability under the CABG model), define the strata. Stratum #1 was specified as low-risk under both models (i.e., both predictions <0.005).

Table 1 contains results of this stratified analysis. Most cases (53.4%) were predicted to have large benefit under PCI relative to CABG (category 6 above). In this group, there was a 23.8% discordance rate; that is, 23.8% of patients received CABG despite the large predicted benefit of PCI. These patients had 3.2 (95% CI: 2.6 to 3.9) times the odds of dying in the hospital compared with patients receiving concordant PCI treatment. The next largest category was those at low predicted risk under both treatments (category 1 above). Even though absolute risk in this group was low, with mortality of 0.4% for patients receiving CABG and 0.1% for patients receiving PCI, discordance was still associated with an odds ratio for in-hospital mortality of 3.5 [1.9 to 6.5]. Those at equivocal levels of predicted risk under each treatment did not indicate significantly different odds of mortality (odds ratio of 1.2 [0.9 to 1.6]).

Table 1.

Summary and results of observed treatment decisions by evidence strata for the PCI vs. CABG example, using the validation dataset (N = 146,753). Discordant-vs.-concordant odds ratios compare, within each stratum, patients who were administered the treatment with higher predicted in-hospital mortality (either PCI or CABG) to patients who were administered the treatment with lower predicted mortality.

Risk Stratum	N (%) of all discharges in validation dataset	Actual Treatment CABG, N (% of risk stratum)	Actual Treatment PCI, N (% of risk stratum)	In-Hospital Mortality under CABG, % [95% CI]	In-Hospital Mortality under PCI, % [95% CI]	Odds Ratio [95% CI], Discordant vs. Concordant
Low risk: predicted probabilities <0.5% for both CABG and PCI	48,303 (32.9%)	10,738 (22.2%)	37,565 (77.8%)	0.4 [0.2, 0.5]	0.1 [0.1, 0.1]	3.5 [1.9, 6.5]
Large predicted benefit under CABG†: predicted odds under PCI >2.0× the predicted odds under CABG	2,561 (1.7%)	435 (17.0%)	2,126 (83.0%)	7.6 [4.2, 10.9]	16.8 [14.7, 19.0]	2.5 [1.5, 4.1]
Small predicted benefit under CABG†: predicted odds under PCI 1.3× – 2.0× the predicted odds under CABG	2,687 (1.8%)	448 (16.7%)	2,239 (83.3%)	5.8 [2.9, 8.7]	6.4 [5.1, 7.8]	1.1 [0.6, 2.0]
Equivocal†: predicted odds within ±30% of one another	6,990 (4.8%)	1,022 (14.6%)	5,968 (85.4%)	5.5 [3.6, 7.4]	4.5 [3.8, 5.2]	1.2 [0.9, 1.6]
Small predicted benefit under PCI†: predicted odds under CABG 1.3× – 2.0× the predicted odds under PCI	7,843 (5.3%)	1,347 (17.2%)	6,496 (82.8%)	4.8 [3.3, 6.4]	2.4 [1.9, 2.9]	2.1 [1.4, 3.1]
Large predicted benefit under PCI†: predicted odds under CABG >2.0× the predicted odds under PCI	78,369 (53.4%)	18,677 (23.8%)	59,692 (76.2%)	1.7 [1.4, 1.9]	0.5 [0.4, 0.6]	3.2 [2.6, 3.9]

† Assumes that the patient is not low risk.

Finally, direct evaluation of the vector of probabilities [p̂⁽¹⁾ (X_i), p̂⁽²⁾ (X_i)] is straightforward and may be appealing for applications in which patients are heterogeneous with respect to both absolute and relative risk measures under each treatment.

Phase II – Model Validation

In the context of our proposed methodology, model validation generally entails the comparison of actual decisions made during the course of care to predictive information given by the decision model, whether the predictive information is given in the form of a recommended treatment, a decision model odds ratio, or a risk stratum.

The (global) effect of discordance is then defined as the odds ratio for the outcome comparing discordant decisions to concordant decisions. This can be estimated using a standard logistic regression model; furthermore, hypotheses involving variations in the incidence of discordance and variations in the discordant-to-concordant odds ratio for outcome across subpopulations can be studied by incorporating covariates into this logistic regression model. A simple extension to this logistic model which incorporates a main effect term for discordance, a main effect term representing the various risk strata, as well as their interaction allows the estimation of strata-specific discordant-to-concordant odds ratios. We estimated both global and stratum-specific effects of discordance for our PCI vs. CABG analysis using the combined validation dataset.

Key to validating the decision model odds ratio θ̂_i is addressing the question of whether or not the increment (or decrement) in odds described by the metric actually holds when the discordant treatment is administered. In other words, we evaluate the relationship between the decision model odds ratio θ̂_i and the outcomes of discordant treatment. We propose a technique for evaluating the calibration of the decision model odds ratio θ̂_i within the validation dataset, using an analysis akin to the recalibration methods we described previously [13]. The goal is to generate a calibration curve that allows for graphical assessment against the horizontal line given at y = 0, which represents perfect calibration of θ̂_i. The required derivations for producing this curve are given in Appendix 1 and Appendix 2. We used a three-knot cubic spline decision model odds ratio calibration model within our PCI vs. CABG validation dataset.

Calibration of the discordant-to-concordant odds ratio θ̂_i was generally good (Figure 4). Slight miscalibrations for extremely high odds ratios (i.e., odds ratios greater than 10) were suggested by the curve. These miscalibrations were not large enough to switch the recommended treatment and could have simply been the result of estimation issues in the tail of the spline curve.

Calibration curve for the decision model odds ratio θ̂. Values of θ̂ greater than one suggest that the odds of in-hospital mortality under CABG are θ̂ times greater than the odds under PCI, and values of θ̂ less than one suggest the odds under PCI are θ̂⁻¹ times greater than the odds under CABG. Perfect calibration is indicated by the horizontal line given at y = 0. Shaded bands represent pointwise 95% confidence intervals. The observed-to-expected (O/E) ratio plotted on the y-axis can be interpreted as follows: An O/E ratio of 0.5 implies that true odds of in-hospital mortality among discordant patients was half of that described by the decision model odds ratio θ̂. For instance, when θ̂ = 10, the calibration model indicated an O/E ratio of approximately 0.8; thus, the true increase in odds associated with discordant decisions was a factor of eight instead of a factor of ten.

It is also important to consider the potential impact of the model at the population level. Whereas in some cases the mortality reduction is clinically inconsequential for individual patients, applications to populations of patients may introduce meaningful reductions in “statistical lives” [14]. For instance, approximately 1 million revascularization procedures occur annually in the United States [15]. We estimated expected mortality per 1 million patients treated under two scenarios: a) assuming the proportion of patients within each risk stratum administered each procedure is equal to that observed in our randomly-selected validation dataset, and b) assuming patients are treated according to the model recommendation. Results, which are displayed in Table 2, suggest that “statistical averted mortality” – defined in this case as the potential number of in-hospital deaths avoided when decisions are all concordant with the model recommendation – might be as high as 3–4 thousand deaths per annum in the United States.

Table 2.

Observed mortality and predicted mortality with perfect concordance to the decision model by risk stratum. Estimates reflect number of deaths per 1 million patients treated, which is approximately the number of revascularization procedures annually performed in the United States [15]. Statistical averted mortality estimates assume the predicted incidence of mortality under the model-recommended treatment for each risk stratum; these are reported in Table 1. An alternative calculation for the estimated number of deaths under the recommended treatment, which simply uses the sum of all patients’ predicted probability of in-hospital mortality under the recommended treatment, yielded a total expected mortality of 7,817 (instead of 9,157), for a statistical averted mortality of 4,489 per 1 million treated.

Risk Stratum	Estimated Number of Patients in Risk Stratum (per 1 million)	Estimated Number of Deaths Under Actual Treatment Allocation (A)	Estimated Number of Deaths Under Recommended Treatment Allocation (B)	Statistical Averted Mortality (A–B)
Low risk: predicted probabilities <0.5% for both CABG and PCI	329,145	518	343	175
Large predicted benefit under CABG†: predicted odds under PCI >2.0× the predicted odds under CABG	17,451	2,664	1,324	1,340
Small predicted benefit under CABG†: predicted odds under PCI 1.3× – 2.0× the predicted odds under CABG	18,310	1,158	1,063	95
Equivocal†: predicted odds within ±30% of one another	47,631	2,221	2,350	−129
Small predicted benefit under PCI†: predicted odds under CABG 1.3× – 2.0× the predicted odds under PCI	53,444	1,486	1,259	227
Large predicted benefit under PCI†: predicted odds under CABG >2.0× the predicted odds under PCI	534,020	4,259	2,818	1,441
Totals	1,000,001‡	12,306	9,157	3,149

† Assumes that the patient is not low risk.

‡ This is not equal to 1,000,000 due to rounding error.
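The arithmetic behind Table 2 can be reproduced from the stratum sizes in Table 1 and the per-stratum death estimates in Table 2. The sketch below is a bookkeeping check only: it rescales the validation-set stratum counts to a population of 1 million and sums the reported deaths under each allocation. The variable names are ours.

```python
N_VALIDATION = 146_753

# (stratum, discharges in validation dataset,
#  deaths per million under actual allocation (A),
#  deaths per million under recommended allocation (B))
STRATA = [
    ("Low risk", 48_303, 518, 343),
    ("Large predicted benefit under CABG", 2_561, 2_664, 1_324),
    ("Small predicted benefit under CABG", 2_687, 1_158, 1_063),
    ("Equivocal", 6_990, 2_221, 2_350),
    ("Small predicted benefit under PCI", 7_843, 1_486, 1_259),
    ("Large predicted benefit under PCI", 78_369, 4_259, 2_818),
]

# Rescale each stratum to a population of 1 million treated patients.
patients_per_million = [
    round(n / N_VALIDATION * 1_000_000) for _, n, _, _ in STRATA
]

deaths_actual = sum(a for _, _, a, _ in STRATA)
deaths_recommended = sum(b for _, _, _, b in STRATA)
averted = deaths_actual - deaths_recommended

# Alternative calculation from the Table 2 caption: summed predicted
# probabilities under the recommended treatment give 7,817 expected deaths.
averted_alternative = deaths_actual - 7_817
```

Rounding each stratum separately is what produces the total of 1,000,001 patients noted in the table footnote.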

Sensitivity Analyses

1) Influence of Covariates on Treatment Selection

Our proposed decision modeling methodology works by creating two multidimensional functions relating patient characteristics to the predicted probability of outcome – one function for CABG and another for PCI. When decisions systematically occur in response to patients’ comorbid status, estimation may be inaccurate due to non-overlap in covariate distributions. To evaluate this phenomenon, we conducted a sensitivity analysis which first estimated propensity scores (defined as each patient’s probability of receiving CABG given the values of their baseline covariates) and then evaluated performance of the decision model as a function of the propensity score. The propensity score model was estimated using the combined training dataset and applied to the observations in the validation dataset for analysis. Then, using the validation dataset, we estimated the overall discordant-to-concordant odds ratio as a function of the propensity score using cubic spline logistic regression.

The observed C-statistic for the propensity model – applied to the observations in the validation dataset – was 0.77, indicating weak-to-moderate ability of the observed baseline covariates to predict observed treatments. Propensity scores overlapped substantially (Figure 5), ranging between 0.1% and 99.9% for patients treated with CABG and ranging between <0.1% and 99.8% for patients treated with PCI. The discordant-to-concordant odds ratio was greatest among patients with low propensity scores (i.e., those whose covariates predicted that they were likely to receive PCI) and gradually declined as the propensity score increased. Based on the pointwise confidence interval estimates of the curve, the odds ratio remained significantly greater than one for all whose propensity scores were less than 0.64, which amounted to approximately 97% of cases in the validation cohort.

Results of our first sensitivity analysis which evaluated potential effects of imbalanced covariates between observed treatments. Propensity scores (i.e., predicted probability of receiving CABG based on each patient’s combination of baseline characteristics) were estimated for each of the discharges in the validation cohort. In the figure, histograms describing each (actual) treatment group’s propensity score distribution underlie a plot of the estimated overall discordant-to-concordant odds ratio vs. propensity score. The curve indicates that, for all patients with propensity scores less than 0.64 – approximately 97% of cases in the validation cohort – discordance was still associated with increased odds of in-hospital mortality. On the other hand, the decision model may not be appropriate for certain patients strongly indicated for CABG (i.e., those with propensity scores >0.64).

2) Major Unobserved Covariates

Another sensitivity analysis was carried out to evaluate the potential effect of an unobserved binary covariate on the results. The occurrence of this unobserved covariate, which we denote as X̃, was assumed to be independent of the observed covariates in the analysis. Likewise, the effect of X̃ on the predicted odds of in-hospital mortality under each treatment’s model – which we represent using an odds ratio ζ_CABG for the CABG model and an odds ratio ζ_PCI for the PCI model – was assumed to be independent of the effects of the observed covariates. These independence assumptions had the effect of representing the unobserved variable as a worst-case scenario, since correlation between assignment to X̃ = 1 (presence of X̃) and observed covariates, as well as correlation between the effects of X̃ and the effects of observed covariates, would lessen its impact.

For example, suppose a patient was predicted to have a probability of in-hospital mortality under CABG of 10% (odds of 0.1 ÷ 0.9 = 1/9), and a predicted probability under PCI of 5% (odds of 1/19). If the covariate X̃ is present (X̃ = 1) and the adjustment factor for CABG is ζ_CABG = 1/2, then the updated odds of in-hospital mortality given the presence of the unobserved covariate X̃ are (1/9)(1/2) = 1/18. Converting the updated odds back to the probability scale (using the formula probability = 1 ÷ [1 + odds⁻¹]), we find that the updated probability under CABG is approximately 5.3%. A similar calculation for PCI assuming ζ_PCI = 2 in this hypothetical patient results in an updated probability under PCI of 9.5%. Thus, for this patient, the recommended treatment is switched from PCI to CABG, which (given their actual treatment) switches their discordance status under the decision model.
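The worked example can be checked with a few lines of code. The helper below multiplies the odds implied by a predicted probability by the adjustment factor and converts the result back to a probability; the function name is hypothetical.

```python
def adjust_probability(p, zeta):
    """Multiply the odds implied by p by zeta, then return the
    corresponding probability."""
    updated_odds = zeta * p / (1.0 - p)
    return updated_odds / (1.0 + updated_odds)

# CABG: 10% predicted mortality, zeta_CABG = 1/2 -> about 5.3%.
p_cabg = adjust_probability(0.10, 0.5)
# PCI: 5% predicted mortality, zeta_PCI = 2 -> about 9.5%.
p_pci = adjust_probability(0.05, 2.0)

# After adjustment, CABG has the lower predicted probability.
switched_to_cabg = p_cabg < p_pci
```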

From the above notation and corresponding example, we note that evaluation of both the effect of the unobserved covariate on individuals undergoing CABG (ζ_CABG) and the effect of the unobserved covariate on patients undergoing PCI (ζ_PCI) can be simplified by studying their ratio, i.e., γ = ζ_CABG ÷ ζ_PCI. Here, the parameter γ is interpreted as the differential effect of the missing covariate between the two treatments; when γ = 0.5, the missing covariate is assumed to affect in-hospital mortality twice as severely among patients undergoing PCI as among patients undergoing CABG, and the converse is true when γ = 2.

Using the validation dataset, we conducted a Monte Carlo simulation which generated repeated datasets, randomly assigning patients to have X̃ = 1 in each dataset. Ten thousand datasets were generated, each incorporating a randomly-chosen prevalence value for X̃ (based on the random uniform distribution). To each simulated dataset we applied a series of differential effect parameters ranging from γ = 1/4 to γ = 4; for each differential effect parameter we then re-estimated the discordant-to-concordant odds ratio. Results of these manipulations on the overall discordant-to-concordant odds ratio estimate were then visualized.
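The simulation loop described above might be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the original analysis: the cohort is synthetic, γ is split symmetrically into ζ_CABG = √γ and ζ_PCI = 1/√γ (the text specifies only their ratio), the recommended treatment is the one with the lower predicted probability, and the odds ratio is a crude 2×2 estimate with a 0.5 continuity correction rather than a model-based estimate. All function names are hypothetical.

```python
import math
import random


def apply_odds_ratio(p, zeta):
    """Multiply the odds implied by probability p by zeta; return a probability."""
    odds = zeta * p / (1.0 - p)
    return odds / (1.0 + odds)


def discordance_flags(cohort, has_x, gamma):
    """Return (discordant, died) pairs after perturbing predictions where X~ = 1."""
    # ASSUMPTION: symmetric split, so that zeta_cabg / zeta_pci == gamma.
    zeta_cabg = math.sqrt(gamma)
    zeta_pci = 1.0 / zeta_cabg
    flags = []
    for (p_cabg, p_pci, received, died), x in zip(cohort, has_x):
        if x:
            p_cabg = apply_odds_ratio(p_cabg, zeta_cabg)
            p_pci = apply_odds_ratio(p_pci, zeta_pci)
        # ASSUMPTION: recommend the treatment with the lower predicted risk.
        recommended = "CABG" if p_cabg < p_pci else "PCI"
        flags.append((received != recommended, died))
    return flags


def crude_odds_ratio(flags):
    """Discordant-to-concordant odds ratio from a 2x2 table (0.5 added to each cell)."""
    a = b = c = d = 0.5
    for discordant, died in flags:
        if discordant and died:
            a += 1
        elif discordant:
            b += 1
        elif died:
            c += 1
        else:
            d += 1
    return (a * d) / (b * c)


def synthetic_cohort(n, rng):
    """Purely synthetic stand-in for the validation dataset."""
    cohort = []
    for _ in range(n):
        p_cabg = rng.uniform(0.01, 0.20)
        p_pci = rng.uniform(0.01, 0.20)
        received = rng.choice(["CABG", "PCI"])
        p_death = p_cabg if received == "CABG" else p_pci
        cohort.append((p_cabg, p_pci, received, rng.random() < p_death))
    return cohort


rng = random.Random(2015)
cohort = synthetic_cohort(2000, rng)
gammas = (0.25, 0.5, 1.0, 2.0, 4.0)
results = []
for _ in range(100):  # the article generated 10,000 datasets
    prevalence = rng.random()  # uniform prevalence for X~ = 1
    has_x = [rng.random() < prevalence for _ in cohort]
    for gamma in gammas:
        flags = discordance_flags(cohort, has_x, gamma)
        results.append((gamma, prevalence, crude_odds_ratio(flags)))
```

Each element of `results` pairs a value of γ and a prevalence with a re-estimated odds ratio, which is the kind of triple that could be plotted as a scatterplot or smoothed into a contour surface.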

Results of this Monte Carlo analysis are summarized in Figure 6, and indicated that the discordant-to-concordant odds ratio was robust to even highly-prevalent unobserved variables with strong differential effects; for example, a highly-prevalent missing covariate with a four-fold differential effect between patients undergoing CABG and patients undergoing PCI did not qualitatively degrade the overall discordant-to-concordant odds ratio, with mortality being twice as likely among discordant discharges as among concordant discharges.

Results of our second sensitivity analysis, which evaluated the impact of an unobserved covariate on the overall discordant-to-concordant odds ratio estimate. Ten thousand datasets were generated, each a copy of the original validation dataset with the unobserved covariate X̃ randomly assigned to be present or absent among all patients. The prevalence of X̃ = 1 was set for each dataset by drawing from a random uniform distribution. For each dataset, the discordant-to-concordant odds ratio was recalculated assuming varying values for the differential effect of the unobserved covariate on in-hospital mortality between patients undergoing CABG and patients undergoing PCI (parameter γ). Panel (a) displays a scatterplot of individual simulation results for five sample values of γ vs. the randomly-assigned prevalence value for the unobserved covariate. Panel (b) displays a smoothed (via thin plate regression splines) contour plot of the three-dimensional surface describing the estimated odds ratio as a function of both γ and the randomly-assigned prevalence value. The lines in the contour plot, which are labeled in the figure, represent constant levels of the three-dimensional odds ratio surface.

Discussion

We propose a straightforward technique for measuring predicted benefit of undergoing a particular treatment relative to a comparator treatment, with respect to optimizing risk for a single outcome of interest. At its core, the procedure entails developing prediction models for each treatment which map distinct combinations of covariate values to a predicted probability of outcome, and deriving measures of evidence based on the relative values of the probability estimates. We also propose techniques for validating the model, including estimating the effect on outcome of clinical discordance to a model-based decision rule.

It is the rare patient who is “average” and can thus be expected to have outcomes matching those reported in clinical trials. Instead, individual patients vary widely with respect to genetic, environmental, and behavioral features. Nonetheless, the best current “evidence-based” approach to clinical care would prompt clinicians to make care decisions on the basis of population-average results of related clinical trials. The trouble is that decisions based on population averages may well be wrong for many individual patients. The importance of our approach is that it incorporates more information than clinical trial averages and may thus produce superior recommendations for individuals. In an example analysis of the decision between PCI and CABG, we showed that discordance in treatment selection from the model recommendation was associated with four-and-a-half times the odds of in-hospital mortality compared with concordance.

On the other end of the spectrum, the potential outcome for a specific patient under two competing treatments [16] is clearly not directly observable since only one treatment can be administered and the patient only experiences one outcome. While our methodology cannot directly compare two potential outcomes for a single individual, it does provide an independent prediction of the outcome under each of the treatments for distinct combinations of predictor values. This is more flexible than other proposed methods that stratify patients according to predictions of absolute risk reduction but nonetheless assume consistent relative effects across the population [17].

Our methodology also provides a mechanism for directly assessing the effects of observed medical decisions in patient-specific care interactions, relative to decisions that are informed by an empirical prediction modeling algorithm. We hypothesize that the efficiency and accuracy with which medical decisions are made are unlikely to be uniform across diverse patient populations. In situations where discordance is associated with increased risk of poor outcomes, certain subpopulations could experience greater discordance and therefore higher risk than other subpopulations (e.g., minorities, patients treated at institutions with poorer quality of care). Modifying the stratum-specific discordant-to-concordant odds ratio model described above (specifically, by replacing the stratification variable with relevant patient and/or hospital characteristics) can allow for analyses seeking to identify these subpopulations, thus providing a necessary framework for studies involving heterogeneity in the quality of decision making.

Decisions in health care are frequently made with respect to multiple endpoints. While this proposed decision modeling technique only supports the analysis of a single outcome, developing models for the most clinically-important endpoint might nonetheless help inform processes of care and associated decision making practices. Further methodological development will be necessary to enable extensions into situations with multiple outcomes.

We present multiple analyses to compare predicted risk of outcomes under two competing treatments: binary recommended treatment, discrete risk strata, decision model odds ratio, and finally direct interpretation of the vector of probability estimates. The latter is perhaps the most direct and allows for the interpretation of absolute risks under competing interventions [18, 19]. The other three pertain to relative measures of risk; while these may be less appealing than absolute risk measures for clinical applications, they do allow for various validation analyses, such as assessments of concordant vs. discordant care and our proposed calibration analysis for the decision model odds ratio.

As in all observational studies, controlling for the independent effects of unobserved variables is impossible here. Our proposed decision modeling approach therefore cannot evaluate causality. Furthermore, the accuracy of the methodology relies on the availability of relevant predictors. In this article, we proposed methodology for conducting a Monte Carlo analysis to assess the sensitivity of the results to a major unobserved covariate. For this particular analysis, which investigated decisions among revascularization procedures, even a highly-prevalent predictor that is independent of the other baseline characteristics in the model and that has a four-fold differential effect between patients undergoing CABG and patients undergoing PCI did not qualitatively degrade the overall discordant-to-concordant odds ratio.

An interesting aspect of our methodology is that, unlike standard analyses of average treatment effects, bias in either the model-based decision rule or the discordant-to-concordant odds ratio due to imbalances on observed predictors is not an issue, so long as the model has been correctly specified. This is because the model estimates are conditional (made with respect to distinct combinations of predictor values), rather than marginal (made across fixed and arbitrary values of the predictors). However, the issue of imbalance on observed predictors is distinct from the issue of lack of representation (i.e., lack of overlap). This especially occurs in observational data where decisions are systematically related to the available predictor variables. We proposed an examination of propensity score distributions to evaluate the extent to which decisions are determined by observed covariates; in our particular analysis of revascularization procedures, a sizeable degree of heterogeneity in decision making was observed, with the two treatments’ propensity score distributions overlapping one another and extending over wide ranges. Nonetheless, full consideration of unavailable predictors is important for gauging the model’s clinical usefulness; in our example model for revascularization, we did not have information on anatomical characteristics.

Like any prediction model, extrapolation of the decision model into areas of the covariate space not supported by the data risks potential bias and greater uncertainty. Research into methods for describing extrapolation of the decision model and evaluating when predictions should or should not be made is needed. When practical decisions can be made to restrict study inclusion criteria to the subpopulation of patients for whom there is heterogeneity in decision-making, such restrictions are recommended since they help to maximize overlap in covariates between patients receiving each treatment. In addition, we recommend that these techniques should only be applied to large health data registries, so that overlap in populations and precision of underlying prediction models are maximized. Finally, our method assumes that the prediction problem is static in nature; extensions to time-dependent settings would require consideration for potential biases due to time-varying covariates.

Analyses using randomized data can overcome many of the aforementioned issues. However, since the size of randomized clinical registries is typically determined by an analysis of power for the average treatment effect, they are rarely large enough to construct subgroup analyses [20] let alone a full prediction analysis that evaluates outcomes of alternative treatment decisions relative to patients’ individual characteristics. An exception is with revascularization: the SYNTAX score II [21] incorporated both anatomical characteristics (as represented by the original SYNTAX score) [22, 23] and clinical characteristics. Model development was based on randomized data from 1,800 patients, and seven clinical predictors that exhibited both significant main effects (α=0.05) within procedure-specific models and a significant two-way interaction (α=0.10) with the randomized CABG/PCI treatment in the combined cohort were included in the model. This analysis was clearly unique and innovative, representing a step beyond existing models in cardiac surgery (e.g., EuroSCORE [24], Society of Thoracic Surgeons’ Predicted Risk of Mortality³, Wu’s score for mortality after PCI [25], etc.) which only predict outcomes (under a chosen treatment). However, the extent to which model performance might be improved with an even larger dataset – one that could support more complex interactions among a greater number of predictors – is unknown. Since acquiring randomized data on such a large sample is often prohibitively expensive, we feel that our approach may be a useful covariate-specific analogue to well-controlled (e.g., propensity-matched) observational studies evaluating average treatment effects.

Methodologically, we propose the development of separate prediction models within each treatment as opposed to a single interaction model that (implicitly) assumes the same predictors are used for modeling both treatments. This departs from prior efforts such as the SYNTAX score II. Different mechanisms might plausibly influence mortality risk under CABG than those which influence it under PCI (e.g., deep sternal wound infection is a risk with CABG [26] but not with PCI). While the interaction approach can handle this situation simply by assigning coefficients near zero, the use of separate models allows for the application of modern feature selection routines such as lasso/elastic net shrinkage and random forests within each treatment. As we mention in the Introduction, however, our model for revascularization was strictly to facilitate descriptions of the proposed methodology and comparisons with existing approaches.

In summary, we propose a practical technique for developing and validating empirical decision models using large health care data registries. The technique can be especially useful in assessing observed clinical treatments against model-recommended treatments. Further research will be necessary to evaluate the utility of these models for informing actual patient care decisions – especially when they are derived using large observational datasets, since the possibility of residual confounding due to lack of important predictors can never be fully excluded.

Acknowledgments

Financial support for this study was provided in part by the Clinical and Translational Science Collaborative of Cleveland, KL2TR000440 from the National Center for Advancing Translational Sciences component of the National Institutes of Health (NIH) and the NIH roadmap for Medical Research. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing and publishing the report.

Appendix 1

Derivation of the calibration curves for both the individual treatments’ probability models as well as the decision model odds ratio

Prior to introducing the model which estimates the calibration curve for the decision model odds ratio θ̂_i, we first summarize a recently-published method for recalibrating (or bias-correcting) binary clinical prediction models [13] in the context of our decision modeling analysis. The calibration model for θ̂_i then follows as an extension of that methodology.

Recalibration of Individual Probability Models

As mentioned in the Methods, calibration of the individual probability functions p̂^(T)(X) (for treatment T ∈ {1,2}) is performed using dedicated model calibration datasets. For a given T, the process thus begins by applying the probability model p̂^(T)(X) to its model calibration dataset, obtaining predictions p̂_i (where i indexes the observations in the model calibration dataset for treatment T). These predicted probabilities are then converted to the log-odds scale, using the equation

{\hat{η}}_{i} = {\hat{η}}^{(T)} (X_{i}) = log (\frac{{\hat{p}}_{i}}{1 - {\hat{p}}_{i}}),

where the subscript i indexes the observations in that treatment’s model calibration dataset. The standard linear-logistic calibration function proposed by D.R. Cox (1958) [27] is of the form

log (\frac{p_{i}}{1 - p_{i}}) = β_{0} + β_{1}^{‡} {\hat{η}}_{i} + {error}_{i},

where p_i represents the true (unknown) probability of the outcome for patient i in the treatment’s model calibration dataset. Dalton [13] modified this model by introducing an offset term for the risk score [28] (i.e., a parameter with a fixed coefficient of one):

log (\frac{p_{i}}{1 - p_{i}}) = β_{0} + (β_{1} + 1) {\hat{η}}_{i} + {error}_{i} .

Note that $β_{1}^{‡} = β_{1} + 1$ . After some algebraic manipulations, we observe that adding the offset term allows for the following equivalent specification of Cox’s linear-logistic calibration model:

log (γ_{i}) = log (\frac{p_{i} \div (1 - p_{i})}{{\hat{p}}_{i} \div (1 - {\hat{p}}_{i})}) = β_{0} + β_{1} {\hat{η}}_{i} + {error}_{i} .

In this equation, the observed-to-expected odds ratio γ_i represents the miscalibration between model predictions and observed outcome incidences, and we see that such miscalibrations are modeled according to a simple linear equation with an intercept β₀ and a slope β₁. Furthermore, adding the offset term and investigating this linear equation allows for the assessment of model calibration against a horizontal line given by log(γ_i) = 0 – that is, ideal calibration is indicated by coefficients of zero for both β₀ and β₁. (In the standard linear-logistic calibration model of D.R. Cox, ideal calibration is indicated by coefficients of 0 and 1, respectively.)
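For readers who wish to see the "algebraic manipulations" spelled out, the step amounts to subtracting the predicted log-odds from both sides of the offset model. The lines below are our own restatement using the symbols already defined in this appendix:

```latex
\begin{aligned}
\log\left(\frac{p_i}{1-p_i}\right)
  &= \beta_0 + (\beta_1 + 1)\,\hat{\eta}_i + \text{error}_i \\
\log\left(\frac{p_i}{1-p_i}\right) - \hat{\eta}_i
  &= \beta_0 + \beta_1\,\hat{\eta}_i + \text{error}_i \\
\log\left(\frac{p_i \div (1-p_i)}{\hat{p}_i \div (1-\hat{p}_i)}\right)
  &= \beta_0 + \beta_1\,\hat{\eta}_i + \text{error}_i ,
\end{aligned}
```

where the last line uses the definition of the predicted log-odds, \hat{\eta}_i = \log(\hat{p}_i / (1-\hat{p}_i)), so that the left-hand side is the log of the observed-to-expected odds ratio.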

The predominant feature of the latter version of the model is that it allows for greater flexibility in modeling miscalibrations. The right-hand side of the equation need not be a simple linear function; it can be replaced by more complex functions such as polynomials, splines, or piecewise constants (for brevity, we use the simplified notation Ĥ(η̂_i) to represent the predicted values obtained by such models). Furthermore, regardless of the functional form chosen, the right-hand side of the calibration equation can now be used to bias-correct (or recalibrate) original model estimates so that they are better aligned with observed outcome incidences in external data. To do this, the estimated value for log(γ_i) – now given by the more general expression log(γ̂_i) = Ĥ(η̂_i) – is added to the original estimate of the log-odds η̂_i to obtain the recalibrated log-odds η̃_i = η̂_i + Ĥ(η̂_i), which can then be converted back to the probability scale using the sigmoid function, i.e., p̃_i = [1 + e^(−η̃_i)]⁻¹.
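As an illustration of the recalibration step only (not of how the calibration curve is estimated), the sketch below applies a linear miscalibration function to a single prediction. The function names and the coefficient values are hypothetical; in practice the coefficients would come from fitting the calibration model to a model calibration dataset, and a spline or polynomial could replace the linear form.

```python
import math


def logit(p: float) -> float:
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta: float) -> float:
    """Log-odds -> probability (the sigmoid function)."""
    return 1.0 / (1.0 + math.exp(-eta))


def recalibrate(p_hat: float, beta0: float, beta1: float) -> float:
    """Add the estimated log observed-to-expected odds ratio to the predicted log-odds."""
    eta_hat = logit(p_hat)
    # ASSUMPTION: linear miscalibration function, H(eta) = beta0 + beta1 * eta.
    correction = beta0 + beta1 * eta_hat
    return inverse_logit(eta_hat + correction)


# With beta0 = beta1 = 0 (ideal calibration) the prediction is left unchanged.
unchanged = recalibrate(0.20, 0.0, 0.0)
# Hypothetical positive intercept: the model under-predicts risk, so the estimate moves up.
shifted = recalibrate(0.20, 0.3, 0.0)
```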

Assessing Calibration of the Decision Model Odds Ratio

In the basic calibration model given above, the offset term represented the expected (or predicted) log-odds of the outcome under the treatment’s probability model p̂^(T)(X).

To adapt the model so that it allows for assessment of θ̂, this expected log-odds of the outcome is expressed differently, depending on whether the patient was recommended for treatment 1 or treatment 2 and on whether or not the patient actually received the recommended treatment. Algebraic manipulations (see Appendix 2) reveal that the offset (or risk score, RS) can be simplified and expressed as the expected log-odds under the observed treatment, i.e.,

R S_{i} = \begin{cases} log (\frac{{\hat{p}}^{(1)} (X_{i})}{1 - {\hat{p}}^{(1)} (X_{i})}) & \text{if } O_{i} = 1; \\ log (\frac{{\hat{p}}^{(2)} (X_{i})}{1 - {\hat{p}}^{(2)} (X_{i})}) & \text{if } O_{i} = 2, \end{cases}

where the subscript i now indexes the observations in our validation dataset. Now, we introduce the following decision model odds ratio calibration model:

log (\frac{p_{i}}{1 - p_{i}}) = 1 (R S_{i}) + β_{1} D_{i} + β_{2} D_{i} log ({\hat{θ}}_{i}) + {error}_{i} .

Note that this model lacks the intercept term β₀. Using the offset term of 1(RS_i) once again allows us to model observed-to-expected odds ratios, where the expected odds in the denominator now represents that given by the decision model odds ratio model, as described above.

In the model, the parameter β₁ represents calibration in the large of the decision model odds ratio among discordant patients, i.e., the average over- or under-estimate of log(θ̂_i). Likewise, the parameter β₂ represents the slope of the miscalibration function. Perfect calibration of the decision model odds ratio is thus represented by β₁ = β₂ = 0.

In the section above regarding the calibration procedure for the individual probability models, we described the extension of the linear-logistic calibration model into situations with more complicated miscalibration relationships. This can be done here as well according to the following more general version of the decision model odds ratio calibration model:

log (\frac{p_{i}}{1 - p_{i}}) = 1 (R S_{i}) + D_{i} (β' H (log ({\hat{θ}}_{i}))) + {error}_{i},

where β is a vector of regression coefficients and H(log(θ̂_i)) is a matrix of basis expansion variables [7, 13] derived from log(θ̂_i). Note that H must include a column of ones in order to represent the parameter associated with calibration in the large. For example, the columns in H might contain polynomial terms, splines, or indicator variables to represent discrete intervals for log(θ̂_i). In implementing this model using standard statistical analysis software, the offset is specified along with a main effect for the discordance indicator D_i and an interaction term between D_i and the desired term describing the nature of miscalibration (e.g., a cubic spline term with 3 knots).
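To make the implementation note concrete, the sketch below assembles, for one patient, the offset and the two covariates of the linear version of the model (the main effect D_i and the interaction D_i·log(θ̂_i)). It is a sketch under assumptions rather than the original code: θ̂_i is taken as the predicted odds under treatment 2 divided by the predicted odds under treatment 1 (as in Appendix 2), the recommended treatment is taken to be the one with the lower predicted probability, and the name `calibration_row` is hypothetical. Fitting would then be done with any logistic regression routine that accepts an offset and omits the intercept.

```python
import math


def logit(p: float) -> float:
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))


def calibration_row(p1: float, p2: float, observed: int) -> dict:
    """Offset and covariates for one patient in the odds-ratio calibration model."""
    # ASSUMPTION: theta_hat = (odds under treatment 2) / (odds under treatment 1).
    log_theta = logit(p2) - logit(p1)
    # ASSUMPTION: the recommended treatment has the lower predicted risk.
    recommended = 1 if p1 <= p2 else 2
    discordant = int(observed != recommended)
    return {
        # Offset RS_i: predicted log-odds under the treatment actually received.
        "offset": logit(p1) if observed == 1 else logit(p2),
        "D": discordant,
        "D_x_log_theta": discordant * log_theta,
    }


# Hypothetical patient: 10% risk under treatment 1, 5% under treatment 2, received treatment 1.
row = calibration_row(p1=0.10, p2=0.05, observed=1)
```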

In theory, this calibration curve can readily be used to correct estimates of θ̂_i, in a similar manner to what was done for the individual probability models p̂⁽¹⁾(X) and p̂⁽²⁾(X) (i.e., add the calibration curve to the original value to get a recalibrated value). However, since θ̂_i is merely a function of p̂⁽¹⁾(X) and p̂⁽²⁾(X), we feel that fixing calibration issues with the underlying probability models is a more direct approach. The calibration curve for θ̂_i nonetheless serves as a useful diagnostic tool.

Appendix 2

Proof that the offset term (expected log-odds of the outcome) under the decision model odds ratio calibration model is algebraically equivalent to the expected log-odds under the observed treatment’s probability model

We consider four scenarios – namely, all combinations of recommended treatment (R_i ∈ {1,2}) and concordance vs. discordance to recommended treatment (D_i ∈ {0,1}) – and show that in all cases the expected log-odds under the decision model odds ratio model (i.e., the risk score, or RS_i) are equal to the expected log-odds from the underlying probability model predicting the outcome for the treatment actually administered (p̂^(O_i)(X_i) transformed to the log-odds scale, where O_i represents the treatment actually administered).

The derivations consider the risk score as equal to the predicted log-odds under the recommended treatment, plus an increase in risk represented by ±log(θ̂_i) if discordant treatment was administered. As per the definition of θ̂_i as the relative odds comparing treatment 2 to treatment 1, the increment in risk is θ̂_i when R_i = 1 and D_i = 1, and the increment in risk is ${\hat{θ}}_{i}^{- 1}$ when R_i = 2 and D_i = 1. On the logarithmic scale, we thus have +log(θ̂_i) and –log(θ̂_i), respectively.

The quantity log(θ̂_i) can be simplified as follows:

log ({\hat{θ}}_{i}) = log (\frac{{\hat{p}}^{(2)} (X_{i}) \div (1 - {\hat{p}}^{(2)} (X_{i}))}{{\hat{p}}^{(1)} (X_{i}) \div (1 - {\hat{p}}^{(1)} (X_{i}))}) = log ({\hat{p}}^{(2)} (X_{i})) - log (1 - {\hat{p}}^{(2)} (X_{i})) - [log ({\hat{p}}^{(1)} (X_{i})) - log (1 - {\hat{p}}^{(1)} (X_{i}))] = log ({\hat{p}}^{(2)} (X_{i})) - log (1 - {\hat{p}}^{(2)} (X_{i})) - log ({\hat{p}}^{(1)} (X_{i})) + log (1 - {\hat{p}}^{(1)} (X_{i})) .

Of course, we also have that

- log ({\hat{θ}}_{i}) = log ({\hat{p}}^{(1)} (X_{i})) - log (1 - {\hat{p}}^{(1)} (X_{i})) - log ({\hat{p}}^{(2)} (X_{i})) + log (1 - {\hat{p}}^{(2)} (X_{i})) .

Now, we consider four cases according to recommended treatment and whether or not the treatment administered was discordant:

Case #1: R_i = 1, D_i = 1 (O_i = 2)
$R S_{i} = \text{Predicted log-odds under } R_{i} = 1 \text{ model} + log ({\hat{θ}}_{i}) = log ({\hat{p}}^{(1)} (X_{i})) - log (1 - {\hat{p}}^{(1)} (X_{i})) + log ({\hat{θ}}_{i}) = log ({\hat{p}}^{(2)} (X_{i})) - log (1 - {\hat{p}}^{(2)} (X_{i}))$
Case #2: R_i = 1, D_i = 0 (O_i = 1)
$R S_{i} = \text{Predicted log-odds under } R_{i} = 1 \text{ model} = log ({\hat{p}}^{(1)} (X_{i})) - log (1 - {\hat{p}}^{(1)} (X_{i}))$
Case #3: R_i = 2, D_i = 1 (O_i = 1)
$R S_{i} = \text{Predicted log-odds under } R_{i} = 2 \text{ model} - log ({\hat{θ}}_{i}) = log ({\hat{p}}^{(2)} (X_{i})) - log (1 - {\hat{p}}^{(2)} (X_{i})) - log ({\hat{θ}}_{i}) = log ({\hat{p}}^{(1)} (X_{i})) - log (1 - {\hat{p}}^{(1)} (X_{i}))$
Case #4: R_i = 2, D_i = 0 (O_i = 2)
$R S_{i} = \text{Predicted log-odds under } R_{i} = 2 \text{ model} = log ({\hat{p}}^{(2)} (X_{i})) - log (1 - {\hat{p}}^{(2)} (X_{i}))$

Thus, in all four cases, we have that RS_i is equal to the predicted log-odds of the outcome under the O_i model, i.e., the model predicting risk for the observed treatment:

R S_{i} = \begin{cases} log (\frac{{\hat{p}}^{(1)} (X_{i})}{1 - {\hat{p}}^{(1)} (X_{i})}) & \text{if } O_{i} = 1; \\ log (\frac{{\hat{p}}^{(2)} (X_{i})}{1 - {\hat{p}}^{(2)} (X_{i})}) & \text{if } O_{i} = 2 . \end{cases}
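The four cases can also be checked numerically. The short sketch below is our own illustration: the probabilities are arbitrary hypothetical values and the function name `risk_score` is not from the original analysis. It computes the risk score as the predicted log-odds under the recommended treatment plus or minus log(θ̂_i), and confirms that the result matches the predicted log-odds under the treatment actually received.

```python
import math


def logit(p: float) -> float:
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))


def risk_score(p1: float, p2: float, recommended: int, discordant: int) -> float:
    """Log-odds under the recommended treatment, shifted by +/- log(theta) if discordant."""
    log_theta = logit(p2) - logit(p1)  # theta: odds under treatment 2 vs. treatment 1
    base = logit(p1) if recommended == 1 else logit(p2)
    if not discordant:
        return base
    return base + log_theta if recommended == 1 else base - log_theta


p1, p2 = 0.10, 0.05  # hypothetical predicted probabilities
checks = []
for recommended in (1, 2):
    for discordant in (0, 1):
        # Treatment actually received: the other one when care was discordant.
        observed = recommended if not discordant else 3 - recommended
        expected = logit(p1) if observed == 1 else logit(p2)
        checks.append(math.isclose(risk_score(p1, p2, recommended, discordant), expected))

print(all(checks))  # prints "True"
```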

Footnotes

Algorithmic details of the elastic net modeling procedure are as follows: a mixing parameter – i.e., the proportion of elastic net penalty term dedicated to the L₁ (lasso) term, as opposed to the L₂ (ridge) term – of α = 0.90 was chosen for the CABG model and a mixing parameter of α = 0.85 was chosen for the PCI model based on minimization of the model goodness of fit criterion via cross validation. The model shrinkage parameter λ was chosen as the largest value for which cross-validated prediction error was not more than one standard error greater than that observed for the value of λ that minimized cross-validated prediction error.

We assume here and throughout the rest of the article (without loss of generality) that the outcome is undesired, and as such, lower probabilities of the outcome are better.
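The rule for choosing the shrinkage parameter λ described in the first footnote (often called the one-standard-error rule) can be sketched as below. The function name and the cross-validation values are hypothetical, and the sketch assumes that the standard error used is the one observed at the error-minimizing λ.

```python
def one_standard_error_lambda(lambdas, cv_errors, cv_standard_errors):
    """Largest lambda whose CV error is within one SE of the minimum CV error."""
    best = min(range(len(lambdas)), key=lambda k: cv_errors[k])
    # ASSUMPTION: the standard error at the minimizing lambda sets the threshold.
    threshold = cv_errors[best] + cv_standard_errors[best]
    eligible = [lam for lam, err in zip(lambdas, cv_errors) if err <= threshold]
    return max(eligible)


# Hypothetical cross-validation summary for four candidate values of lambda.
lambdas = [0.001, 0.01, 0.05, 0.1]
cv_errors = [0.300, 0.290, 0.295, 0.320]
cv_standard_errors = [0.010, 0.008, 0.009, 0.011]

chosen = one_standard_error_lambda(lambdas, cv_errors, cv_standard_errors)
print(chosen)  # prints 0.05
```

Here the minimum error (0.290) occurs at λ = 0.01, the threshold is 0.298, and λ = 0.05 is the largest candidate whose error stays under it, giving a more heavily penalized model at little cost in cross-validated error.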

Contributor Information

Jarrod E. Dalton, Departments of Quantitative Health Sciences and Outcomes Research, Cleveland Clinic, Cleveland, Ohio.

Neal V. Dawson, Center for Healthcare Research and Policy, Case Western Reserve University/MetroHealth Medical Center, Cleveland, Ohio.

Daniel I. Sessler, Department of Outcomes Research, Cleveland Clinic, Cleveland, Ohio.

Jesse D. Schold, Department of Quantitative Health Sciences and Center for Renal and Pancreas Transplantation, Cleveland Clinic, Cleveland, Ohio.

Thomas E. Love, Center for Healthcare Research and Policy, Case Western Reserve University/MetroHealth Medical Center, Cleveland, Ohio.

Michael W. Kattan, Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, Ohio.

References

1. Sox HC, Greenfield S. Comparative effectiveness research: a report from the Institute of Medicine. Annals of Internal Medicine. 2009;151(3):203–205. doi: 10.7326/0003-4819-151-3-200908040-00125.
2. Rubin DB. On the limitations of comparative effectiveness research. Statistics in Medicine. 2010;29(19):1991–1995. doi: 10.1002/sim.3960.
3. Hillis LD, et al. 2011 ACCF/AHA Guideline for Coronary Artery Bypass Graft Surgery: executive summary: a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Circulation. 2011;124(23):2610–42. doi: 10.1161/CIR.0b013e31823b5fee.
4. Levine GN, et al. 2011 ACCF/AHA/SCAI Guideline for Percutaneous Coronary Intervention: A Report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines and the Society for Cardiovascular Angiography and Interventions. Circulation. 2011;124(23):e574–e651. doi: 10.1161/CIR.0b013e31823ba622.
5. Windecker S, et al. 2014 ESC/EACTS Guidelines on myocardial revascularization: The Task Force on Myocardial Revascularization of the European Society of Cardiology (ESC) and the European Association for Cardio-Thoracic Surgery (EACTS) Developed with the special contribution of the European Association of Percutaneous Cardiovascular Interventions (EAPCI). EuroIntervention. 2014. doi: 10.1093/ejcts/ezu366.
6. Efron B, Tibshirani R. Improvements on Cross-Validation: The 632+ Bootstrap Method. Journal of the American Statistical Association. 1997;92(438):548–560.
7. Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. Springer series in statistics. New York, NY: Springer; 2009.
8. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI. 1995.
9. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
10. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320.
11. Dalton JE, et al. Impact of present-on-admission indicators on risk-adjusted hospital mortality measurement. Anesthesiology. 2013;118(6):1298–1306. doi: 10.1097/ALN.0b013e31828e12b3.
12. Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer; 2001.
13. Dalton JE. Flexible recalibration of binary clinical prediction models. Stat Med. 2013;32(2):282–289. doi: 10.1002/sim.5544.
14. Russell LB. Do We Really Value Identified Lives More Highly Than Statistical Lives? Medical Decision Making. 2013. doi: 10.1177/0272989X13512183.
15. Epstein AJ, et al. Coronary revascularization trends in the United States, 2001–2008. JAMA. 2011;305(17):1769–1776. doi: 10.1001/jama.2011.551.
16. Rubin DB. Causal inference using potential outcomes. Journal of the American Statistical Association. 2005;100(469).
17. Hingorani AD, et al. Prognosis research strategy (PROGRESS) 4: stratified medicine research. BMJ: British Medical Journal. 2013;346. doi: 10.1136/bmj.e5793.
18. Bogardus ST Jr, Holmboe E, Jekel JF. Perils, pitfalls, and possibilities in talking about medical risk. JAMA. 1999;281(11):1037–1041. doi: 10.1001/jama.281.11.1037.
19. Steyerberg EW, et al. Assessing the Performance of Prediction Models: A Framework for Traditional and Novel Measures. Epidemiology. 2010;21(1):128–138. doi: 10.1097/EDE.0b013e3181c30fb2.
20. Kent DM, et al. Assessing and reporting heterogeneity in treatment effects in clinical trials: a proposal. Trials. 2010;11(1):85. doi: 10.1186/1745-6215-11-85.
21. Farooq V, et al. Anatomical and clinical characteristics to guide decision making between coronary artery bypass surgery and percutaneous coronary intervention for individual patients: development and validation of SYNTAX score II. The Lancet. 2013;381(9867):639–650. doi: 10.1016/S0140-6736(13)60108-7.
22. Serruys PW, et al. Assessment of the SYNTAX score in the Syntax study. EuroIntervention. 2009;5(1):50–56. doi: 10.4244/eijv5i1a9.
23. Sianos G, et al. The SYNTAX Score: an angiographic tool grading the complexity of coronary artery disease. EuroIntervention. 2005;1(2):219–227.
24. Nashef SA, et al. European system for cardiac operative risk evaluation (EuroSCORE). European Journal of Cardio-Thoracic Surgery. 1999;16(1):9–13. doi: 10.1016/s1010-7940(99)00134-7.
25. Wu C, et al. A risk score to predict in-hospital mortality for percutaneous coronary interventions. Journal of the American College of Cardiology. 2006;47(3):654–660. doi: 10.1016/j.jacc.2005.09.071.
26. Toumpoulis IK, et al. The impact of deep sternal wound infection on long-term survival after coronary artery bypass grafting. Chest Journal. 2005;127(2):464–471. doi: 10.1378/chest.127.2.464.
27. Cox DR. Two further applications of a model for binary regression. Biometrika. 1958;45(3/4):562–565.
28. Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. Springer; 2009.



