Abstract
Objective.
To improve the estimation of healthcare expenditures by introducing a novel method that is well-suited to situations where data exhibit strong skewness and zero-inflation.
Data Sources.
Simulations and two real-world datasets: the 2016–2017 Medical Expenditure Panel Survey (MEPS) and the Back Pain Outcomes using Longitudinal Data (BOLD) project.
Study Design.
Super learner is an ensemble machine learning approach that can combine several algorithms to improve estimation. We propose a two-stage super learner that is well suited for healthcare expenditure data by separately estimating the probability of any healthcare expenditure and the mean amount of healthcare expenditure conditional on having healthcare expenditures. These estimates can then be combined to yield a single estimate of expenditures for each observation. The analytical strategy can flexibly incorporate a range of individual estimation approaches for each stage of estimation, including both regression-based approaches and machine learning algorithms such as random forests. We compare the performance of the two-stage super learner with a one-stage super learner, and with multiple individual algorithms for estimation of healthcare cost under a broad range of data settings in simulated and real data. Predictive performance was compared using Mean Squared Error and R².
Conclusions.
Our results indicate that the two-stage super learner has better performance compared with a one-stage super learner and individual algorithms, for healthcare cost estimation under a wide variety of settings in simulations and in empirical analyses. The improvement of the two-stage super learner over the one-stage super learner was particularly evident in settings when zero-inflation is high.
Keywords: Semicontinuous data, two-part models, zero-inflation, super learning, healthcare expenditure
1. Introduction
The study of healthcare expenditures is a key component of health services research with common applications including estimating future expenditures, expenditures related to particular health conditions, and effects of interventions on total expenditures (Gregori et al. 2011). However, statistical modeling of healthcare expenditures is often challenging for two reasons: individuals without health care utilization can lead to zero inflation, and very high expenditures in a small number of individuals can lead to large positive skewness (Manning and Mullahy 2001; Jones 2010). In the United States, Berk & Monheit (2001) report that in a typical year, 10% or more of individuals incur no healthcare expenditures, while 5% of the population accounts for the majority of health expenditures.
Many methods have been proposed to deal with these challenges (Jones 2010). A common approach involves regression of log-transformed cost using ordinary least squares. Duan’s smearing estimator (Duan 1983) can be used to map estimates from the log-scale back to the original scale. Other approaches include non-Gaussian Generalized Linear Models (GLM) (Duan 1983; Blough et al. 1999; Manning and Mullahy 2001), Accelerated Failure Time (AFT) models (Manning 2005), hazard-based models (Cox 1972; Gilleskie and Mroz 2004; Basu and Rathouz 2005), and quantile-based models (Wang and Zhou 2010). To explicitly account for zero-inflation, researchers often use two-part models (Mullahy 1998). In this approach, separate models are specified for the probability of any cost, and for the mean of costs amongst individuals with positive costs. Two-part GLMs with a logit link for the binary component and a Gamma distribution with log-link for the continuous component are common in practice (Finkelstein et al. 2009; Cook et al. 2010; Cawley and Meyerhoefer 2012).
All of these methods can work well when the outcome distribution meets the corresponding model assumptions. However, practitioners are often unaware of whether this is the case for a given data set and unfortunately, there is no “one-size-fits-all” approach (Basu and Manning 2009). Moreover, many common methods rely on parametric regression models, which are commonly mis-specified in practice, and can introduce bias in the results. The challenge to accurately model cost data has led to increased interest in machine learning techniques, which use more flexible approaches to learn about relationships in data, thereby providing, in many cases, more accurate cost predictions (James et al. 2013). Specifically, penalization and ensemble approaches have been applied to modeling healthcare expenditures (Rose 2016; Morid et al. 2017; Shrestha et al. 2018; Zink and Rose 2020).
In this work, we consider super learning, an ensemble method also known as model stacking that combines a range of candidate models, for application to healthcare expenditures (Wolpert 1992; Breiman 1996; Leblanc and Tibshirani 1996; van der Laan et al. 2007). Super learning entails using cross-validation to learn the optimal combination of a collection of candidate methods. This collection could include parametric and machine learning-based algorithms. Super learning has shown benefits over a single method in several healthcare studies (Rose 2013; Kessler et al. 2014; Pirracchio 2015) including prediction of expenditures in the context of plan payment risk adjustment (Rose 2016). However, no past research on super learning has focused specifically on zero-inflated data.
In this work, we propose a two-stage super learner. We define a set of candidate methods for predicting the presence of any costs, as well as for the positive portion of the cost distribution. The full super learner library consists of all pairwise combinations of the methods for each stage. We provide an efficient implementation of the two-stage super learner with tailoring for challenges associated with healthcare cost data. We compare the two-stage super learner with the regular, one-stage super learner, along with individual methods commonly used in studies of healthcare expenditures using Monte Carlo simulations under various data generating processes. In addition, we analyze data from the 2016–2017 Medical Expenditure Panel Survey (MEPS) (Cohen 2003; Cohen 2009) and the Back Pain Outcomes using Longitudinal Data (BOLD) project (Jarvik et al. 2012). In both cases, the two-stage super learner improved performance over existing approaches.
2. Methods
2.1. Two-stage super learner
2.1.1. Two-part model
The two-part model consists of a binary model for the probability of the outcome being positive and a regression model for the mean cost applied to the positive subsample. Let Y denote the healthcare cost, which is assumed to be non-negative, and X denote a potentially high-dimensional vector of predictors. The motivation for a two-part model is that the mean, E[Y|X], can be written as:
$$E[Y \mid X] = \Pr(Y > 0 \mid X) \times E[Y \mid Y > 0, X]$$
The two pieces of this equation can be estimated separately. The first piece, Pr(Y > 0 | X) is commonly modeled using logistic regression, although in principle any regression suitable for a binary outcome could be used. The second component E[Y|Y > 0, X] is commonly modeled using Gamma-family GLMs with log link functions to account for high skewness, though again any suitable regression could be used. Given the dizzying array of possible regression approaches for each component of this model, it could be quite useful for practitioners to have both a formal framework for selecting amongst the many choices and a general recommendation on an optimal strategy when the goal is to obtain high prediction accuracy.
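The decomposition can be checked numerically on a toy sample. The sketch below ignores covariates and uses made-up cost values (an assumption for illustration only); it shows that the share of positive costs multiplied by the mean of the positive costs reproduces the overall mean, zeros included.

```python
# Hypothetical toy costs: three people with no spending, three with some.
costs = [0.0, 0.0, 0.0, 120.0, 300.0, 1800.0]

positive = [c for c in costs if c > 0]

# Part 1: probability of any expenditure, Pr(Y > 0).
p_positive = len(positive) / len(costs)

# Part 2: mean expenditure among those with positive cost, E[Y | Y > 0].
mean_positive = sum(positive) / len(positive)

# Two-part estimate of the overall mean: Pr(Y > 0) * E[Y | Y > 0].
two_part_mean = p_positive * mean_positive

# Direct estimate of the overall mean, zeros included.
overall_mean = sum(costs) / len(costs)

print(p_positive, mean_positive, two_part_mean, overall_mean)
```

In a regression setting each of the two quantities would be replaced by a fitted model evaluated at the covariates X.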
2.1.2. Super Learning
Super learning provides a formal means of selecting a combination of algorithms that best fits the true regression function. Here, the notion of “best” refers to a cross-validated risk criterion. Suppose we have a dataset of independent observations on n individuals, Oi = (Yi, Xi), i = 1, …, n, where Y is the outcome of interest and X is a set of covariates. Suppose we have access to a candidate prediction function $\hat{Q}$, e.g., one that provides a prediction $\hat{Q}(x)$ for a new observation x based on a fitted two-stage model. We introduce the notion of the risk of $\hat{Q}$, which provides a global summary of how well $\hat{Q}$ predicts outcomes Y based on covariates X. Often (Benkeser et al. 2020) we rely on risk measures that can be expressed as the average discrepancy between Y and the prediction made by $\hat{Q}$. In other words, given a regression $\hat{Q}$, we can define a loss function $L(\hat{Q})$ that takes as input a particular data point (x, y) and returns a real number measuring the discrepancy between $\hat{Q}(x)$ and y. A larger value of the loss indicates a larger gap between prediction and truth. Common loss functions for cost data include mean squared error (MSE) for mean cost estimation, mean absolute error (MAE) for median cost estimation and the “check function” (Yu 2003) for quantile cost estimation (e.g., the 90th percentile of cost). In this study, we explicitly consider the squared-error loss function and consider mean squared error (MSE), $E[(Y - \hat{Q}(X))^2]$, as our primary risk criterion. Given a risk criterion, we can define the optimal prediction function, say Q0, as the function that minimizes risk over all possible prediction functions.
The performance optimization problem can be equivalently defined as a statistical estimation problem. For example, it is straightforward to show that for squared error loss, Q0(X) = E(Y|X). Thus, the task of learning the optimal (with respect to MSE) prediction function is equivalent to the task of estimating the conditional mean of cost given covariates. Super learner provides one objective and flexible strategy for accomplishing this task.
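The link between the loss function and the estimation target can be illustrated with constant predictions. The following sketch uses hypothetical cost values and a coarse grid search, both assumptions made for illustration; it shows that the squared-error risk is minimized near the sample mean, while the absolute-error risk is minimized at the sample median.

```python
from statistics import mean, median

def squared_error(y, pred):
    return (y - pred) ** 2

def absolute_error(y, pred):
    return abs(y - pred)

def check_loss(y, pred, tau):
    """Quantile ("check") loss at level tau in (0, 1)."""
    diff = y - pred
    return tau * diff if diff >= 0 else (tau - 1) * diff

def empirical_risk(loss, ys, pred, **kwargs):
    """Average loss of a constant prediction over the sample."""
    return sum(loss(y, pred, **kwargs) for y in ys) / len(ys)

# Hypothetical skewed costs with a point mass at zero.
costs = [0.0, 0.0, 50.0, 100.0, 150.0, 400.0, 5000.0]

# Search a grid of constant predictions (step 10) for each risk minimizer.
grid = [float(g) for g in range(0, 5001, 10)]
best_mse = min(grid, key=lambda g: empirical_risk(squared_error, costs, g))
best_mae = min(grid, key=lambda g: empirical_risk(absolute_error, costs, g))

print("MSE minimizer:", best_mse, "sample mean:", mean(costs))
print("MAE minimizer:", best_mae, "sample median:", median(costs))
```

With skewed data the two targets differ widely, which is why the choice of loss function matters for cost prediction.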
In a given problem, there are often many different individual approaches to developing a prediction function. In super learning, we call each approach an algorithm and refer to a pre-specified collection of algorithms as a library. Here, “algorithm” is used in a general sense as any means of mapping a given data set into a prediction function. We use training to refer to the process of applying an algorithm to data. Examples of algorithms include (i) fitting ordinary least squares regression and returning the linear predictor; (ii) performing variable screening based on a univariate significance threshold, then applying ordinary least squares regression; (iii) training a random forest, where tuning parameters are selected via cross-validation. The super learner library should be, to the greatest extent possible, informed by subject-area expertise, but can utilize data-driven, machine learning approaches as well (Benkeser et al. 2020).
Suppose we have K potential algorithms. The implementation of a super learner involves using cross-validation to determine an ensemble of individual algorithms. The process can be implemented as follows.
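1. Train each algorithm in the library on the entire dataset, yielding prediction functions $\hat{Q}_k$, k = 1, …, K.
2. Randomly split the data into V mutually exclusive and exhaustive blocks of approximately equal size. For υ = 1, …, V, define the υ-th block as the υ-th validation sample, and the remaining V − 1 blocks as the υ-th training sample.
3. For υ = 1, …, V, train each of the K algorithms using the υ-th training sample, yielding prediction functions $\hat{Q}_{k,\upsilon}$. For each observation Xi in the υ-th validation sample, obtain the prediction $\hat{Q}_{k,\upsilon}(X_i)$.
4. Propose an ensemble of algorithms in the library indexed by a K-length weight vector α, e.g.,
$$\hat{Q}_{\alpha} = \sum_{k=1}^{K} \alpha_k \hat{Q}_k$$
5. Find the weights that minimize the cross-validated mean squared error of the ensemble over all possible values of α, e.g.,
$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n} \sum_{\upsilon=1}^{V} \sum_{i \in \text{validation sample } \upsilon} \Big( Y_i - \sum_{k=1}^{K} \alpha_k \hat{Q}_{k,\upsilon}(X_i) \Big)^2$$
where the minimum is taken over all αk ≥ 0 ∀ k, $\sum_{k=1}^{K} \alpha_k = 1$. The super learner is
$$\hat{Q}_{\hat{\alpha}} = \sum_{k=1}^{K} \hat{\alpha}_k \hat{Q}_k$$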
In step 4, many choices of ensembles could be considered. One particular choice is to build the ensemble from a set of weight vectors that each assigns a weight of 1 to a particular algorithm and 0 weight to all others. The choice of assigning weight of 1 to a particular algorithm is known as the cross-validation selector or discrete super learner since it represents the individual algorithm with the lowest cross-validated risk.
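The procedure can be sketched in a few lines for the special case of two candidate algorithms, where the weight vector reduces to a single number between 0 and 1. This is a minimal illustration under stated assumptions, not the authors' implementation: the two algorithms (an intercept-only model and a one-covariate least-squares line), the grid search over weights, and the simulated data are all hypothetical choices.

```python
import random

def fit_mean(xs, ys):
    """Algorithm 1: intercept-only model (predicts the training mean)."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Algorithm 2: least-squares line on a single covariate."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cv_predictions(xs, ys, algorithms, V, seed):
    """Split into V folds and return out-of-fold predictions per algorithm."""
    n = len(ys)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[v::V] for v in range(V)]  # exclusive and exhaustive blocks
    preds = [[0.0] * n for _ in algorithms]
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for k, alg in enumerate(algorithms):
            fit = alg(tx, ty)  # trained without the validation block
            for i in fold:
                preds[k][i] = fit(xs[i])
    return preds

def super_learner_two(xs, ys, algorithms, V=5, grid=101, seed=0):
    """Super learner for exactly two algorithms; weights are (a, 1 - a)."""
    assert len(algorithms) == 2
    n = len(ys)
    full_fits = [alg(xs, ys) for alg in algorithms]  # fits on all data
    p0, p1 = cv_predictions(xs, ys, algorithms, V, seed)

    def cv_mse(a):
        return sum((y - a * u - (1 - a) * w) ** 2
                   for y, u, w in zip(ys, p0, p1)) / n

    # Grid search over the weight simplex (a >= 0, weights sum to 1).
    alphas = [g / (grid - 1) for g in range(grid)]
    a_hat = min(alphas, key=cv_mse)
    weights = [a_hat, 1 - a_hat]
    cv_risk_each = [cv_mse(1.0), cv_mse(0.0)]
    return {
        "weights": weights,
        "cv_risk": cv_mse(a_hat),
        "cv_risk_each": cv_risk_each,
        # Discrete super learner: the single algorithm with lowest CV risk.
        "discrete": cv_risk_each.index(min(cv_risk_each)),
        "predict": lambda x: sum(w * f(x) for w, f in zip(weights, full_fits)),
    }

# Hypothetical data: costs rise with x and are truncated at zero.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [max(0.0, 200.0 * x - 600.0 + rng.gauss(0, 150)) for x in xs]
result = super_learner_two(xs, ys, [fit_mean, fit_line])
print(result["weights"], result["discrete"], result["predict"](5.0))
```

Because weights of 0 and 1 are on the grid, the ensemble's cross-validated risk can never exceed that of the discrete super learner. With more than two algorithms, a constrained least-squares solver would replace the grid search.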
Theoretical guarantees (van der Laan and Dudoit 2003; van der Laan et al. 2006; van der Laan et al. 2007) state that in large samples the super learner should predict costs essentially as well as or better than the best-performing single algorithm amongst the K considered. Because the super learner is only capable of predicting as well as the best amongst the K algorithms used in its construction, this theory suggests that it is advantageous to consider a diverse mixture of algorithms when constructing a super learner. In the next section, we propose a strategy for obtaining a large and varied collection of modeling approaches in the context of zero-inflated data.
2.1.3. Two-stage Super Learner
We propose a two-stage super learner, in which we specify one library for Pr(Y > 0 | X) and another for E[Y | Y > 0, X]. The overall two-stage super learner library consists of all pairwise combinations of algorithms from these two libraries. Assuming the stage-1 library includes K1 algorithms and the stage-2 library includes K2 algorithms, the two-stage super learner’s “whole library” would contain K1 × K2 candidate algorithms, each representing a specific combination of one stage-1 algorithm and one stage-2 algorithm.
The implementation of the two-stage super learner is similar to that of the super learner described above. The key difference is that each of our proposed candidate algorithms consists of two constituent algorithms: one for Pr(Y > 0 | X) and one for E[Y | Y > 0, X]. A prediction of Y is obtained as the product of the predictions made by each trained algorithm. Beyond this modification in how predictions are obtained, our approach is implemented as described above and the theoretical results governing the behavior of the original super learner immediately apply to our approach.
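A minimal sketch of how such a library might be assembled is shown below. The learner names and the constant predictions are placeholders standing in for trained models (assumptions for illustration); the point is only that each candidate is the product of one stage-1 and one stage-2 prediction.

```python
from itertools import product

# Placeholder stage-1 learners: each maps covariates x to an estimate
# of Pr(Y > 0 | x). Constants stand in for trained models.
stage1 = {
    "logistic": lambda x: 0.8,
    "lasso_logistic": lambda x: 0.75,
    "random_forest": lambda x: 0.7,
}

# Placeholder stage-2 learners: each maps x to an estimate of
# E[Y | Y > 0, x].
stage2 = {
    "gamma_glm_log": lambda x: 1000.0,
    "log_ols_smearing": lambda x: 1100.0,
}

def combine(p_fn, mu_fn):
    """Two-stage candidate: product of the stage-1 and stage-2 predictions."""
    return lambda x: p_fn(x) * mu_fn(x)

# All pairwise combinations: K1 * K2 candidate algorithms.
library = {
    (name1, name2): combine(f1, f2)
    for (name1, f1), (name2, f2) in product(stage1.items(), stage2.items())
}

print(len(library))
print(library[("logistic", "gamma_glm_log")](None))
```

Each entry of `library` can then be treated as a single algorithm in the cross-validation and weighting steps described earlier.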
Our approach is implemented in a freely available R package. In addition to the two-stage learning approach, our software handles additional challenges associated with healthcare expenditure data. In particular, we implement a novel cross-validation approach that ensures an approximately equal split of zeros and outliers over the folds. The package also provides a modified quadratic programming approach to calculate super learner weights in the presence of very large residual values (Web Supplement-A).
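The package's actual fold-construction procedure is described in Web Supplement-A and is not reproduced here. As a rough illustration of the general idea only, the sketch below assigns observations to folds separately within three strata (zeros, typical positive costs, and large costs), so that each fold receives a similar share of each. The stratum definitions, the 95th-percentile cut-off, and the function name are all hypothetical.

```python
import random

def stratified_folds(costs, V=10, outlier_quantile=0.95, seed=0):
    """Assign each observation to one of V folds, balancing zeros and large costs."""
    rng = random.Random(seed)
    positive = sorted(c for c in costs if c > 0)
    # Hypothetical outlier rule: positive costs above an empirical quantile.
    if positive:
        cutoff = positive[int(outlier_quantile * (len(positive) - 1))]
    else:
        cutoff = float("inf")
    strata = {"zero": [], "typical": [], "outlier": []}
    for i, c in enumerate(costs):
        key = "zero" if c == 0 else ("outlier" if c > cutoff else "typical")
        strata[key].append(i)
    fold_of = [None] * len(costs)
    offset = 0
    for members in strata.values():
        rng.shuffle(members)
        # Deal members round-robin; the offset spreads leftovers across folds.
        for j, i in enumerate(members):
            fold_of[i] = (j + offset) % V
        offset += len(members)
    return fold_of

# Hypothetical sample: 30 zeros and 70 log-normal positive costs.
data_rng = random.Random(42)
costs = [0.0] * 30 + [data_rng.lognormvariate(6, 1.5) for _ in range(70)]
folds = stratified_folds(costs, V=5)
zero_counts = [sum(1 for c, f in zip(costs, folds) if c == 0 and f == v)
               for v in range(5)]
print(zero_counts)
```

Within each stratum the fold counts differ by at most one, which keeps the proportion of zeros roughly constant across training and validation samples.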
2.1.4. One-stage Super Learner
The standard one-stage super learner is similar to the two-stage super learner, except that it does not split the estimation problem into two parts. When zero inflation is high, this may lead to decreased performance. However, if zero inflation is not prominent in a particular dataset, it may not be necessary to estimate the probability of zero expenditures, and thus one-stage super learner may perform well. The one-stage super learner can be computationally faster than the two-stage super learner. Since the one-stage super learner has not been widely used for healthcare cost estimation, an additional objective of this study was to examine the performance of one-stage super learner in various situations a healthcare cost analyst may encounter.
2.2. Monte Carlo Simulations
We evaluated the relative performance of the two-stage super learner compared to common approaches across a wide range of settings using Monte Carlo simulations. In particular, we studied how the two-stage super learner behaved in comparison with one-stage super learner and individual algorithms when varying the sample size, zero-inflation amount, non-zero distribution and complexity in the cost data.
2.2.1. Data generating process
For the simulation study, categorical and continuous predictors were generated to cover as many scenarios as possible and to approximate covariates seen in real data.
Specifically, predictors were simulated as follows:
Variables X1, …, X5 impacted the distribution of costs, while the others were noise. To assess how sample size, zero percentage, non-zero distribution, and data complexity affect the estimates, we varied four aspects of the data generating process: sample size (small (500) vs. large (2000)), zero inflation percentage (low (5%) vs. high (70%)), distribution of non-zero costs (Log-normal, Gamma, Tweedie, Mixture), and data complexity (two-way interactions among covariates, Yes and No). The combination of the four settings above results in 32 scenarios. For each scenario, we analyzed 1000 simulated data sets.
Costs were simulated using a two-stage procedure to allow for point mass at zero. In the first stage, we drew a random variable Z from a Bernoulli distribution with the probability of zero determined by a logistic model (Web Supplement-B). If Z = 0, then we set costs to zero. If Z = 1, we drew the observed costs from one of the four different non-zero distributions listed above (Web Supplement-C & Web Supplement-Figure 1).
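The two-stage generating mechanism can be sketched as follows. The actual logistic model and cost distributions are given in Web Supplement-B and Web Supplement-C; the single covariate, the coefficient values, and the log-normal parameters below are hypothetical stand-ins chosen only to produce roughly 70% zeros.

```python
import math
import random

def simulate_costs(n, intercept, slope, seed=0):
    """Draw (x, y) pairs with a point mass at zero and log-normal positive costs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # one hypothetical covariate
        # Stage 1: logistic model for the probability of any cost, Pr(Z = 1 | x).
        p_positive = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        z = 1 if rng.random() < p_positive else 0
        # Stage 2: positive cost drawn only when Z = 1; otherwise cost is zero.
        y = rng.lognormvariate(7.0 + 0.5 * x, 1.0) if z == 1 else 0.0
        data.append((x, y))
    return data

sample = simulate_costs(2000, intercept=-0.85, slope=0.5, seed=1)
share_zero = sum(1 for _, y in sample if y == 0.0) / len(sample)
print(round(share_zero, 3))
```

Swapping the log-normal draw for a Gamma, Tweedie, or mixture draw changes the shape of the positive part without altering the zero-inflation mechanism.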
2.2.2. Prediction algorithms
A previous study (Deb and Norton 2018) indicated that the choice of the stage-2 model may alter results more dramatically than the choice of the stage-1 model. Accordingly, we focused our two-stage super learner on diversity in algorithms for stage-2 by including 10 algorithms in the stage-2 library, while including only 3 algorithms in the stage-1 library. Additionally, we were interested in how the performance of a two-stage super learner compared to that of a standard super learner, so we also fit a standard super learner with a library of 8 single-stage algorithms. We refer to this algorithm as the one-stage super learner. The algorithms considered in the simulation study include a mixture of machine learning algorithms such as random forests and parametric regressions such as GLM with Gamma distribution and log link (Web Supplement-Table 3). We compared the performance of each of the 38 algorithms (30 combinations of two-stage algorithms + 8 individual algorithms), the one-stage super learner, the two-stage super learner built from 38 algorithms, and the discrete super learner, that is, the single algorithm in the two-stage super learner library with the lowest cross-validated risk. For each two-stage super learner, we used V = 10 folds for cross-validation.
2.2.3. Evaluation metrics
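Performances of prediction algorithms (individual algorithms, one-stage super learner, discrete super learner, and two-stage super learner) were evaluated on an independent test set of size ntest. Let yi be the true outcome values on the test set, $\hat{y}_i$ be the predicted outcome values from a given algorithm on the test set, and $\bar{y}$ be the mean of yi. We use the following metrics for evaluation:
- Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2$
- Relative MSE: $\mathrm{MSE}_{\text{algorithm}} / \mathrm{MSE}_{\text{two-stage super learner}}$
- Coefficient of determination (R²): $R^2 = 1 - \sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2 \big/ \sum_{i=1}^{n_{test}} (y_i - \bar{y})^2$
- Relative Efficiency (RE): $R^2_{\text{algorithm}} / R^2_{\text{two-stage super learner}}$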
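The four metrics can be computed with a few helper functions. In this sketch, Relative MSE and RE are taken as ratios of an algorithm's MSE and R² to those of a reference algorithm, a reading that is consistent with the values reported in Table 2 but is stated here as an assumption. The test-set outcomes and predictions are made up.

```python
def mse(y, pred):
    """Mean squared error on the test set."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def r_squared(y, pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def relative_mse(y, pred, reference_pred):
    """MSE of an algorithm divided by the MSE of the reference algorithm."""
    return mse(y, pred) / mse(y, reference_pred)

def relative_efficiency(y, pred, reference_pred):
    """R2 of an algorithm divided by the R2 of the reference algorithm."""
    return r_squared(y, pred) / r_squared(y, reference_pred)

# Hypothetical test-set outcomes and two sets of predictions.
y_test = [0.0, 0.0, 100.0, 300.0]
reference = [10.0, 10.0, 110.0, 270.0]   # stands in for the reference learner
mean_only = [100.0, 100.0, 100.0, 100.0]  # predicts the test-set mean

print(mse(y_test, reference), r_squared(y_test, reference))
print(relative_mse(y_test, mean_only, reference),
      relative_efficiency(y_test, mean_only, reference))
```

A Relative MSE above 1 or an RE below 1 indicates that the algorithm performs worse than the reference.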
2.3. Real-world data
We made the following two changes to the two-stage super learner to adapt it to the real-world data. First, we increased the number of candidate learners included in the two-stage super learner to include Support Vector Machines (Meyer et al. 2021), Neural Networks (Venables & Ripley 2002) and boosted regression trees (Chen et al. 2021). These algorithms were included both in the stage-1 and stage-2 libraries, as well as in the single-stage library. Second, an anonymous reviewer suggested that we include algorithms that return a prediction of zero whenever the corresponding estimated probability of Y being positive is less than a threshold value c, i.e., $\hat{p}(X) < c$. As a result, we added extra candidate algorithms into the two-stage super learner with the form:
$$\hat{Y} = I(\hat{p}(X) \ge c) \times \hat{p}(X) \times \hat{\mu}(X)$$
where $I(\hat{p}(X) \ge c)$ is an indicator function of whether the stage-1 prediction is at least the threshold c, $\hat{p}(X)$ is the stage-1 prediction of Pr(Y > 0 | X) from a given stage-1 algorithm, and $\hat{\mu}(X)$ is the stage-2 prediction of E[Y | Y > 0, X] from a given stage-2 algorithm. We might consider multiple, say nc, different threshold values, since the best threshold in practice would be unknown. In this case, the two-stage super learner library consists of K1 × K2 + nc × K1 × K2 algorithms, and the two-stage super learner is again the optimal weighted combination of these algorithms. We compared empirically whether the addition of algorithms so defined improved the overall performance of the two-stage super learner. Our primary goal in the data analysis was to study the performance of the two-stage super learner for predicting costs. However, as it is often of interest to predict the presence of any costs, we additionally provide summaries of the performance of a super learner built from only the stage-1 models to this end.
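A thresholded candidate and the resulting library size can be sketched as below. The function names and the example library sizes are hypothetical; the rule itself (predict zero when the stage-1 probability falls below c, otherwise use the two-stage product) follows the description above.

```python
def thresholded_prediction(p_hat, mu_hat, c):
    """Predict 0 when the estimated Pr(Y > 0 | X) is below c, else p_hat * mu_hat."""
    return p_hat * mu_hat if p_hat >= c else 0.0

def library_size(k1, k2, n_c):
    """Number of candidates: K1 * K2 plain pairs plus n_c thresholded copies of each."""
    return k1 * k2 + n_c * k1 * k2

# A person with a low estimated probability of any cost is predicted to cost 0.
print(thresholded_prediction(0.2, 1000.0, c=0.25))
# A person at or above the threshold keeps the usual two-stage prediction.
print(thresholded_prediction(0.5, 1000.0, c=0.5))
# Hypothetical example: 3 stage-1 and 10 stage-2 algorithms with 2 thresholds.
print(library_size(3, 10, 2))
```

The thresholded candidates differ from the plain ones only for observations with small estimated probabilities, so they mainly affect predictions near zero.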
2.3.1. MEPS Data
We analyzed the Medical Expenditure Panel Survey (MEPS) data from 2016 to 2017 (Cohen 2003; Cohen 2009). MEPS is a national survey on the financing and use of medical care of families and individuals, their medical providers, and employers across the United States. Participating households were drawn from a nationally representative subsample and provided demographic information, health status, self-reported medical conditions, medical expenditure and utilization, health insurance coverage, and access to care for medical events. For some individuals, self-reported medical expenditures are supplemented with information from medical providers and insurers.
We used the longitudinal data of MEPS 2016–2017, with 2016 MEPS used as a training sample and the same individuals in the 2017 MEPS as a testing sample. We developed a prediction model for total annual healthcare expenditures based on 14 covariates that included demographics, medical conditions, and insurance characteristics (Web Supplement-Table 7). Total annual healthcare expenditures include out-of-pocket payments and third-party payments from all sources but exclude insurance premiums. The sample included only adults and excluded observations with missing data. The final sample contained 10,925 observations for training and 10,815 observations for testing. We used a broader library of algorithms compared to the simulation, two choices of threshold values (nc = 2 with c values of 0.25 and 0.5) and ten-fold cross-validation to build the two-stage super learner (Web Supplement-Table 8).
2.3.2. BOLD Data
The Back Pain Outcomes using Longitudinal Data (BOLD) project is a large, community-based registry of patients aged 65 years and older who presented with primary care visits for a new episode of back pain from March 2011 to March 2013 (Jarvik et al. 2012). The BOLD data are derived from self-reported questionnaires and electronic health records (EHR). Expenditures were calculated as total relative value units (RVUs), a measure of value used in the US Medicare reimbursement formula for physician services (Glass and Anderson 2002).
In this study, we considered 4 spine-related total annual RVUs as outcomes, including 1) total spine-related RVUs; 2) spine-related physical therapy RVUs; 3) spine-related injection RVUs; 4) spine-related imaging RVUs. These four outcomes allow us to empirically compare performance under varying levels of zero-inflation. We developed separate prediction models of each outcome based on 24 covariates from baseline patient questionnaires and EHR, see Web Supplement-D & Web Supplement-Table 11 for details. We excluded observations in the entire cohort with missing data in outcomes as well as covariates and the final sample contained 4,397 observations. We used a slightly different library of algorithms and applied two threshold values for each of the 4 outcomes in building the two-stage super learner (Web Supplement-Table 12). To evaluate the models, the data were randomly partitioned into ten distinct folds. Every single algorithm, one-stage super learner, discrete super learner and the two-stage super learner were trained using nine of the folds and predictions were obtained using the remaining fold. The performance was estimated as the MSE of the predictions on the validation fold, averaged over the ten validation folds.
3. Results
3.1. Monte Carlo Simulation
On average, across all settings, the two-stage super learner had the smallest MSE (Table 1), followed by the discrete super learner and the one-stage super learner. See Web Supplement-Table 4 & 5 for results under other evaluation metrics. Overall, the discrete super learner (the approach that used cross-validation to select the single best algorithm from the two-stage super learner library for each situation, as opposed to using a weighted average of all algorithms) was the second best-performing algorithm, while the one-stage super learner was the third. The overall MSE improvements of the two-stage super learner over the discrete super learner, the one-stage super learner and the best single algorithm were about 1.9%, 4.0% and 9.1%, respectively. Other algorithms with favorable performance included combinations of the Lasso model at stage-1 with GLM (log-link & gamma distribution) at stage-2 and the Zero-inflated Negative-Binomial (ZINB) model. The overall Interquartile Range (IQR) of MSE for the two-stage super learner was smaller than for other algorithms, indicating robust prediction performance under different data settings. The improvement of the two-stage super learner over the one-stage super learner was more evident under high zero-inflation (70% zero), with an MSE improvement of 9.7%. This was in line with expectations, as using a two-stage model is likely to be more beneficial the greater the point mass at zero. When zero inflation was small (5% zero), the one-stage super learner was superior and had the smallest MSE, but the two-stage super learner was the second best-performing algorithm and outperformed the best single algorithm (ZINB) with an MSE improvement of 4.1%.
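The quoted percentages can be reproduced from the mean MSE values in Table 1, assuming that "improvement" means the relative reduction in mean MSE with respect to the comparator (an interpretation that matches the reported figures).

```python
def pct_improvement(mse_comparator, mse_two_stage):
    """Relative reduction in MSE, in percent, of the two-stage super learner."""
    return 100.0 * (mse_comparator - mse_two_stage) / mse_comparator

# Mean MSE values (x 10^8) copied from Table 1, "Overall" column.
two_stage_overall = 3.239
comparators = {"discrete": 3.301, "one_stage": 3.374, "best_single": 3.563}
improvements = {name: round(pct_improvement(value, two_stage_overall), 1)
                for name, value in comparators.items()}

# High zero-inflation: one-stage (3.367) vs two-stage (3.041).
high_zero = round(pct_improvement(3.367, 3.041), 1)
# Low zero-inflation: ZINB (3.585) vs two-stage (3.437).
low_zero_vs_zinb = round(pct_improvement(3.585, 3.437), 1)

print(improvements, high_zero, low_zero_vs_zinb)
```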
Table 1.
Average and IQR of MSE for all algorithms across 32 data generating processes
Algorithm^a | Overall MSE^b, mean [IQR] (×10^8) | Low zero-inflation MSE, mean [IQR] (×10^8) | High zero-inflation MSE, mean [IQR] (×10^8) |
---|---|---|---|
Two-stage Super Learner | 3.239 [1.354, 4.203] | 3.437 [1.329, 4.475] | 3.041 [1.394, 4.043] |
Discrete Super Learner | 3.301 [1.364, 4.362] | 3.459 [1.272, 4.494] | 3.143 [1.483, 4.204] |
One-stage Super Learner | 3.374 [1.482, 4.424] | 3.381 [1.398, 4.472] | 3.367 [1.692, 4.367] |
S1^c: Lasso + S2^d: GLM-Gamma-Log^e | 3.563 [1.511, 4.610] | 3.780 [1.446, 5.066] | 3.345 [1.549, 4.230] |
S1: Lasso + S2: Quantile regression | 3.567 [1.502, 4.579] | 3.784 [1.447, 5.066] | 3.350 [1.546, 4.219] |
S1: Lasso + S2: Log OLS-smearing | 3.570 [1.504, 4.614] | 3.786 [1.448, 5.066] | 3.354 [1.546, 4.232] |
Zero-inflated Negative Binomial (ZINB)^f | 3.582 [1.526, 4.641] | 3.585 [1.363, 4.820] | 3.579 [1.641, 4.529] |
S1: GLM^g + S2: GLM-Gamma-Log | 3.583 [1.511, 4.638] | 3.809 [1.451, 5.124] | 3.356 [1.543, 4.246] |
S1: GLM + S2: Quantile regression | 3.589 [1.511, 4.606] | 3.818 [1.448, 5.140] | 3.359 [1.536, 4.227] |
S1: GLM + S2: Log OLS-smearing | 3.590 [1.512, 4.640] | 3.819 [1.448, 5.134] | 3.361 [1.538, 4.234] |
Zero-inflated Poisson (ZIP) | 3.601 [1.631, 4.751] | 3.613 [1.485, 4.919] | 3.589 [1.770, 4.655] |
S1: Lasso^h + S2: Adaptive GLM | 3.784 [1.531, 4.804] | 3.905 [1.398, 5.017] | 3.663 [1.638, 4.623] |
S1: GLM + S2: Adaptive GLM | 3.803 [1.524, 4.831] | 3.937 [1.406, 5.089] | 3.668 [1.632, 4.615] |
S1: RF^i + S2: GLM-Gamma-Log^j | 4.381 [2.176, 5.453] | 4.604 [2.227, 5.890] | 4.157 [2.115, 5.002] |
S1: RF + S2: Quantile regression | 4.385 [2.180, 5.396] | 4.609 [2.237, 5.855] | 4.160 [2.122, 4.973] |
S1: RF + S2: Log OLS-smearing^k | 4.390 [2.185, 5.447] | 4.617 [2.232, 5.911] | 4.163 [2.130, 4.997] |
S1: RF + S2: Adaptive GLM | 4.626 [2.204, 5.658] | 4.937 [2.261, 6.133] | 4.314 [2.176, 5.111] |
S1: GLM + S2: RF | 4.898 [1.833, 7.055] | 5.550 [1.861, 8.202] | 4.246 [1.809, 6.232] |
S1: Lasso + S2: RF | 4.900 [1.830, 7.055] | 5.556 [1.866, 8.217] | 4.244 [1.814, 6.224] |
RF | 5.304 [2.270, 7.396] | 5.439 [2.033, 7.782] | 5.168 [2.592, 7.139] |
S1: RF + S2: RF | 5.506 [2.262, 7.701] | 6.250 [2.256, 9.016] | 4.761 [2.264, 6.723] |
S1: Lasso + S2: AFT (generalized Gamma) | 6.321 [2.014, 9.634] | 6.458 [1.949, 9.906] | 6.183 [2.044, 9.357] |
S1: GLM + S2: AFT (generalized Gamma) | 6.339 [2.013, 9.650] | 6.488 [1.938, 9.930] | 6.190 [2.046, 9.359] |
S1: RF + S2: AFT (generalized Gamma) | 7.049 [2.714, 10.650] | 7.332 [2.801, 11.014] | 6.765 [2.651, 10.081] |
Tobit | 9.165 [3.129, 14.079] | 9.184 [2.920, 13.677] | 9.145 [3.283, 14.415] |
S1: Lasso + S2: Lasso (OLS) | 9.295 [2.082, 15.509] | 9.742 [2.019, 15.715] | 8.848 [2.118, 15.241] |
S1: GLM + S2: Lasso (OLS) | 9.313 [2.099, 15.523] | 9.776 [2.028, 15.844] | 8.850 [2.120, 15.275] |
Lasso (OLS) | 9.677 [2.553, 15.509] | 9.710 [2.316, 15.273] | 9.643 [2.739, 15.754] |
OLS | 9.703 [2.571, 15.539] | 9.737 [2.356, 15.304] | 9.668 [2.744, 15.771] |
S1: RF + S2: Lasso (OLS) | 9.907 [2.366, 16.238] | 10.373 [2.292, 16.527] | 9.440 [2.395, 15.869] |
S1: Lasso + S2: Adaptive hazard | 9.920 [2.463, 14.535] | 12.270 [2.782, 18.840] | 7.570 [2.275, 11.282] |
S1: GLM + S2: Adaptive hazard | 9.933 [2.463, 14.568] | 12.292 [2.768, 18.995] | 7.574 [2.276, 11.294] |
S1: RF + S2: Adaptive hazard | 10.226 [2.600, 14.994] | 12.504 [2.758, 19.367] | 7.947 [2.449, 11.819] |
Tweedie | 11.273 [2.558, 16.989] | 12.155 [2.331, 16.723] | 10.391 [2.879, 17.323] |
S1: GLM + S2: Cox hazard | 12.341 [2.134, 19.414] | 12.510 [1.958, 19.552] | 12.172 [2.302, 19.289] |
S1: Lasso + S2: Cox hazard | 12.353 [2.142, 19.442] | 12.525 [1.973, 19.664] | 12.181 [2.303, 19.335] |
S1: RF + S2: Cox hazard | 12.820 [2.461, 20.298] | 13.030 [2.314, 20.360] | 12.609 [2.600, 20.242] |
S1: GLM + S2: GLM-Gamma-Identity | 13.249 [2.230, 22.589] | 13.351 [2.046, 22.284] | 13.147 [2.437, 22.696] |
S1: Lasso + S2: GLM-Gamma-Identity | 13.263 [2.248, 22.583] | 13.369 [2.059, 22.267] | 13.157 [2.440, 22.713] |
S1: RF + S2: GLM-Gamma-Identity | 13.717 [2.519, 22.934] | 13.867 [2.351, 22.812] | 13.567 [2.677, 23.096] |
OLS intercept only (mean) | 17.512 [2.900, 28.989] | 17.560 [2.649, 28.778] | 17.464 [3.096, 29.118] |
^a Algorithms are presented in ascending order according to the overall mean MSE.
^b Lower MSE indicates better performance.
^c S1 refers to the stage-1 algorithm.
^d S2 refers to the stage-2 algorithm.
^e Best single algorithm under high zero-inflation.
^f Best single algorithm under low zero-inflation.
^g GLM in S1 refers to logistic regression.
^h Lasso in S1 refers to logistic Lasso regression.
^i RF refers to Random Forest.
^j GLM-Gamma-Log refers to GLM with Gamma distribution and Log link function.
^k Log OLS-smearing refers to logarithmic OLS with smearing retransformation.
Web Supplement-Figure 2 presents more extensive results. Overall, we found that no single algorithm was best in all cases and although in large samples the super learner theoretically should enjoy essentially the same or better performance compared with the best-performing single algorithm, in small samples there are instances where a single algorithm provides the best performance. Practically, however, the analyst will not know which algorithm fits the data best a priori, which supports the idea of using a super learner approach. Across a large number of cases, the two-stage super learner was more robust than the individual algorithms and close to the best algorithm under all different data settings, while the performances of individual algorithms changed dramatically across different data settings.
3.2. Empirical analysis
3.2.1. MEPS
Distributions of total expenditures in 2016 and 2017 were both zero-inflated, with almost 20% of observations having zero expenditures, and highly skewed, with 2% of observations having expenditures over $50,000 (Web Supplement-Table 6, 7 & Web Supplement-Figure 3).
Both two-stage super learners (with and without thresholding) performed better than all the individual algorithms and the one-stage super learner, with the lowest test-set MSE (Table 2). Inclusion of thresholding algorithms led to a minor improvement in the MSE of the two-stage super learner. The one-stage super learner was among the top 20 algorithms, with an MSE lower than that of any single-stage individual algorithm (Table 2). Relative MSE for individual algorithms compared to the two-stage super learner including thresholding ranged from 1.018 (Random Forest at both stages with indicator I(P(Y > 0) ≥ 0.5)) to 1.192 (model with intercept only). Adding a two-stage model to account for zero-inflation appears to improve the predictive performance of both Random Forest and Lasso relative to those algorithms being applied in a single-stage model. The choice of stage-2 model seems to matter more than the choice of stage-1 model, as algorithms with the same stage-2 algorithm behave similarly despite having different stage-1 algorithms. Random Forests provided nontrivial improvements over parametric regressions, with a lower MSE. Our analysis also revealed strong performance of a super learner for predicting the presence of any cost, with a test-set AUC of 0.83 (Web Supplement-Table 9).
Table 2.
Results of top 30 algorithms in MEPS analysis
Algorithm^a | MSE^b (×10^8) | Relative MSE | R2 | RE^c | MAE |
---|---|---|---|---|---|
Two-stage Super Learner + Thresholding | 2.1491 | 1.0000 | 0.1580 | 1.0000 | 5848 |
Two-stage Super Learner | 2.1590 | 1.0046 | 0.1558 | 0.9865 | 5873 |
I(P(Y>0)>=0.5) * (S1: RF + S2: RF) | 2.1878 | 1.0180 | 0.1446 | 0.9152 | 5929 |
Discrete two-stage Learner + Indicator | 2.1878 | 1.0180 | 0.1446 | 0.9152 | 5929 |
S1: RF + S2: RF | 2.1880 | 1.0181 | 0.1445 | 0.9148 | 5990 |
Discrete two-stage Learner | 2.1880 | 1.0181 | 0.1445 | 0.9148 | 5990 |
I(P(Y>0)>=0.25) * (S1: RF + S2: RF) | 2.1891 | 1.0186 | 0.1441 | 0.9121 | 5984 |
S1: Xgboost + S2: RF | 2.1906 | 1.0193 | 0.1435 | 0.9084 | 6029 |
I(P(Y>0)>=0.25) * (S1: Xgboost + S2: RF) | 2.1907 | 1.0194 | 0.1434 | 0.9081 | 6022 |
I(P(Y>0)>=0.5) * (S1: Xgboost + S2: RF) | 2.1911 | 1.0196 | 0.1433 | 0.9070 | 5975 |
S1: GLM + S2: RF | 2.1920 | 1.0199 | 0.1429 | 0.9050 | 6027 |
I(P(Y>0)>=0.25) * (S1: GLM + S2: RF) | 2.1922 | 1.0201 | 0.1428 | 0.9043 | 6023 |
S1: Lasso + S2: RF | 2.1925 | 1.0202 | 0.1427 | 0.9036 | 6026 |
I(P(Y>0)>=0.25) * (S1: Lasso + S2: RF) | 2.1926 | 1.0202 | 0.1427 | 0.9034 | 6023 |
I(P(Y>0)>=0.5) * (S1: GLM + S2: RF) | 2.1927 | 1.0203 | 0.1426 | 0.9031 | 5985 |
I(P(Y>0)>=0.5) * (S1: Lasso + S2: RF) | 2.1931 | 1.0205 | 0.1425 | 0.9021 | 5985 |
One-stage Super Learner | 2.2068 | 1.0269 | 0.1371 | 0.8681 | 5917 |
Single: RF | 2.2273 | 1.0364 | 0.1291 | 0.8175 | 6052 |
S1: NN + S2: RF | 2.2356 | 1.0403 | 0.1259 | 0.7968 | 6034 |
I(P(Y>0)>=0.25) * (S1: NN + S2: RF) | 2.2360 | 1.0404 | 0.1257 | 0.7960 | 6034 |
I(P(Y>0)>=0.5) * (S1: NN + S2: RF) | 2.2362 | 1.0405 | 0.1257 | 0.7955 | 6006 |
S1: SVM + S2: RF | 2.2364 | 1.0406 | 0.1255 | 0.7948 | 6027 |
I(P(Y>0)>=0.25) * (S1: SVM + S2: RF) | 2.2370 | 1.0409 | 0.1253 | 0.7934 | 6015 |
I(P(Y>0)>=0.5) * (S1: SVM + S2: RF) | 2.2375 | 1.0411 | 0.1252 | 0.7923 | 6008 |
Single: Zero-inflated Poisson (ZIP) | 2.2571 | 1.0503 | 0.1175 | 0.7436 | 6106 |
I(P(Y>0)>=0.5) * (S1: RF + S2: Lasso) | 2.2579 | 1.0506 | 0.1172 | 0.7417 | 6044 |
S1: RF + S2: Lasso | 2.2580 | 1.0507 | 0.1171 | 0.7415 | 6083 |
I(P(Y>0)>=0.25) * (S1: RF + S2: Lasso) | 2.2582 | 1.0508 | 0.1170 | 0.7410 | 6081 |
S1: Xgboost + S2: Lasso | 2.2607 | 1.0519 | 0.1161 | 0.7349 | 6127 |
I(P(Y>0)>=0.25) * (S1: Xgboost + S2: Lasso) | 2.2609 | 1.0520 | 0.1160 | 0.7343 | 6124 |
^a Algorithms are presented in ascending order based on MSE.
^b Lower MSE and relative MSE indicate better performance.
^c Greater R2 and RE indicate better performance.
S1 refers to the stage-1 algorithm.
S2 refers to the stage-2 algorithm.
RF refers to Random Forest.
GLM in S1 refers to logistic regression.
Lasso in S1 refers to logistic Lasso regression.
NN refers to Neural Network.
SVM refers to Support Vector Machine.
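Row labels of the form I(P(Y>0)>=c) * (S1: A + S2: B) in Table 2 can be read as a two-stage prediction that is set to zero whenever the stage-1 probability falls below the cutoff c. The sketch below illustrates that reading under the assumption that the two stages are combined as the product of the estimated probability of any expenditure and the estimated conditional mean, as in a standard two-part model. The function name and the example values are hypothetical, and this is not the authors' code.

```python
from typing import List, Optional, Sequence


def two_stage_predict(
    p_positive: Sequence[float],
    cond_mean: Sequence[float],
    threshold: Optional[float] = None,
) -> List[float]:
    """Combine stage-1 and stage-2 estimates into one expenditure prediction.

    p_positive: estimated P(Y > 0 | X) from the stage-1 algorithm.
    cond_mean:  estimated E[Y | Y > 0, X] from the stage-2 algorithm.
    threshold:  optional cutoff c; the prediction is set to zero when
                p_positive < c (assumed reading of the I(P(Y>0)>=c) rows).
    """
    if len(p_positive) != len(cond_mean):
        raise ValueError("stage-1 and stage-2 inputs must have equal length")
    preds = []
    for p, m in zip(p_positive, cond_mean):
        pred = p * m  # assumed two-part combination
        if threshold is not None and p < threshold:
            pred = 0.0  # indicator I(P(Y>0) >= c) equals zero
        preds.append(pred)
    return preds


# Hypothetical stage-1 probabilities and stage-2 conditional means.
p_hat = [0.10, 0.60, 0.95]
m_hat = [1000.0, 2000.0, 4000.0]

plain = two_stage_predict(p_hat, m_hat)
thresholded = two_stage_predict(p_hat, m_hat, threshold=0.5)
print(plain, thresholded)
```

With a cutoff of 0.5, only the first observation changes: its small non-zero prediction is replaced by exactly zero, while the other two predictions are left as they were.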
3.2.2. BOLD
The distributions of the four spine-related RVU outcomes were all highly skewed with heavy upper tails but varied in both scale and level of zero-inflation (Web Supplement-Table 10, Web Supplement-Figure 3). Spine-related RVUs had the lowest zero-inflation (5%), while spine-related injection RVUs had the greatest zero-inflation (91%).
In these data, we again found that both two-stage super learners (with and without thresholding algorithms) were the best-performing algorithms at all zero-inflation levels, although improvements over the one-stage super learner (MSE improvement of around 1.6–1.7%) and the best single algorithm (MSE improvement of around 1.3–1.4%) were more modest than in the MEPS data (Table 3). The impact of including thresholding algorithms in the two-stage super learner library was equivocal, with improvements seen for spine-related RVUs and spine-related injection RVUs but slight degradations in performance seen for spine-related imaging RVUs and spine-related physical therapy RVUs. The performance of single algorithms changed dramatically across zero-inflation levels (Web Supplement-E). For example, the random forest performed strongly for modeling spine-related RVUs (5% zero) and spine-related imaging RVUs (55% zero), but poorly for modeling spine-related physical therapy RVUs (85% zero) and spine-related injection RVUs (91% zero). The one-stage super learner and single-stage algorithms performed well when zero-inflation was low, as illustrated by the fact that, in modeling spine-related RVUs, the one-stage super learner was the third best-performing algorithm and single-stage Random Forest was the best single algorithm. However, these models did not perform as well for modeling spine-related injection RVUs, in which zero-inflation was greater. Super learner models built for predicting any costs had cross-validated AUCs ranging from 0.55 (spine-related RVUs) to 0.90 (spine-related physical therapy RVUs) (Web Supplement-Table 13).
Table 3.
Results for top 10 algorithms in modeling four spine-related RVUs
Algorithm^a | MSE^b | Relative MSE | R2 | RE^c | MAE |
---|---|---|---|---|---|
Spine-related RVUs (5% zero) | |||||
Two-stage Super Learner + Thresholding | 6253.45 | 1.0000 | 0.0713 | 1.0000 | 19.6995 |
Two-stage Super Learner | 6261.24 | 1.0012 | 0.0701 | 0.9838 | 19.7632 |
One-stage Super Learner | 6298.74 | 1.0072 | 0.0646 | 0.9057 | 19.8006 |
Discrete two-stage Learner + Thresholding | 6316.67 | 1.0101 | 0.0619 | 0.8683 | 19.8365 |
Discrete two-stage Learner | 6323.14 | 1.0111 | 0.0609 | 0.8548 | 19.8581 |
RF^d | 6327.06 | 1.0118 | 0.0604 | 0.8474 | 19.8749 |
S1^e: RF + S2^f: RF | 6330.65 | 1.0123 | 0.0599 | 0.8401 | 19.8695 |
I(P(Y>0)>=0.5) * (S1: RF + S2: RF) | 6331.10 | 1.0124 | 0.0598 | 0.8386 | 19.8656 |
I(P(Y>0)>=0.75) * (S1: RF + S2: RF) | 6331.62 | 1.0125 | 0.0597 | 0.8379 | 19.8767 |
I(P(Y>0)>=0.75) * (S1: GLM^g + S2: RF) | 6331.98 | 1.0126 | 0.0596 | 0.8364 | 19.8851 |
Spine-related imaging RVUs (55% zero) | |||||
Two-stage Super Learner | 54.9361 | 1.0000 | 0.0979 | 1.0000 | 4.2369 |
Two-stage Super Learner + Indicator | 55.2834 | 1.0063 | 0.0922 | 0.9417 | 4.2539 |
Discrete two-stage Learner | 55.4711 | 1.0097 | 0.0891 | 0.9102 | 4.2934 |
Discrete two-stage Learner + Indicator | 55.6390 | 1.0128 | 0.0863 | 0.8820 | 4.3064 |
S1: RF + S2: RF | 55.7874 | 1.0155 | 0.0842 | 0.8606 | 4.3505 |
S1: RF + S2: Lasso | 55.8547 | 1.0167 | 0.0828 | 0.8459 | 4.3264 |
I(P(Y>0)>=0.2) * (S1: RF + S2: RF) | 55.8598 | 1.0168 | 0.0827 | 0.8450 | 4.3232 |
S1: GLM + S2: Lasso | 55.8673 | 1.0169 | 0.0826 | 0.8437 | 4.3036 |
One-stage Super Learner | 55.8768 | 1.0171 | 0.0824 | 0.8421 | 4.2730 |
S1: GLM + S2: RF | 55.8850 | 1.0173 | 0.0823 | 0.8408 | 4.3299 |
Spine-related physical therapy RVUs (85% zero)^h | |||||
Two-stage Super Learner | 2.4698 | 1.0000 | 0.2846 | 1.0000 | 0.5388 |
Two-stage Super Learner + Thresholding | 2.4741 | 1.0017 | 0.2834 | 0.9957 | 0.5391 |
Discrete two-stage Learner | 2.5111 | 1.0167 | 0.2734 | 0.9605 | 0.5446 |
Discrete two-stage Learner + Thresholding | 2.5163 | 1.0188 | 0.2719 | 0.9554 | 0.5452 |
S1: Lasso^i + S2: Log-OLS smearing^j | 2.5229 | 1.0215 | 0.2692 | 0.9460 | 0.5445 |
I(P(Y>0)>=0.02) * (S1: Lasso + S2: Log-OLS smearing) | 2.5242 | 1.0220 | 0.2689 | 0.9447 | 0.5450 |
S1: Lasso + S2: GLM-Gamma-Identity | 2.5288 | 1.0239 | 0.2674 | 0.9396 | 0.5473 |
I(P(Y>0)>=0.02) * (S1: Lasso + S2: GLM-Gamma-Identity^k) | 2.5294 | 1.0241 | 0.2673 | 0.9393 | 0.5490 |
S1: Lasso + S2: Adaptive GLM | 2.5304 | 1.0245 | 0.2671 | 0.9387 | 0.5529 |
I(P(Y>0)>=0.02) * (S1: Lasso + S2: Adaptive GLM) | 2.5307 | 1.0247 | 0.2670 | 0.9381 | 0.5483 |
One-stage Super Learner | 2.5337 | 1.0259 | 0.2661 | 0.9350 | 0.5552 |
Spine-related injection RVUs (91% zero)^h | |||||
Two-stage Super Learner + Thresholding | 16.8832 | 1.0000 | 0.1323 | 1.0000 | 1.2578 |
Two-stage Super Learner | 16.9414 | 1.0034 | 0.1293 | 0.9774 | 1.2632 |
Discrete two-stage Learner + Thresholding | 17.1150 | 1.0137 | 0.1204 | 0.9100 | 1.2753 |
Discrete two-stage Learner | 17.1581 | 1.0163 | 0.1182 | 0.8932 | 1.2781 |
I(P(Y>0)>=0.01) * (S1: Lasso + S2: Lasso) | 17.2029 | 1.0189 | 0.1159 | 0.8758 | 1.2853 |
S1: Lasso + S2: Lasso | 17.2074 | 1.0192 | 0.1157 | 0.8741 | 1.2855 |
I(P(Y>0)>=0.05) * (S1: Lasso + S2: Lasso) | 17.2116 | 1.0195 | 0.1154 | 0.8724 | 1.2939 |
I(P(Y>0)>=0.01) * (S1: GLM + S2: GLM-Gamma-Identity) | 17.2284 | 1.0204 | 0.1146 | 0.8659 | 1.2647 |
S1: GLM + S2: GLM-Gamma-Identity | 17.2304 | 1.0206 | 0.1145 | 0.8651 | 1.2673 |
I(P(Y>0)>=0.01) * (S1: GLM + S2: Log-OLS smearing) | 17.2361 | 1.0209 | 0.1142 | 0.8629 | 1.2627 |
One-stage Super Learner | 17.2599 | 1.0223 | 0.1130 | 0.8537 | 1.3134 |
^a Algorithms are presented in ascending order based on MSE.
^b Lower MSE and relative MSE indicate better performance.
^c Greater R2 and RE indicate better performance.
^d RF refers to Random Forest.
^e S1 refers to stage-1 algorithms.
^f S2 refers to stage-2 algorithms.
^g GLM in S1 refers to logistic regression.
^h For spine-related physical therapy RVUs and spine-related injection RVUs, the one-stage super learner is not among the top 10 algorithms, but we still list its results for comparison.
^i Lasso in S1 refers to logistic Lasso regression.
^j Log-OLS smearing refers to logarithmic OLS with smearing retransformation.
^k GLM-Gamma-Identity refers to GLM with Gamma distribution and Identity link function.
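As a worked check on the Relative MSE column, the sketch below recomputes it as each algorithm's MSE divided by the MSE of the best-performing algorithm within the same outcome block, using values copied from the spine-related RVUs rows of Table 3. Treating the column as this simple ratio is an assumption, although it reproduces the tabulated values to four decimals. The helper name is hypothetical.

```python
from typing import Dict


def relative_mse(mse_by_algorithm: Dict[str, float], reference: str) -> Dict[str, float]:
    """Divide every MSE by the MSE of the reference algorithm."""
    ref_mse = mse_by_algorithm[reference]
    return {name: mse / ref_mse for name, mse in mse_by_algorithm.items()}


# MSE values copied from the spine-related RVUs (5% zero) block of Table 3.
spine_rvu_mse = {
    "Two-stage Super Learner + Thresholding": 6253.45,
    "Two-stage Super Learner": 6261.24,
    "One-stage Super Learner": 6298.74,
    "RF": 6327.06,
}

rel = relative_mse(spine_rvu_mse, reference="Two-stage Super Learner + Thresholding")
for name, value in rel.items():
    print(f"{name}: {value:.4f}")
```

A relative MSE above 1 means the algorithm has a higher MSE than the reference, so values closer to 1 indicate performance closer to the best algorithm.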
4. Discussion
The novel two-stage super learner algorithm provided robust performance across a wide range of situations using both simulated and real data. We also found that the one-stage super learner performed well, and in some cases when zero inflation was small, even better than the two-stage super learner. We found that some individual algorithms had a good performance in certain scenarios, but poor performance in others. Since an analyst is unlikely to know, a priori, what the best performing single algorithm will be, we believe that super learning provides a useful general approach for estimating healthcare expenditures. In other words, super learning eliminates the need for a priori knowledge of which single algorithm might perform best in a given application; we found that it performed well in each scenario that we studied.
Our results are consistent with and extend prior research regarding the prediction of healthcare expenditures. First, we found that many commonly used methods do perform well in particular situations. However, we did not find a single algorithm that performed well in all situations. A major contribution of this study is that the novel two-stage super learning framework worked well across a wide variety of simulated and real-world datasets. Next, we found that the one-stage super learner also generally performed well. Although the one-stage super learner is commonly used for other healthcare outcomes (Rose 2013; Rose 2016; Bergquist et al. 2017; Ju et al. 2019), its use in predicting healthcare expenditures has been relatively limited. We found that it often performs well and may deserve wider use. Finally, the findings of this study help distinguish situations when one may want to use one-stage versus two-stage super learner approaches. When zero-inflation is low, the one-stage super learner may be a reasonable strategy. However, in the presence of higher zero-inflation, using a two-stage super learner is likely to lead to better performance.
Our study suggests several directions for future work. First, we have focused on optimizing a super learner with respect to mean squared error. We believe this is an appropriate scientific goal in many situations and in particular, in situations where super learning is used as an intermediate step in drawing inference about the average causal effect. However, for some pure prediction tasks, mean squared error may not be the most appealing measure to gauge predictions since it is very sensitive to outliers. In these settings, it may be interesting to consider super learners constructed using alternative loss functions that are more robust to large costs. Second, our approach is focused on the prediction of the overall cost. In some circumstances, the scientific question may motivate us to simultaneously consider the prediction of having any cost (i.e., the stage-1 model itself is interesting) as well as the cost itself if some costs are present. For such problems, an alternative approach to super learning may be more useful, wherein one develops a super learner for each component of the model. In such an approach, each of the two super learners would be optimized towards a particular goal, as opposed to here where a single super learner is fit that is optimized towards the overall goal of predicting costs.
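The sensitivity of squared-error loss to large costs can be seen in a small numerical sketch. For a constant prediction, squared error is minimized by the sample mean and absolute error by the sample median, so a single very large cost moves the MSE-optimal prediction far more than the MAE-optimal one. The cost values below are hypothetical and chosen only for illustration. Absolute error is used here as one example of a more robust loss, not as the loss the authors propose.

```python
from statistics import mean, median
from typing import Sequence


def mse(y: Sequence[float], constant: float) -> float:
    """Mean squared error of a constant prediction."""
    return sum((v - constant) ** 2 for v in y) / len(y)


def mae(y: Sequence[float], constant: float) -> float:
    """Mean absolute error of a constant prediction."""
    return sum(abs(v - constant) for v in y) / len(y)


# Hypothetical zero-inflated, right-skewed costs with one very large value.
costs = [0.0, 0.0, 100.0, 200.0, 50_000.0]
costs_without_outlier = costs[:-1]

# Best constant predictions under each loss.
mse_optimal = mean(costs)      # pulled strongly toward the large cost
mae_optimal = median(costs)    # stays near the bulk of the data

# How far each optimum moves when the single large cost is dropped.
shift_mse_optimal = mse_optimal - mean(costs_without_outlier)
shift_mae_optimal = mae_optimal - median(costs_without_outlier)

print(mse_optimal, mae_optimal, shift_mse_optimal, shift_mae_optimal)
```

In this toy sample the MSE-optimal constant shifts by thousands of cost units when the largest observation is removed, whereas the MAE-optimal constant shifts by only 50.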
The findings of this study should be considered in light of several limitations. First, as with all simulation studies, we could not consider all possible data generation scenarios. While we picked a robust set of scenarios that likely reflect many real-life situations, we do not know whether the results we observed generalize to data generation processes that we did not examine. Similarly, we analyzed only two empirical data sets, and the results may not generalize to other settings. Furthermore, we did not consider high-dimensional covariate sets in modeling healthcare costs, although such settings are common in health services and outcomes research. In future work, we hope to examine more diverse scenarios and utilize high-dimensional covariates in cost prediction to further inform practitioners in the application of our methods. An additional limitation is that we focused on super learning for the sake of prediction. A natural extension would be to study the potential benefits of two-stage super learning in the context of estimating the causal effect of an intervention on a cost outcome.
5. Conclusions
In this study, we introduce a novel two-stage super learner approach that is well suited to many of the common problems encountered when analyzing healthcare expenditure data. Further, we provide a more in-depth study of the application of the one-stage super learner to the healthcare expenditure context. We find that the two-stage super learner provides robust performance across a wide variety of data generating processes and that the one-stage super learner also performs well, especially when there is little zero inflation. Given the notorious challenges that healthcare expenditure data present to the analyst, we think that super learner approaches will be a useful tool to advance health services research.
Acknowledgments
This study was funded under the award number 1R01HL137808 from the National Heart, Lung, and Blood Institute of the United States National Institutes of Health. Research reported in this work was also supported by the University of Washington Clinical Learning, Evidence And Research (CLEAR) Center for Musculoskeletal Disorders Administrative, Methodologic and Resource Cores and NIAMS/NIH P30AR072572. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Funding
This study was funded under the award number 1R01HL137808 from the National Heart, Lung, and Blood Institute of the United States National Institutes of Health. Research reported in this work was also supported by the University of Washington Clinical Learning, Evidence And Research (CLEAR) Center for Musculoskeletal Research. The CLEAR Center is supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) of the National Institutes of Health under Award Number P30AR072572.
Footnotes
Conflicts of interest/Competing interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Code availability
The code used for generating data in simulations and training the two-stage super learner, one-stage super learner and individual algorithms can be found at https://github.com/wuziyueemory/Two-stage-SuperLearner. Introductions to simulations and real-data analysis (MEPS & BOLD), as well as detailed performance evaluation of all algorithms for MEPS & BOLD are also provided on the above GitHub page.
Contributor Information
Ziyue Wu, Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA.
Seth A. Berkowitz, Division of General Medicine and Clinical Epidemiology, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Patrick J. Heagerty, Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA, USA.
David Benkeser, Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA.
Availability of data and material
The 2016–2017 MEPS data that support the findings of this study are openly available in [MEPS HC-202: MEPS Panel 21 Longitudinal Data File] at https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-202. The BOLD data that support the findings of this study are openly available in [BOLD data analysis] at https://github.com/wuziyueemory/Two-stage-SuperLearner/blob/master/BOLD%20data%20analysis/bold%20data.csv.
References
- Basu A, Manning WG. Issues for the next generation of health care cost analyses. Med Care. 2009 Jul;47(7 Suppl 1):S109–14. 10.1097/MLR.0b013e31819c94a1.
- Basu A, Rathouz PJ. Estimating marginal and incremental effects on health outcomes using flexible link and variance function models. Biostatistics. 2005 Jan;6(1):93–109. 10.1093/biostatistics/kxh020.
- Benkeser D, Cai W, van der Laan MJ. Rejoinder: A Nonparametric Superefficient Estimator of the Average Treatment Effect. Stat Sci. 2020 Aug;35(3):511–517. 10.1214/20-STS789.
- Benkeser D, Petersen M, van der Laan MJ. Improved small-sample estimation of nonlinear cross-validated prediction metrics. J Am Stat Assoc. 2020;115(532):1917–1932. 10.1080/01621459.2019.1668794.
- Bergquist SL, Brooks GA, Keating NL, Landrum MB, Rose S. Classifying Lung Cancer Severity with Ensemble Machine Learning in Health Care Claims Data. Proc Mach Learn Res. 2017 Aug;68:25–38.
- Berk ML, Monheit AC. The concentration of health care expenditures, revisited. Health Aff (Millwood). 2001 Mar-Apr;20(2):9–18. 10.1377/hlthaff.20.2.9.
- Blough DK, Madden CW, Hornbrook MC. Modeling risk using generalized linear models. J Health Econ. 1999 Apr;18(2):153–71. 10.1016/s0167-6296(98)00032-0.
- Breiman L. Stacked regressions. Mach Learn. 1996 Jul;24:49–64. 10.1007/BF00117832.
- Cawley J, Meyerhoefer C. The medical care costs of obesity: an instrumental variables approach. J Health Econ. 2012 Jan;31(1):219–30. 10.1016/j.jhealeco.2011.10.003.
- Cohen JW, Cohen SB, Banthin JS. The medical expenditure panel survey: a national information resource to support healthcare cost research and inform policy and practice. Med Care. 2009 Jul;47(7 Suppl 1):S44–50. 10.1097/MLR.0b013e3181a23e3a.
- Cohen SB. Design strategies and innovations in the medical expenditure panel survey. Med Care. 2003 Jul;41(7 Suppl):III5–III12. 10.1097/01.MLR.0000076048.11549.71.
- Cox DR. Regression Models and Life-Tables. J R Stat Soc Series B Stat Methodol. 1972;34:187–202. 10.1111/j.2517-6161.1972.tb00899.x.
- Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-9. 2021. https://CRAN.R-project.org/package=e1071
- Deb P, Norton EC. Modeling Health Care Expenditures and Use. Annu Rev Public Health. 2018 Apr 1;39:489–505. 10.1146/annurev-publhealth-040617-013517.
- Duan N. Smearing Estimate: A Nonparametric Retransformation Method. J Am Stat Assoc. 1983;78(383):605–610. 10.1080/01621459.1983.10478017.
- Finkelstein EA, Trogdon JG, Cohen JW, Dietz W. Annual medical spending attributable to obesity: payer-and service-specific estimates. Health Aff (Millwood). 2009 Sep-Oct;28(5):w822–31. 10.1377/hlthaff.28.5.w822.
- Gilleskie DB, Mroz TA. A flexible approach for estimating the effects of covariates on health expenditures. J Health Econ. 2004 Mar;23(2):391–418. 10.1016/j.jhealeco.2003.09.008.
- Glass KP, Anderson JR. Relative value units: from A to Z (Part I of IV). J Med Pract Manage. 2002 Mar-Apr;17(5):225–8.
- Gregori D, Petrinco M, Bo S, Desideri A, Merletti F, Pagano E. Regression models for analyzing costs and their determinants in health care: an introductory review. Int J Qual Health Care. 2011 Jun;23(3):331–41. 10.1093/intqhc/mzr010.
- James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: with Applications in R. Springer; 2013.
- Jarvik JG, Comstock BA, Bresnahan BW, Nedeljkovic SS, Nerenz DR, Bauer Z, Avins AL, James K, Turner JA, Heagerty P, Kessler L, Friedly JL, Sullivan SD, Deyo RA. Study protocol: the Back Pain Outcomes using Longitudinal Data (BOLD) registry. BMC Musculoskelet Disord. 2012 May 3;13:64. 10.1186/1471-2474-13-64.
- Jones AM. Models for Health Care. HEDG, c/o Department of Economics, University of York, Health, Econometrics and Data Group (HEDG) Working Papers. 2010. 10.1093/oxfordhb/9780195398649.013.0024.
- Ju C, Combs M, Lendle SD, Franklin JM, Wyss R, Schneeweiss S, van der Laan MJ. Propensity score prediction for electronic healthcare databases using Super Learner and High-dimensional Propensity Score Methods. J Appl Stat. 2019;46(12):2216–2236. 10.1080/02664763.2019.1582614.
- Kessler RC, Rose S, Koenen KC, Karam EG, Stang PE, Stein DJ, Heeringa SG, Hill ED, Liberzon I, McLaughlin KA, McLean SA, Pennell BE, Petukhova M, Rosellini AJ, Ruscio AM, Shahly V, Shalev AY, Silove D, Zaslavsky AM, Angermeyer MC, Bromet EJ, de Almeida JM, de Girolamo G, de Jonge P, Demyttenaere K, Florescu SE, Gureje O, Haro JM, Hinkov H, Kawakami N, Kovess-Masfety V, Lee S, Medina-Mora ME, Murphy SD, Navarro-Mateu F, Piazza M, Posada-Villa J, Scott K, Torres Y, Carmen Viana M. How well can post-traumatic stress disorder be predicted from pre-trauma risk factors? An exploratory study in the WHO World Mental Health Surveys. World Psychiatry. 2014 Oct;13(3):265–74. 10.1002/wps.20150.
- Lê Cook B, McGuire TG, Lock K, Zaslavsky AM. Comparing methods of racial and ethnic disparities measurement across different settings of mental health care. Health Serv Res. 2010 Jun;45(3):825–47. 10.1111/j.1475-6773.2010.01100.x.
- LeBlanc M, Tibshirani R. Combining Estimates in Regression and Classification. J Am Stat Assoc. 1996 Dec;91(436):1641–50. 10.2307/2291591.
- Manning WG, Basu A, Mullahy J. Generalized modeling approaches to risk adjustment of skewed outcomes data. J Health Econ. 2005 May;24(3):465–88. 10.1016/j.jhealeco.2004.09.011.
- Manning WG, Mullahy J. Estimating log models: to transform or not to transform? J Health Econ. 2001 Jul;20(4):461–94. 10.1016/s0167-6296(01)00086-8.
- Morid MA, Kawamoto K, Ault T, Dorius J, Abdelrahman S. Supervised Learning Methods for Predicting Healthcare Costs: Systematic Literature Review and Empirical Evaluation. AMIA Annu Symp Proc. 2018 Apr 16;2017:1312–1321.
- Mullahy J. Much ado about two: reconsidering retransformation and the two-part model in health econometrics. J Health Econ. 1998 Jun;17(3):247–81. 10.1016/s0167-6296(98)00030-7.
- Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, van der Laan MJ. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. Lancet Respir Med. 2015 Jan;3(1):42–52. 10.1016/S2213-2600(14)70239-5.
- Rose S. A Machine Learning Framework for Plan Payment Risk Adjustment. Health Serv Res. 2016 Dec;51(6):2358–2374. 10.1111/1475-6773.12464.
- Rose S. Mortality risk score prediction in an elderly population using machine learning. Am J Epidemiol. 2013 Mar 1;177(5):443–52. 10.1093/aje/kws241.
- Shrestha A, Bergquist S, Montz E, Rose S. Mental Health Risk Adjustment with Clinical Categories and Machine Learning. Health Serv Res. 2018 Aug;53 Suppl 1(Suppl Suppl 1):3189–3206. 10.1111/1475-6773.12818.
- Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, Chen K, Mitchell R, Cano I, Zhou T, Li M, Xie J, Lin M, Geng Y, Li Y. xgboost: Extreme Gradient Boosting. R package version 1.5.0.2. 2021. https://CRAN.R-project.org/package=xgboost
- van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Stat Decis. 2006 Dec;24(3):373–395. 10.1524/stnd.2006.24.3.373.
- van der Laan MJ, Dudoit S. Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 130. November 2003. https://biostats.bepress.com/ucbbiostat/paper130.
- van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article25. 10.2202/1544-6115.1309.
- Venables WN, Ripley BD. Modern Applied Statistics with S. Fourth Edition. Springer, New York; 2002. ISBN 0-387-95457-0.
- Wang HJ, Zhou XH. Estimation of the retransformed conditional mean in health care cost studies. Biometrika. 2010 Mar;97(1):147–58. 10.1093/biomet/asp072.
- Wolpert DH. Stacked generalization. Neural Netw. 1992;5(2):241–259. 10.1016/S0893-6080(05)80023-1.
- Yu K, Lu Z, Stander J. Quantile Regression: Applications and Current Research Areas. J R Stat Soc Series D Stat. 2003;52(3):331–350. 10.1111/1467-9884.00363.
- Zink A, Rose S. Fair regression for health care spending. Biometrics. 2020 Sep;76(3):973–982. 10.1111/biom.13206.