Abstract
An important informatics tool for controlling healthcare costs is accurately predicting the likely future healthcare costs of individuals. To address this important need, we conducted a systematic literature review and identified five methods for predicting healthcare costs. To enable a direct comparison of these different approaches, we empirically evaluated the predictive performance of each reported approach, as well as other state-of-the-art supervised learning methods, using data from University of Utah Health Plans for October 2013 through October 2016. The data set consisted of approximately 90,000 individuals, 6.3 million medical claims and 1.2 million pharmacy claims. In this comparative analysis, gradient boosting had the best predictive performance overall and for low to medium cost individuals. For high cost individuals, Artificial Neural Network (ANN) and the Ridge regression model, which have not been previously reported for use in healthcare cost prediction, had the highest performance.
Introduction
The United States’ national health expenditure (NHE) grew 5.8% to $3.2 trillion in 2015 (i.e., $9,990 per person), which accounted for 17.8% of the nation’s gross domestic product (GDP)1. In seeking to control these unsustainable increases in healthcare costs, it is imperative that healthcare organizations can predict the likely future costs of individuals, so that care management resources can be efficiently targeted to those individuals at highest risk of incurring significant costs2. Key stakeholders in these efforts to manage healthcare costs include health insurers, employers, society, and increasingly healthcare delivery organizations due to the transition from fee-for-service payment models to value-based payment models3. For any given individual, insurers generally have the most comprehensive information on healthcare costs as they pay for care delivered across various healthcare delivery organizations.
Predicting healthcare costs for individuals using accurate prediction models is important for various stakeholders beyond health insurers, and for various purposes4. For health insurers and increasingly healthcare delivery systems, accurate forecasts of likely costs can help with general business planning in addition to prioritizing the allocation of scarce care management resources. Moreover, for patients, knowing in advance their likely expenditures for the next year could potentially allow them to choose insurance plans with appropriate deductibles and premiums.
Despite the importance of healthcare cost prediction, to our knowledge there has been no review of the literature on this important topic. Therefore, we conducted a systematic literature review. Moreover, in order to enable a direct comparison of approaches on a common data set, we evaluated each of the identified approaches on a health insurance data set from the University of Utah Health Plans. We also evaluated additional state-of-the-art methods not previously evaluated in the literature.
Methods
Literature Review
Adapting a search strategy from a previous systematic review5, we searched Google Scholar and MEDLINE. The latest search was performed on February 21, 2017. We used a combination of the following search terms: healthcare cost prediction; medical claim cost; pharmacy claim cost; healthcare expenditure prediction; healthcare risk score prediction; and patient cost prediction.
In conducting the systematic literature review, we sought to answer the following questions. Because the answer to the first question identified that using features of prior costs to predict future costs performed as well as or better than approaches that also used clinical data for cost prediction purposes, all subsequent questions were focused on approaches that used prior cost features to predict future costs (referred to henceforth as “cost on cost prediction”).
1. What are the types of healthcare cost prediction approaches reported in the literature?
2. What are the input features that have been used for cost on cost prediction?
3. What are the supervised learning methods that have been used for cost on cost prediction?
4. What are the performance measures and evaluation results for cost on cost prediction?
Direct Comparison of Alternative Cost Prediction Methods using a Health Insurer Data Set
Approach. This study was approved by the University of Utah Institutional Review Board (Protocol # 00094358). We used a health insurance data set to directly compare the performance of cost on cost prediction approaches identified in the literature, as well as other state-of-the-art supervised learning techniques.
Data. Our data set consisted of 6.3 million medical claims and 1.2 million pharmacy claims from approximately 91,000 distinct individuals covered by University of Utah Health Plans from October 2013 to October 2016. Available data included demographic information (e.g., age, gender), clinical encounter information (e.g., place and date of service, provider information), diagnosis and procedure codes, pharmacy dispense information, and cost information (e.g., paid, allowed and billed amount). The data were filtered to individuals with insurance membership for the entire three-year period, which resulted in approximately 3.8 million medical claims and 780,000 pharmacy claims from 24,000 patients.
The data set was divided into two time periods: an observation period and a result period. The former time period was from October 2013 to September 2015 (i.e., two years), which was used to predict individuals’ cost in the result period ranging from October 2015 to October 2016 (i.e., one year). Table 1 shows all input features used in this study. All features used in this study were cost related features extracted from Bertsimas et al.6, which had the largest and most complete set of cost related features among the reviewed manuscripts. If a member did not have any cost for a specific month it was considered as zero; therefore, there are no missing values in this dataset.
Table 1.
Features used to develop the prediction models.
| Feature | Description | Number of features |
|---|---|---|
| Overall_costs | The sum of medical and pharmacy costs | 1 |
| Overall_medical_costs | - | 1 |
| Overall_pharmacy_costs | - | 1 |
| Six_costs | Overall cost in the last 6 months of the observation period | 1 |
| Three_costs | Overall cost in the last 3 months of the observation period | 1 |
| Trend | Found by fitting a line and extracting the slope through the last monthly costs of the observation period | 1 |
| Acute | An indicator variable found by comparing the highest month with the average monthly cost. If these are significantly different, the indicator takes on the value 1. The idea is that there is a high chance that constantly high cost individuals repeat their cost in the future, while individuals who have had temporarily high cost have a lower chance. | 1 |
| Highest_cost | The cost of the highest month in the observation period | 1 |
| Num_above_average | This variable is calculated as the number of months above average and is an indicator of the shape of the cost profile. If the cost is relatively constant over the period, this variable takes on a value around six, which is an indicator for a chronic cost profile. | 1 |
| Monthly_costs [] | Monthly costs of the last twelve months of the observation period (see Data section) | 12 |
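As an illustration of how the features in Table 1 might be derived, the following minimal sketch computes them from two lists of monthly costs. It is not the authors' implementation. The function name `cost_features`, the `acute_ratio` cut-off for the Acute indicator, and the use of the last twelve months as the window for Trend, Acute and Num_above_average are all assumptions made for the example; the table does not state a significance threshold or a window for these features.

```python
from statistics import mean

def cost_features(medical, pharmacy, acute_ratio=2.0):
    """Derive Table 1-style features from monthly costs (oldest month first).

    medical and pharmacy are equal-length lists covering the observation
    period. acute_ratio is a hypothetical cut-off, not a value from the paper.
    """
    monthly = [m + p for m, p in zip(medical, pharmacy)]
    last12 = monthly[-12:]  # assumed window for the shape features
    avg = mean(last12)

    # Least-squares slope of cost against month index 0..n-1.
    n = len(last12)
    t_bar = (n - 1) / 2
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    slope = sum((t - t_bar) * (c - avg) for t, c in enumerate(last12)) / sxx

    return {
        "Overall_costs": sum(monthly),
        "Overall_medical_costs": sum(medical),
        "Overall_pharmacy_costs": sum(pharmacy),
        "Six_costs": sum(monthly[-6:]),
        "Three_costs": sum(monthly[-3:]),
        "Trend": slope,
        # 1 when the highest month is far above the average month.
        "Acute": int(max(last12) > acute_ratio * avg),
        "Highest_cost": max(monthly),
        "Num_above_average": sum(c > avg for c in last12),
        "Monthly_costs": last12,
    }

# Example: a member whose cost rises by 10 per month for 24 months.
features = cost_features([10.0 * (i + 1) for i in range(24)], [0.0] * 24)
```

For this steadily rising profile the slope is 10 per month and six of the last twelve months lie above the average, which matches the table's remark that a relatively constant profile yields a value around six.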
The range of paid amounts in the result period showed that 80% of the overall cost of the population came from only 15% of the members. Therefore, aligned with the literature on cost bucketing, to reduce the effects of extremely expensive members, the data set was partitioned into five different cost buckets. This partitioning was done so that the sum of members’ costs in each bucket was approximately the same in the observation period (i.e., the total dollar amount in each bucket was the same). For instance, 84% of members are in bucket 1 with the same total cost amount as the members in bucket 5, which contains about 2% of the population.
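One way to build buckets with approximately equal total cost is sketched below. It is a minimal illustration under stated assumptions, not the procedure used in the study: the function name `equal_cost_buckets` is hypothetical, members are sorted by cost, and a member is assigned to a bucket according to the share of total cost accumulated up to and including that member.

```python
def equal_cost_buckets(costs, n_buckets=5):
    """Assign each member a bucket 1..n_buckets so that each bucket holds
    roughly the same total cost; bucket 1 holds the cheapest members."""
    total = sum(costs)
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    buckets = [0] * len(costs)
    running = 0.0
    for i in order:
        running += costs[i]
        # Share of total cost accumulated up to and including this member.
        share = running / total if total > 0 else 0.0
        # The small offset keeps a member that lands exactly on a boundary
        # in the lower bucket.
        buckets[i] = min(n_buckets, int(share * n_buckets - 1e-9) + 1)
    return buckets

# Example: 16 members whose costs sum to 40, so each bucket should hold 8.
example_costs = [1.0] * 8 + [2.0] * 4 + [4.0] * 2 + [8.0] * 2
example_buckets = equal_cost_buckets(example_costs)
```

In this toy example half of the members fall in bucket 1 while buckets 4 and 5 hold a single member each, mirroring the skew described above. Members with identical costs can land in different buckets under this scheme; a production version would more likely convert the boundaries into cost thresholds.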
Classifier. Classifiers evaluated included Linear Regression, Lasso7, Ridge8, Elastic Net9, CART10, M511, Random Forest12, Bagging13, Gradient Boosting13, SVM14, and ANN15. Except for CART, the other classifiers had not been previously evaluated for cost on cost prediction. All models were tuned on 30 percent of the data set to find their best parameter settings. The following parameters were varied: the number of hidden layers, number of nodes in each layer, learning rate, and momentum for the Neural Network; the kernel type and the corresponding parameters of each kernel type for the Support Vector Machine; the minimum split and minimum number of samples in each leaf for M5 and CART; the learning rate and loss function for Gradient Boosting; and alpha for Lasso, Ridge and Elastic Net.
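The split between tuning data and evaluation data can be sketched as follows. This is a minimal illustration, assuming that the 30 percent used for tuning is disjoint from the 70 percent used for the 20-fold cross validation reported in this study; the function names and the fixed random seed are hypothetical.

```python
import random

def tuning_evaluation_split(member_ids, tuning_fraction=0.3, seed=0):
    """Shuffle members and split them into a tuning set and an evaluation set."""
    ids = list(member_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * tuning_fraction)
    return ids[:cut], ids[cut:]

def k_folds(ids, k):
    """Yield (train_ids, test_ids) pairs for k-fold cross validation."""
    folds = [ids[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test

# Example with 100 hypothetical member identifiers.
tuning, evaluation = tuning_evaluation_split(range(100))
folds = list(k_folds(evaluation, k=20))
```

Keeping the tuning members out of the evaluation folds avoids reporting performance on data that influenced the parameter choice.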
A brief description of all the models used in this study (except linear regression) is provided below.
Lasso: This is a linear regression model enhanced with variable selection and regularization, which is given by the L1-norm (the loss function is the linear least squares error)7.
Ridge: This is a linear regression model where the regularization is given by the L2-norm (the loss function is the linear least squares error). The L2-norm penalty shrinks coefficients toward zero but, unlike the L1-norm penalty, rarely sets them exactly to zero, so the fitted model has non-sparse coefficients: many small coefficients and few large ones8.
Elastic Net: This is a linear regression model that linearly combines the L1-norm and L2-norm penalties of the Lasso and Ridge models9.
CART: This is a regression decision tree, where on each node the algorithm chooses the split that minimizes the sum of squared errors for regression of the node. The important quality is that the algorithm uses the sample mean of the instances in each node for regression10.
M5: Like CART, this algorithm builds a regression tree, but it uses a linear regression model, rather than the sample mean, for building the model and for calculating the sum of errors11.
Random Forest: This is an ensemble learning algorithm that fits a number of regression decision trees on several subsamples of the data. The mean value of the outcomes of the regression tree is generated as the final prediction of the algorithm12.
Support Vector Machine: This is a support vector regression model implemented based on libsvm14 which uses kernels to find the regression lines.
Bagging: This is an ensemble learning algorithm that fits each base regression model on random subsets of the data that are generated by a bootstrapping sample method. Aggregation of the individual predictors is performed by averaging to form the final prediction13.
Gradient Boosting: This is an ensemble learning algorithm, where the final model is an ensemble of weak regression decision tree models, which are built in a forward stage-wise fashion. The most important attribute of the algorithm is that it ensembles the models by allowing optimization of an arbitrary loss function. In other words, each regression tree is fitted on the negative gradient of the given loss function, which is set to the least absolute deviation13.
Artificial Neural Network (ANN): This is a large collection of processing units (i.e., neurons), where each unit is connected with many others. Neural networks typically consist of multiple layers and the goal is to solve problems in the same way that the human brain would15.
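To make the difference between the L1-norm and L2-norm penalties described above concrete, the sketch below solves the three penalized least squares problems in closed form for a single feature without an intercept. The objective scaling is an assumption chosen for simplicity (half the squared error plus `alpha` times the penalty, with `l1_ratio` mixing the two penalties for Elastic Net) and may differ from the convention of any particular library; the function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def one_feature_fits(x, y, alpha, l1_ratio=0.5):
    """Closed-form coefficients for y ~ beta * x under each penalty.

    Assumed objectives (no intercept):
      ridge:       0.5 * sum((y - beta*x)^2) + 0.5 * alpha * beta^2
      lasso:       0.5 * sum((y - beta*x)^2) + alpha * |beta|
      elastic net: 0.5 * sum((y - beta*x)^2)
                   + alpha * (l1_ratio * |beta| + 0.5 * (1 - l1_ratio) * beta^2)
    """
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return {
        "ols": sxy / sxx,
        "ridge": sxy / (sxx + alpha),
        "lasso": soft_threshold(sxy, alpha) / sxx,
        "elastic_net": soft_threshold(sxy, alpha * l1_ratio)
        / (sxx + alpha * (1 - l1_ratio)),
    }

# Example: a weak predictor. Lasso removes it; Ridge only shrinks it.
fits = one_feature_fits([1.0, -1.0, 2.0, -2.0], [0.3, -0.1, 0.2, -0.4], alpha=2.0)
```

With this weak predictor the Lasso coefficient is exactly zero, while the Ridge coefficient is smaller than the unpenalized one but still non-zero, which is the sparse versus non-sparse behaviour the two penalties are known for.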
20-fold cross validation was employed as the evaluation method on 70% of the data set. To assess statistical significance, we first applied Friedman’s test to verify differences among multiple classifiers. If the result was significant at an alpha level of 0.05, pairwise comparisons were made with the Wilcoxon Signed-Rank test. This statistical approach was aligned with the method recommended by Demšar16.
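The first step of this procedure can be sketched as below: each model is ranked within every cross validation fold, and the Friedman statistic is computed from the mean ranks. This is a minimal sketch rather than the analysis code used in the study; the function names and the toy error values are hypothetical, no tie correction is applied, and the p-value and the Wilcoxon follow-up are omitted. In practice a statistics library such as SciPy (`scipy.stats.friedmanchisquare`, `scipy.stats.wilcoxon`) would likely be used.

```python
def average_ranks(row):
    """Rank the values in one row (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for pos in range(i, j + 1):
            ranks[order[pos]] = tied_rank
        i = j + 1
    return ranks

def friedman_statistic(errors):
    """errors[f][m] is the error of model m on fold f (lower is better).

    Returns the Friedman chi-square statistic and the mean rank per model.
    """
    n, k = len(errors), len(errors[0])
    rank_rows = [average_ranks(row) for row in errors]
    mean_ranks = [sum(r[m] for r in rank_rows) / n for m in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, mean_ranks

# Toy example: three models on four folds, always ranked in the same order.
toy_errors = [
    [0.60, 0.70, 0.90],
    [0.62, 0.68, 0.95],
    [0.61, 0.69, 0.88],
    [0.63, 0.71, 0.93],
]
chi2, mean_ranks = friedman_statistic(toy_errors)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom. For three models the 0.05 critical value is 5.991, so the toy example, whose statistic is 8, would proceed to the pairwise comparisons.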
Results
Literature Review
1. What are the types of healthcare cost prediction approaches reported in the literature?
There are three kinds of methods that have been reported for cost prediction: rule-based, statistical and supervised learning. The disadvantage of the rule-based methods (e.g., Kronick et al.17) is that they require a lot of domain knowledge, which is not easily available and is often expensive18. Although statistical models, mainly multiple regression models, are powerful tools for capturing the relationships between the predictors and the dependent variable, they have two important challenges18. One is that working with several independent variables often leads to multicollinearity, that is, significant correlations among predictors. Moreover, their performance is challenged by the skewed nature of healthcare data, where cost data typically feature a spike at zero, distributions are strongly skewed with a heavy right-hand tail19, and extreme values can be present, all of which make them inefficient in small to medium sample sizes if the underlying distribution is not normal. Although several advanced statistical methods have been proposed to accommodate the skewness observed in healthcare data, this type of prediction method is not able to outperform supervised learning methods20. Therefore, this paper is devoted to the use of supervised learning methods for cost prediction, and the remainder of the literature review excludes other types of prediction methods.
There are generally three types of literature that use supervised learning for cost prediction. In the first type, the goal is to predict cost using medical predictors. In this type of literature, the main goal is to show the effect of medical factors such as chronic disease score on cost prediction21. In the second type of literature (which is limited), cost predictors with or without medical predictors are used to predict cost. In the last type of literature, researchers bucket individuals’ costs and predict an individual’s cost bucket rather than his or her actual costs. This last type of research applies nominal predictive models rather than numerical predictive models.
Cost prediction using non-cost predictors. Lee et al.22 provided one of the earliest works on predicting cost by using non-cost predictors. They selected a small sample of 492 patients from a hospital in Korea and compared the performance of ANN and a classification and regression tree for cost prediction. Demographic information, diagnosis codes, number of laboratory tests, the number of admissions and number of operations were the predictors of their analysis. The results showed the superiority of ANN.
Powers et al.23 evaluated several regression statistical modeling approaches for predicting prospective total annual health costs (medical plus pharmacy) of health plan participants using Pharmacy Health Dimensions (PHD), a pharmacy claims-based risk index. Their models included ordinary least squares (OLS) regression, log-transformed OLS regression with smearing estimator, and 3 two-part models using OLS regression, log-OLS regression with smearing estimator, and generalized linear modeling (GLM), respectively. The results showed that most PHD drug categories were significant independent predictors of total costs. The OLS model had the lowest mean absolute prediction error and highest R2. The main conclusion was that the PHD system derived solely from pharmacy claims data can be used to predict future total health costs.
Analyzing the impact of multimorbidity (i.e., co-occurrence of more than three chronic disease conditions) on health care costs, König et al.21 interviewed 1,050 randomly selected primary care patients aged 65 to 85 years suffering from multimorbidity in Germany. A conditional inference tree algorithm was used as the classifier. The results showed that Parkinson’s disease and cardiac insufficiency were the most influential predictors for total costs, and that the high total costs of Parkinson’s disease were largely due to costs of nursing care.
Cost bucket prediction. Lahiri et al.4 predicted the rise in patient care costs as a binary classification problem. They used a data set with more than 114,000 patients for a span of three years (2008-2010) to investigate which patients experienced increases in inpatient expenditures between 2008 and 2009. Using stacked generalization, they ensembled six classification algorithms: gradient boosting machine, conditional inference tree, neural networks, SVM, logistic regression and Naive Bayes. This achieved 80% recall, 78% accuracy and 76% precision. One of the contributions of the paper was that they initially had 12,400 features, most of them arising out of diagnosed conditions and drugs taken, and selected 44 of them according to their information gain. This helped the authors to identify major factors that were crucial in determining whether an individual was going to incur higher healthcare expenditure going forward. In a similar study, Guo et al.24 tried to predict patients’ transition from one cost bucket to another bucket in the following year. To do so, they applied multiple methods (each for a single type of transition) to improve the prediction performance. The results showed that they could improve the performance by 21% compared with baselines. Moreover, they found that the proposed method can help health care entities achieve efficient resource allocation while improving care quality. Reviewing all papers in this category, we found no studies on categorical cost prediction that used cost-based features as the input.
Cost prediction using cost predictors (cost on cost prediction). Bertsimas et al.6 provided one of the first evaluations in the area of health cost prediction using supervised learning techniques. They used a combination of medical, demographic and cost related features from August 2004 to July 2006 as the input and applied regression decision tree and clustering to predict total patient costs in 2007, as measured by insurance payments including medical and pharmacy payments. The results showed that utilizing just 22 cost related features as input and a CART regression decision tree as the classifier gave almost the same performance as adding the medical and demographic information (total of around 1500 features) or applying clustering techniques. Performance was reported in terms of Mean Absolute Error, Hit Ratio, R2 and a penalty based evaluation designed by the authors. Bucketing was also used to evaluate the prediction results to assess the accuracy. This evaluation showed that while the method is strong at predicting low cost buckets, it had a weak performance on higher cost buckets.
Following the above study, Sushmita et al.18 evaluated the use of a regression tree, M5 model tree and random forest for cost prediction and showed that M5 had the best performance. The results also confirmed that prior healthcare costs alone can serve as a good indicator for future healthcare costs. To predict patients’ cost for the next year, they used the Medical Expenditure Panel Survey (MEPS) data set coming from responses to panel surveys given to households and their employers, medical providers, and insurance providers over two year periods.
Duncan et al.2 compared several different supervised learning and statistical models to predict patients’ cost including M5, Lasso and boosted trees. They ran their experiments on 30,000 patients, where the information from 2008 was used for training and the total allowed amounts in the claims from 2009 were used for testing. They used a variety of predictors as input including the previous year’s total cost, total medical cost, total pharmacy cost, demographic information, total visits and chronic conditions (83 different conditions). The results showed that boosted trees and M5 were the most effective classifiers in terms of R2 and Mean Absolute Error (MAE) respectively, and that cost predictors were the strongest predictors. Moreover, confirming previous literature results, this paper showed that statistical methods are not as good as supervised learning techniques.
Kuo et al.25 attempted to show the significance of pharmacy-based metrics as opposed to diagnosis-based morbidity measures in predicting patients’ costs and outpatient visits. They used data from 2006 to predict patients’ billed costs in 2007. To achieve this, they applied linear regression on the data set. Evaluation was done based on Mean Absolute Error and R2. Although the purpose of the study was to explore the capability of the pharmacy-based metric in cost prediction, the results confirmed that using cost based features for cost prediction has almost the same accuracy as adding other types of features to the input. This paper did not incorporate sophisticated cost features and just used a single cost feature from 2006. Frees et al.26 studied the ability of linear regression to predict individuals’ costs in terms of healthcare insurance payments. Their input consisted of demographic and survey-based information, including self-rated physical health and self-rated mental health provided by participants. Obtaining a reasonable performance (i.e., R2=0.27), they found that cost, self-rated mental health and self-rated physical health are the most important predictors.
Collectively, and in particular in the study by Bertsimas et al.6, these studies found that cost on cost prediction can match the performance of predictions made using clinical input factors or clinical plus cost input factors.
2. What are the input features that have been used for cost on cost prediction?
Input features are one of the essential parts of a supervised learning task. Numeric cost prediction studies have benefited from a variety of features as input, which are summarized in Table 2. As seen, Bertsimas et al.6 evaluated a wide range of cost inputs and reported the performance of cost inputs separately. Their results showed that prediction using a superset of 1542 features, including clinical features, had the same performance as using just the 21 cost predictors. This finding was confirmed by other researchers in subsequent work 2,18.
Table 2.
Input features used for cost on cost prediction in the literature
| Paper | Number of Cost Inputs | Cost Inputs | Non Cost Inputs |
|---|---|---|---|
| Bertsimas (2008)6 | 21 | Monthly cost (12), Total pharmacy cost, Total medical cost, Total cost, Total cost in last 6 months, Total cost in last 3 months, Trend, Acute, Months above average, Cost of highest month | Age, Sex, Diagnosis groups, Count of claims with diagnosis codes from each group, Procedure groups, Drug groups, Count of members’ diagnoses, procedures, and drugs |
| Duncan (2016)2 | 4 | Professional costs, Pharmacy costs, Outpatient costs, Inpatient costs | Age, Sex, Diagnosis codes grouped into coexisting condition categories, Total visit count, Hospital admission count, Primary care provider visits count |
| Sushmita (2015)18 | 1 | Total previous cost | Age, Sex, Diagnosis Groups, Procedure groups, Comorbidity scores |
| Kuo (2011)25 | 1 | Previous medication cost | Age, Sex, Elixhauser’s index, Pharmacy-based metrics |
| Frees (2013)26 | 1 | Total previous cost | Sex, Race, Region, Education, Job, Marriage, Income level, Self-rated physical health, Self-rated mental health |
3. What are the supervised learning methods that have been used for cost on cost prediction?
There are a variety of supervised learning methods that have been used in this area. Table 3 summarizes all different methods that have been reported as successful methods for cost on cost prediction. These methods include Lasso, which is a type of linear regression, gradient boosting on regression decision trees, M5 regression decision tree, random forest, linear regression and CART regression tree. Table 3 also shows the target type of the cost that was studied in each paper. Billed amount is the total amount that is charged by the health care provider and the paid amount is the amount that is paid by the insurance company.
Table 3.
Supervised learning methods used for cost on cost prediction in literature
| Paper | Method | Outcome |
|---|---|---|
| Duncan (2016)2 | Gradient Boosting DT, Lasso, M5 | Paid amount |
| Sushmita (2015)18 | M5, RandomForest, CART | Billed amount |
| Frees (2013)26 | Linear regression | Paid amount |
| Kuo (2011)25 | Linear regression | Billed amount |
| Bertsimas (2008)6 | CART | Paid amount |
4. What are the performance measures and evaluation results for cost on cost prediction?
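MAE: This shows the average error of the model on prediction of the actual cost values and is calculated as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert a_i - p_i\rvert$$
where $a_i$ and $p_i$ are the actual and predicted costs of member $i$ in the result period respectively, and $n$ is the number of members.
Mean absolute percentage error (MAPE)25: This is a modified version of absolute error in which the MAE is divided by the mean of the cost, so that the MAE could be compared across the models with different means of cost:
$$\mathrm{MAPE} = \frac{\mathrm{MAE}}{\bar{a}}, \qquad \bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$$
MAE is dependent on the data set, such that different models from different studies cannot be directly compared using that measure. MAPE is a relative measure and does not have this limitation.
R2: This shows the squared Pearson correlation between the actual and predicted cost values:
$$R^2 = \frac{\left(\sum_{i=1}^{n}(a_i - \bar{a})(p_i - \bar{p})\right)^2}{\sum_{i=1}^{n}(a_i - \bar{a})^2\,\sum_{i=1}^{n}(p_i - \bar{p})^2}, \qquad \bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i$$
Hit Ratio: This measure shows the percentage of the members for whom a model forecasts the correct cost bucket:
$$\mathrm{Hit\ Ratio} = \frac{100}{n}\sum_{i=1}^{n}\mathbf{1}\left[b(p_i) = b(a_i)\right]$$
where $b(\cdot)$ maps a cost to its cost bucket.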
Penalty Error6: This is a performance measure for cost prediction based on domain knowledge. Penalty error penalizes models for underestimating high cost members more than overestimating low cost members, which is motivated by the estimated opportunity loss. Table 4 shows the penalty table for the five-cost-bucket scheme. The final value of the penalty error is calculated from the average forecast penalty per member of a given sample.
Table 4.
Penalty table based on the predicted and actual cost buckets
| Predicted Bucket | Actual Bucket 1 | Actual Bucket 2 | Actual Bucket 3 | Actual Bucket 4 | Actual Bucket 5 |
|---|---|---|---|---|---|
| 1 | 0 | 2 | 4 | 6 | 8 |
| 2 | 1 | 0 | 2 | 4 | 6 |
| 3 | 2 | 1 | 0 | 2 | 4 |
| 4 | 3 | 2 | 1 | 0 | 2 |
| 5 | 4 | 3 | 2 | 1 | 0 |
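The five measures described above can be sketched in a few lines. This is an illustrative sketch, not the evaluation code used in the study. Two points are assumptions: R2 is computed as the squared Pearson correlation, and the penalty rule (2 points per bucket when a member's bucket is underestimated, 1 point per bucket when it is overestimated) is inferred from the entries of Table 4. All function names are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error between actual and predicted costs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """MAE divided by the mean actual cost (the definition used in the text)."""
    return mae(actual, predicted) / (sum(actual) / len(actual))

def r_squared(actual, predicted):
    """Squared Pearson correlation; an assumed reading of R2."""
    n = len(actual)
    a_bar, p_bar = sum(actual) / n, sum(predicted) / n
    cov = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    var_a = sum((a - a_bar) ** 2 for a in actual)
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)

def hit_ratio(actual_buckets, predicted_buckets):
    """Percentage of members whose predicted bucket equals the actual bucket."""
    hits = sum(a == p for a, p in zip(actual_buckets, predicted_buckets))
    return 100.0 * hits / len(actual_buckets)

def penalty(predicted_bucket, actual_bucket):
    """Assumed rule: 2 per bucket underestimated, 1 per bucket overestimated."""
    gap = actual_bucket - predicted_bucket
    return 2 * gap if gap > 0 else -gap

def penalty_error(actual_buckets, predicted_buckets):
    """Average penalty per member."""
    total = sum(penalty(p, a) for a, p in zip(actual_buckets, predicted_buckets))
    return total / len(actual_buckets)

# Example: predicting bucket 3 for a bucket 5 member costs 4 points,
# whereas predicting bucket 2 for a bucket 1 member costs 1 point.
example_penalty = penalty_error([1, 1, 2, 5], [1, 2, 2, 3])  # (0 + 1 + 0 + 4) / 4
```

The asymmetry in `penalty` reflects the motivation given for this measure: missing a high cost member is treated as a larger opportunity loss than flagging a low cost member unnecessarily.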
Table 5 summarizes the evaluation measures used in different papers. The reported performance measures in this table correspond to the whole data set used in each study. This study reports the experimental results in terms of all five performance measures except MAE, which is not reported given the sensitivity of absolute cost data.
Table 5.
Performance measures and outcome for cost on cost prediction in literature
| Measure | Bertsimas (2008)6 | Duncan (2016)2 | Sushmita (2015)18 | Kuo (2011)25 | Frees (2013)26 |
|---|---|---|---|---|---|
| MAE($) | 2,214 | 3,104 | 8,112 | 507 | 2,705 |
| MAPE | - | - | - | 0.75 | 5.25 |
| R2 | 0.16 | 0.20 | - | 0.47 | 0.27 |
| Hit Ratio | 84.6 | - | - | - | - |
| Penalty Error | 0.38 | - | - | - | - |
Direct Comparison of Alternative Cost Prediction Methods using a Health Insurer Data Set
Tables 6 to 9 show the performance comparison between different supervised learning models on training and validation data sets. As seen, Gradient Boosting had the highest performance overall and in the low to medium cost buckets (buckets one to three). In the high cost buckets (buckets four and five), ANN and Ridge were superior. Across all buckets, the Ridge model performed comparably to ANN.
Table 6.
Performance comparison among different supervised learning models for numeric measures on the training data set. Models that are annotated with (l) have been used in the cost on cost prediction literature before (see Table 3), while those annotated with (n) are new to this study.
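| Method | MAPE (All) | MAPE (Bucket 1) | MAPE (Bucket 2) | MAPE (Bucket 3) | MAPE (Bucket 4) | MAPE (Bucket 5) | R2 (All) | R2 (Bucket 1) | R2 (Bucket 2) | R2 (Bucket 3) | R2 (Bucket 4) | R2 (Bucket 5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|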
| Gradient Boosting (l) | 0.62 | 0.74 | 0.61 | 0.58 | 0.58 | 0.53 | 0.48 | 0.07 | 0.14 | 0.18 | 0.16 | 0.35 |
| ANN (n) | 0.67 | 0.83 | 0.67 | 0.64 | 0.50 | 0.42 | 0.46 | 0.04 | 0.10 | 0.14 | 0.27 | 0.46 |
| Ridge (n) | 0.69 | 0.83 | 0.67 | 0.65 | 0.50 | 0.41 | 0.44 | 0.04 | 0.09 | 0.14 | 0.29 | 0.45 |
| SVM (n) | 0.75 | 1.01 | 0.70 | 0.66 | 0.56 | 0.50 | 0.43 | 0.04 | 0.09 | 0.13 | 0.22 | 0.37 |
| Elastic Net (n) | 0.77 | 1.01 | 0.70 | 0.67 | 0.58 | 0.50 | 0.42 | 0.04 | 0.09 | 0.13 | 0.19 | 0.33 |
| Lasso (l) | 0.80 | 1.13 | 0.71 | 0.67 | 0.58 | 0.51 | 0.42 | 0.04 | 0.08 | 0.13 | 0.19 | 0.34 |
| M5 (l) | 0.80 | 1.13 | 0.71 | 0.68 | 0.56 | 0.51 | 0.42 | 0.04 | 0.08 | 0.13 | 0.19 | 0.33 |
| Linear Regression (l) | 0.80 | 1.14 | 0.72 | 0.67 | 0.58 | 0.51 | 0.42 | 0.04 | 0.08 | 0.13 | 0.18 | 0.34 |
| Random Forest (l) | 0.90 | 1.14 | 0.87 | 0.74 | 0.73 | 0.55 | 0.41 | 0.03 | 0.06 | 0.14 | 0.07 | 0.37 |
| Bagging (n) | 0.90 | 1.14 | 0.85 | 0.77 | 0.65 | 0.57 | 0.40 | 0.02 | 0.06 | 0.10 | 0.08 | 0.36 |
| CART (l) | 0.95 | 1.17 | 0.94 | 0.80 | 0.74 | 0.62 | 0.32 | 0.02 | 0.05 | 0.04 | 0.05 | 0.21 |
Table 9.
Performance comparison among different supervised learning models for categorical measures on the validation data set. Models that are annotated with (l) have been used in the cost on cost prediction literature before (see Table 3), while those annotated with (n) are new to this study.
| Method | Hit Ratio % (All) | Hit Ratio % (Bucket 1) | Hit Ratio % (Bucket 2) | Hit Ratio % (Bucket 3) | Hit Ratio % (Bucket 4) | Hit Ratio % (Bucket 5) | Penalty Error (All) | Penalty Error (Bucket 1) | Penalty Error (Bucket 2) | Penalty Error (Bucket 3) | Penalty Error (Bucket 4) | Penalty Error (Bucket 5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GradientBoosting (l) | 92.9 | 96.4 | 72.3 | 61.2 | 50.8 | 35.2 | 0.20 | 0.12 | 0.67 | 0.92 | 1.08 | 1.20 |
| ANN (n) | 89.2 | 94.0 | 69.5 | 53.9 | 54.9 | 49.6 | 0.22 | 0.14 | 0.70 | 0.97 | 0.98 | 0.97 |
| Ridge (n) | 89.1 | 93.9 | 69.4 | 55.4 | 53.6 | 49.7 | 0.22 | 0.14 | 0.71 | 0.94 | 0.99 | 0.96 |
| SVM (n) | 88.9 | 93.7 | 67.9 | 55.4 | 52.4 | 48.1 | 0.22 | 0.14 | 0.70 | 0.93 | 1.07 | 1.11 |
| ElasticNet (n) | 88.5 | 93.5 | 66.6 | 55.4 | 50.0 | 47.7 | 0.22 | 0.14 | 0.70 | 0.93 | 1.13 | 1.15 |
| Lasso (l) | 88.4 | 93.4 | 65.9 | 54.7 | 50.8 | 47.7 | 0.22 | 0.14 | 0.69 | 0.94 | 1.12 | 1.16 |
| M5 (l) | 88.4 | 93.4 | 65.8 | 54.5 | 50.8 | 47.1 | 0.22 | 0.14 | 0.69 | 0.94 | 1.12 | 1.16 |
| LinearRegression (l) | 88.4 | 93.4 | 65.7 | 54.2 | 50.8 | 46.6 | 0.22 | 0.14 | 0.69 | 0.94 | 1.12 | 1.15 |
| RandomForest (l) | 85.6 | 90.9 | 61.8 | 46.3 | 49.6 | 45.5 | 0.24 | 0.16 | 0.74 | 1.07 | 1.09 | 1.11 |
| Bagging (n) | 85.6 | 90.9 | 61.6 | 50.2 | 44.7 | 39.8 | 0.24 | 0.16 | 0.73 | 1.00 | 1.13 | 1.17 |
| CART (l) | 85.3 | 90.8 | 61.5 | 51.1 | 42.2 | 40.3 | 0.27 | 0.17 | 0.85 | 1.08 | 1.24 | 1.39 |
Discussion
Summary of findings. This study reviewed the literature on healthcare cost prediction and found that cost on cost prediction performs as well as or better than cost prediction using clinical data or clinical data plus cost data. Supervised learning methods were also found to be superior to statistical methods in predictive ability. In our empirical comparison, gradient boosting provided the best cost on cost prediction models in general, with ANN providing superior performance for higher cost patients. The evaluations show consistency between training and validation results.
Strengths. An important strength of this study is that we combined both a systematic literature review and a head-to-head empirical evaluation of different supervised learning methods reported in the literature. An additional strength is that we evaluated state-of-the-art supervised learning methods not previously evaluated in the literature for cost on cost prediction in health care.
Limitations. The main limitation of this study is that we used only one data set. More experiments on different data sets from different institutions and regions could provide more solid evidence on the comparative performance of different algorithms. The second limitation of this study is that we used only cost features. Although previous studies showed that medical features did not improve the performance of the cost models, we could potentially still benefit from such features for two reasons. One is that newer supervised learning methods may benefit from the medical features. The second is that medical features have more explanatory power, which may help decision makers understand the root causes of members’ costs.
Future studies. This study was devoted to the paid amount of the medical claims. An interesting avenue of research would be analyzing the billed amount as well as the out-of-pocket amount paid by patients to see which approaches work best for each type of cost metric. Another future research direction would be to explore the use of more advanced supervised learning methods such as deep learning and structure analysis to improve the performance of cost prediction methods. Finally, adding medical features and benefiting from their predictive and explanatory power is another future research direction, which our team has already started to pursue.
Conclusion
The literature indicates that the preferred approach to healthcare cost prediction is cost on cost prediction using supervised learning methods. Empirical analysis of alternate approaches using data from a single health insurer found that gradient boosting provides the best cost on cost prediction models in general, with ANN providing superior performance for higher cost patients.
Table 7.
Performance comparison among different supervised learning models for categorical measures on the training data set. Models that are annotated with (l) have been used in the cost on cost prediction literature before (see Table 3), while those annotated with (n) are new to this study.
| | Hit Ratio (%) | | | | | | Penalty Error | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | All | 1 | 2 | 3 | 4 | 5 | All | 1 | 2 | 3 | 4 | 5 |
| GradientBoosting (l) | 94.8 | 98.7 | 74.3 | 64.9 | 52.7 | 37.5 | 0.17 | 0.10 | 0.63 | 0.89 | 1.07 | 1.10 |
| ANN (n) | 91.5 | 96.1 | 71.8 | 56.8 | 56.9 | 51.8 | 0.20 | 0.13 | 0.67 | 0.95 | 0.97 | 0.96 |
| Ridge (n) | 91.6 | 94.9 | 71.5 | 58.9 | 55.8 | 51.9 | 0.20 | 0.13 | 0.69 | 0.91 | 0.97 | 0.95 |
| SVM (n) | 91.1 | 94.7 | 69.7 | 57.9 | 54.2 | 50.7 | 0.20 | 0.13 | 0.67 | 0.90 | 1.06 | 1.09 |
| ElasticNet (n) | 90.9 | 94.7 | 68.9 | 57.1 | 52.3 | 49.9 | 0.20 | 0.13 | 0.67 | 0.90 | 1.11 | 1.14 |
| Lasso (l) | 90.8 | 94.7 | 68.5 | 57.1 | 51.9 | 49.7 | 0.20 | 0.13 | 0.67 | 0.91 | 1.10 | 1.15 |
| M5 (l) | 90.6 | 94.6 | 67.9 | 56.3 | 51.9 | 49.7 | 0.20 | 0.13 | 0.67 | 0.91 | 1.10 | 1.15 |
| LinearRegression (l) | 90.1 | 94.2 | 67.9 | 56.1 | 51.9 | 47.5 | 0.20 | 0.13 | 0.66 | 0.91 | 1.10 | 1.14 |
| RandomForest (l) | 88.8 | 93.7 | 62.9 | 48.9 | 50.9 | 46.6 | 0.23 | 0.15 | 0.71 | 1.05 | 1.08 | 1.10 |
| Bagging (n) | 86.7 | 93.6 | 62.7 | 52.1 | 47.6 | 43.6 | 0.23 | 0.15 | 0.71 | 0.98 | 1.11 | 1.15 |
| CART (l) | 86.1 | 93.2 | 62.5 | 53.0 | 44.4 | 41.6 | 0.25 | 0.16 | 0.82 | 1.05 | 1.19 | 1.31 |
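
As a rough illustration of how a categorical measure such as the hit ratio in Table 7 might be computed, the sketch below treats a "hit" as a member whose predicted cost falls in the same cost bucket as the actual cost, following the definition used by Bertsimas et al. (reference 6). This is an assumption about the measure, not a reproduction of this study's code: the bucket boundaries in `BUCKET_EDGES`, the helper names, and the sample data are all hypothetical. The penalty error is not sketched because its penalty table is not given in this section.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries in dollars (NOT taken from the paper):
# costs below 1_000 fall in bucket 1, costs of 25_000 or more in bucket 5.
BUCKET_EDGES = [1_000, 5_000, 10_000, 25_000]


def cost_bucket(cost, edges=BUCKET_EDGES):
    """Map a dollar amount to a bucket number in 1..len(edges)+1."""
    return bisect_right(edges, cost) + 1


def hit_ratio(actual, predicted, edges=BUCKET_EDGES):
    """Return (overall, per_bucket) hit ratios in percent.

    A hit is a member whose predicted cost lands in the same bucket as
    the actual cost. Per-bucket ratios are grouped by the actual bucket.
    """
    hits, totals = {}, {}
    for a, p in zip(actual, predicted):
        b = cost_bucket(a, edges)
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + (cost_bucket(p, edges) == b)
    per_bucket = {b: 100.0 * hits[b] / totals[b] for b in sorted(totals)}
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    return overall, per_bucket


# Made-up actual and predicted annual costs for five members.
actual = [200, 800, 3_000, 7_500, 30_000]
predicted = [350, 1_200, 4_100, 12_000, 40_000]
overall, per_bucket = hit_ratio(actual, predicted)
print(f"overall hit ratio: {overall:.1f}%")
print(per_bucket)
```

Grouping by the actual bucket mirrors the column layout of the table (All, then buckets 1 to 5), so a model can score well overall while doing poorly in the sparsely populated high cost buckets.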
Table 8.
Performance comparison among different supervised learning models for numeric measures on the validation data set. Models that are annotated with (l) have been used in the cost on cost prediction literature before (see Table 3), while those annotated with (n) are new to this study.
| | MAPE | | | | | | R2 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | All | 1 | 2 | 3 | 4 | 5 | All | 1 | 2 | 3 | 4 | 5 |
| GradientBoosting (l) | 0.65 | 0.76 | 0.63 | 0.60 | 0.59 | 0.54 | 0.46 | 0.04 | 0.11 | 0.15 | 0.13 | 0.32 |
| ANN (n) | 0.70 | 0.84 | 0.69 | 0.66 | 0.52 | 0.45 | 0.44 | 0.02 | 0.07 | 0.11 | 0.25 | 0.44 |
| Ridge (n) | 0.71 | 0.85 | 0.70 | 0.67 | 0.51 | 0.44 | 0.41 | 0.02 | 0.07 | 0.11 | 0.27 | 0.43 |
| SVM (n) | 0.78 | 1.00 | 0.72 | 0.68 | 0.58 | 0.52 | 0.41 | 0.02 | 0.07 | 0.12 | 0.20 | 0.36 |
| ElasticNet (n) | 0.80 | 1.06 | 0.73 | 0.68 | 0.60 | 0.53 | 0.40 | 0.02 | 0.07 | 0.12 | 0.16 | 0.30 |
| Lasso (l) | 0.83 | 1.14 | 0.74 | 0.68 | 0.60 | 0.53 | 0.40 | 0.02 | 0.07 | 0.12 | 0.16 | 0.31 |
| M5 (l) | 0.83 | 1.15 | 0.74 | 0.69 | 0.58 | 0.55 | 0.40 | 0.02 | 0.07 | 0.12 | 0.16 | 0.31 |
| LinearRegression (l) | 0.83 | 1.16 | 0.74 | 0.68 | 0.60 | 0.53 | 0.40 | 0.02 | 0.07 | 0.12 | 0.16 | 0.31 |
| RandomForest (l) | 0.91 | 1.17 | 0.90 | 0.77 | 0.75 | 0.58 | 0.40 | 0.02 | 0.05 | 0.13 | 0.08 | 0.34 |
| Bagging (n) | 0.90 | 1.16 | 0.88 | 0.80 | 0.68 | 0.55 | 0.39 | 0.01 | 0.04 | 0.09 | 0.09 | 0.34 |
| CART (l) | 0.98 | 1.23 | 1.01 | 0.83 | 0.77 | 0.66 | 0.29 | 0.01 | 0.03 | 0.02 | 0.03 | 0.18 |
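
For readers who want to relate the numeric measures in Table 8 to their usual textbook definitions, the following is a minimal sketch of MAPE and R2 in plain Python. It assumes the standard formulas (MAPE as the mean of |actual - predicted| / |actual|, reported as a fraction, and R2 as one minus the ratio of residual to total sum of squares). The exact variants used in this study, for example how members with zero actual cost are handled, are not stated in this section, so skipping zero-cost members below is an assumption, and the sample data are made up.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.65 means 65%).

    Members with an actual cost of zero are skipped (an assumption),
    because the percentage error is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted costs for three members.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(f"MAPE = {mape(actual, predicted):.3f}")
print(f"R2   = {r_squared(actual, predicted):.3f}")
```

To obtain per-bucket columns, the same functions would be applied to the subset of members whose actual cost falls in each bucket. Restricting to a bucket narrows the range of actual costs and so shrinks the total sum of squares, which is one plausible reason per-bucket R2 values can sit well below the overall value.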
References
- 1. The Centers for Medicare & Medicaid Services (CMS) DoHaHS, United States. National Health Expenditure Data 2016. Available from: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/index.html.
- 2. Duncan I, Loginov M, Ludkovski M. Testing Alternative Regression Frameworks for Predictive Modeling of Health Care Costs. North American Actuarial Journal. 2016;20(1):65–87.
- 3. Burwell SM. Setting value-based payment goals--HHS efforts to improve US health care. 2015. doi: 10.1056/NEJMp1500445.
- 4. Lahiri C, Agarwal N. Predicting healthcare expenditure increase for an individual from medicare data. Proceedings of the ACM SIGKDD Workshop on Health Informatics. 2014.
- 5. Montori VM, Wilczynski NL, Morgan D, Haynes RB. Optimal search strategies for retrieving systematic reviews from Medline: analytical survey. BMJ. 2005;330(7482):68. doi: 10.1136/bmj.38336.804167.47.
- 6. Bertsimas D, Bjarnadóttir MV, Kane MA, Kryder JC, Pandey R, Vempala S, et al. Algorithmic prediction of health-care costs. Operations Research. 2008;56(6):1382–92.
- 7. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996:267–88.
- 8. Muniz G, Kibria BG. On some ridge regression estimators: An empirical comparisons. Communications in Statistics - Simulation and Computation. 2009;38(3):621–30.
- 9. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–20.
- 10. Timofeev R. Classification and regression trees (CART) theory and applications: Humboldt University, Berlin. 2004.
- 11. Loh WY. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2011;1(1):14–23. doi: 10.1002/widm.14.
- 12. Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.
- 13. Sutton CD. 11-Classification and Regression Trees, Bagging, and Boosting. Handbook of statistics. 2005;24:303–29.
- 14. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST). 2011;2(3):27.
- 15. Yegnanarayana B. Artificial neural networks: PHI Learning Pvt. Ltd. 2009.
- 16. Demsar J. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research. 2006;7:1–30.
- 17. Kronick R, Gilmer T, Dreyfus T, Ganiats T. CDPS-Medicare: The chronic illness and disability payment system modified to predict expenditures for Medicare beneficiaries. Final Report to CMS. 2002.
- 18. Sushmita S, Newman S, Marquardt J, Ram P, Prasad V, Cock MD, et al. Population cost prediction on public healthcare datasets. Proceedings of the 5th International Conference on Digital Health 2015; ACM; 2015.
- 19. Jones AM. Models for health care: University of York. Centre for Health Economics. 2010.
- 20. Mihaylova B, Briggs A, O’Hagan A, Thompson SG. Review of statistical methods for analysing healthcare resources and costs. Health economics. 2011;20(8):897–916. doi: 10.1002/hec.1653.
- 21. König H-H, Leicht H, Bickel H, Fuchs A, Gensichen J, Maier W, et al. Effects of multiple chronic conditions on health care costs: an analysis based on an advanced tree-based regression model. BMC health services research. 2013;13(1):1. doi: 10.1186/1472-6963-13-219.
- 22. Lee S-M, Kang J-O, Suh Y-M. Comparison of hospital charge prediction models for colorectal cancer patients: neural network vs. decision tree models. Journal of Korean medical science. 2004;19(5):677–81. doi: 10.3346/jkms.2004.19.5.677.
- 23. Powers CA, Meyer CM, Roebuck MC, Vaziri B. Predictive modeling of total healthcare costs using pharmacy claims data: a comparison of alternative econometric cost modeling techniques. Medical care. 2005;43(11):1065–72. doi: 10.1097/01.mlr.0000182408.54390.00.
- 24. Guo X, Gandy W, Coberley C, Pope J, Rula E, Wells A. Predicting health care cost transitions using a multidimensional adaptive prediction process. Population health management. 2015;18(4):290–9. doi: 10.1089/pop.2014.0087.
- 25. Kuo RN, Dong Y-H, Liu J-P, Chang C-H, Shau W-Y, Lai M-S. Predicting healthcare utilization using a pharmacy-based metric with the WHO’s anatomic therapeutic chemical algorithm. Medical care. 2011;49(11):1031–9. doi: 10.1097/MLR.0b013e31822ebe11.
- 26. Frees EW, Jin X, Lin X. Actuarial applications of multivariate two-part regression models. Annals of Actuarial Science. 2013;7(02):258–87.
