Table 3.
Sensitivity analysis for various high-cost user thresholds: predictive model performance
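| Prediction models | 30% high-cost users prevalence: Sensitivity<sup>a</sup> | 30% high-cost users prevalence: F1<sup>d</sup> | 20% prevalence (the base case): Sensitivity<sup>a</sup> | 20% prevalence (the base case): F1<sup>d</sup> | 10% prevalence: Sensitivity<sup>a</sup> | 10% prevalence: F1<sup>d</sup> | 5% prevalence: Sensitivity<sup>a</sup> | 5% prevalence: F1<sup>d</sup> |
|---|---|---|---|---|---|---|---|---|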
| Traditional regression models | ||||||||
| All conventional variables (TRM1)<sup>e</sup> | 17.9% | 26.4% | 4.9% | 9.1% | * | * | * | * |
| As per TRM1 but no ethnicity variables (TRM2) | 16.5% | 25.8% | 4.9% | 9.0% | * | * | * | * |
| As per TRM2 but no smoking variables (TRM3) | 16.3% | 25.6% | 4.6% | 8.6% | * | * | * | * |
| Machine learning models<sup>f</sup> | ||||||||
| Random forest | 45.2% | 49.3% | 37.8% | 41.2% | 29.9% | 32.6% | 25.6% | 28.5% |
| KNN | 45.7% | 46.5% | 38.0% | 39.0% | 29.2% | 30.1% | 25.2% | 26.0% |
| L1-regularised logistic regression | 75.2% | 50.9% | 78.9% | 34.5% | 72.5% | 21.0% | 76.2% | 25.0% |
| Classification trees | 46.1% | 55.3% | 19.5% | 30.6% | 11.4% | 19.8% | 10.9% | 19.5% |
Note: *Results produced from the model were unstable due to the small number of CVD events relative to the total number of observations.
a, b, c, d, e, f: see Table 2
Results for the traditional regression model specified as per TRM3 but without chronic condition variables are not reported because this model had very poor predictive power.
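
Because the table reports sensitivity and F1 but not precision, it can help to back out the precision implied by each pair of values. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and sensitivity; the exact definition used in this study is given in footnote d of Table 2, which is not reproduced here. The helper name `implied_precision` is hypothetical, and the inputs are the rounded percentages from the 20% prevalence (base case) columns, so the results are approximate.

```python
def implied_precision(sensitivity: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given sensitivity R.

    Assumes the standard harmonic-mean definition of F1 (an assumption,
    not confirmed by the table itself).
    """
    denom = 2 * sensitivity - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * sensitivity")
    return f1 * sensitivity / denom


# Base case (20% prevalence) values taken from Table 3, as proportions.
rf_precision = implied_precision(0.378, 0.412)   # random forest
l1_precision = implied_precision(0.789, 0.345)   # L1-regularised logistic regression

print(f"Random forest implied precision: {rf_precision:.3f}")
print(f"L1 logistic implied precision:   {l1_precision:.3f}")
```

Under that assumption, the high sensitivity of the L1-regularised logistic regression comes with a lower implied precision than the random forest, which is consistent with its lower F1 at the base case.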