(a) Model performance when trained before the time
point tP and tested after
tP, both on the entire future
patient population and on subgroups of patients for whom the model has
or has not seen historical information during training. The model
maintains a comparable level of performance on unseen future data, with a
higher sensitivity of 59% for a time window of up to 48 hours ahead of
time and a precision of two false positives per step for each true positive.
The ranges correspond to bootstrap pivotal 95% confidence intervals computed
with n=200 bootstrap samples. Note that this experiment is not a replacement for a prospective
evaluation of the model. (b) Cohort statistics for
(a), shown both before and after the temporal split tP
that was used to simulate model performance on future data. (c)
Comparison of model performance when applied to data from previously unseen
hospital sites. Data was split across sites so that 80% of the data was in
group A and 20% in group B. No site from
group B was present in group A and vice
versa. The data was split into training, validation, calibration and test in
the same way as in the other experiments. The table reports the
performance of a model trained on site group A when evaluated on
the test set within site group A versus the test set within
site group B, for predicting all AKI severities up to 48
hours ahead of time. Comparable performance is seen across all key
metrics. The 95% bootstrap pivotal confidence intervals are calculated using n=200
bootstrap samples. Note that the model would still need to be retrained to
generalise outside of the VA population to a different demographic and a
different set of clinical pathways and hospital processes elsewhere.