Published in final edited form as: J Am Coll Radiol. 2022 Aug 16;19(10):1162–1169. doi: 10.1016/j.jacr.2022.05.030

Machine Learning Model Drift: Predicting Diagnostic Imaging Follow-Up as a Case Example

Ronilda Lacson 1,2, Mahsa Eskian 1, Andro Licaros 1, Neena Kapoor 1,2, Ramin Khorasani 1,2
PMCID: PMC12790260  NIHMSID: NIHMS2124698  PMID: 35981636

Abstract

Objective:

To address model drift in a machine learning (ML) model for predicting diagnostic imaging follow-up by retraining the original model with data augmented with more recent data versus training new predictive models.

Methods:

This Institutional Review Board–approved retrospective study was conducted 1/1/2016–12/31/2020 at a large academic institution. A previously developed ML model had been trained on 1,000 radiology reports from 2016 (old data). An additional 1,385 randomly selected reports from 2019–2020 (new data) were annotated for follow-up recommendations and randomly divided into two sets: training (n=900) and testing (n=485). Support vector machine and random forest (RF) algorithms were constructed and trained using the 900 new training reports plus old data (augmented data, new models) and using only new data (new data, new models). The 2016 baseline model was used as a comparator both “as is” and after retraining with augmented data. Recall was compared with baseline using McNemar’s test.

Results:

Follow-up recommendations were present in 11.3% of reports (157/1,385). The baseline model retrained with augmented data had precision of 0.83 and recall of 0.54, neither significantly different from baseline. A new RF model trained with augmented data had significantly better recall vs. the baseline model (0.80 vs. 0.66, p=0.04) and comparable precision (0.90 vs. 0.86).

Discussion:

ML models for monitoring follow-up recommendations in radiology reports suffer model drift over time. A newly developed RF model achieved better recall with comparable precision vs. simply retraining the previously trained original model with augmented data. Thus, these models must be regularly assessed and updated using more recent data.

Keywords: Diagnostic Imaging, Machine Learning, Model Drift

Summary Statement:

When machine learning is used to monitor follow-up recommendations, careful attention to model drift is necessary, including ongoing monitoring of model performance and corrective action when drift exists.

Introduction

Evidence-based recommendations for follow-up testing made by the radiologist to the patient and to other providers need to be followed diligently to avoid delayed or missed diagnosis (diagnostic error).(1) Conversely, patients who do not require follow-up should not be exposed to the risks of an unnecessary imaging test.(2–5) Some recommendations are based on strong evidence from published studies; others are based solely on opinion.(6, 7) Quality initiatives to ensure appropriateness of follow-up recommendations rely on identifying reports containing such recommendations. However, follow-up recommendations are noted by radiologists as free text within the radiology report. It is challenging to identify and extract them given the varied language radiologists use to communicate them.(8)

We previously published a machine learning algorithm for identifying radiology reports with follow-up recommendations.(32) This algorithm has been used at our institution for several years to: 1) benchmark recommendations for follow-up of diagnostic imaging findings including pulmonary nodules,(27) and 2) assess variation in follow-up recommendations by subspecialty.(8) Recently, we noted that the accuracy of the machine learning model has degraded over time. This known phenomenon – model drift – has been reported repeatedly in machine learning models in healthcare and other domains.(9–11)

Model drift reflects the observation that no model lasts forever. It consists of data drift and/or concept drift. Data drift reflects changes in the input data. When the distribution of variables in the new data is meaningfully different, the trained model does not perform as well.(12) Data drift may occur when the demographics of the patient population evolve over time or new practices bring new populations to a health system.(10, 13) For example, the radiologists within a practice or institution may change over time, thus changing the distribution of follow-up recommendations. Concept drift, on the other hand, reflects changes in the relationships between the model inputs and outputs. When the patterns the trained model learned no longer hold, even when the data distribution remains the same, the model’s performance suffers.(11, 14, 15) This occurs when changes in external factors influence recommendations. For instance, a new system for follow-up may be initiated and can impact how recommendations are stated.(16) In addition, evidence evolves, and new recommendations may be established for diagnosis or screening.(17–19)

Model drift occurs in healthcare models(10, 20) as well as in predictive models used in industrial and commercial settings.(12, 21) Model drift also occurs in interpretive machine learning algorithms (e.g., machine vision for lesion detection), which use deep learning. Image classification is subject to several temporal data perturbations that may lead to classification deterioration.(22, 23) Perturbations include the shift from 2D to 3D imaging (e.g., in mammography) and stronger magnets, gradients, and improved receiver coils (e.g., in MRI). Although regularly assessing and updating these models is necessary to ensure accurate performance, there is no standard approach to addressing model drift.

This study aimed to address model drift in a machine learning model used to predict diagnostic imaging follow-up by retraining the original model using data augmented with more recent data versus training new predictive models. To assess their performance, we compared them with the performance of the original validated machine learning model. Secondary objectives included evaluating the performance of the original machine learning model over time and assessing the current rate of follow-up recommendations in radiology reports.

Methods

Study Setting and Cohort

This Institutional Review Board–approved retrospective study was conducted at a large academic institution using data collected between 1/1/2016 and 12/31/2020. Reports were extracted from the institution’s radiology information system from among X-ray, CT, MR, ultrasound, and nuclear medicine studies. The institutional database contains over 600,000 inpatient, outpatient, and Emergency Department reports annually. A minimum test set sample size of 295 was needed to detect a greater than 10% difference from the precision (0.88) and recall (0.82) of our previously trained baseline model, with a 95% confidence level and 80% power.(32) A total of 1,385 radiology text reports were randomly extracted from data collected in 2019–2020 to provide a 65:35 training-testing split (900 training, 485 testing) for retraining the baseline model and developing new ones. Additionally, 200 reports each from 2018, 2019, and 2020 were randomly selected to describe the performance of the baseline model annually. In addition to the radiology text report content, study modality, patient care setting, patient date of birth, and patient gender were also collected.

A previously annotated data set, along with the prior model trained on it, was also used. That baseline machine learning model was trained on 1,000 reports from 1/1/2016 to 12/31/2016 (“old data”), as previously described.(32)

Manual Data Annotation

The radiology reports were manually labeled for the presence of one or more follow-up recommendations by two research fellows (ME, AL). They both annotated 60 radiology reports from the 2019–2020 data set to assess interrater agreement using the kappa statistic. Recommendations for follow-up were defined as any phrase that might reasonably and explicitly suggest further imaging (e.g., CT scan) or procedural intervention (e.g., biopsy). Phrases suggesting clinical correlation or clinical review were not considered follow-up. For example, “suggest short interval follow-up” and “recommend chest CT” constituted follow-up phrases; “clinical correlation recommended” and “recommend comparison with prior imaging” did not. Follow-up rate is defined as the number of reports with documented follow-up divided by the total reports annotated.
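
As an aside, this type of interrater agreement can be computed with scikit-learn; the sketch below is a minimal illustration with made-up labels, not the authors’ code.

# Minimal sketch: Cohen's kappa and percent agreement for two annotators,
# using hypothetical binary labels (1 = follow-up recommendation present).
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 0, 0, 1, 0, 1, 0, 0]
annotator_2 = [1, 0, 0, 1, 0, 0, 0, 0]

kappa = cohen_kappa_score(annotator_1, annotator_2)
percent_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
print(kappa, percent_agreement)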

The newly annotated corpus was then randomly divided for training and validation using 900 reports (training data) and testing using 485 reports (test data). We report the follow-up rate for the training and test data.

Four data sets were used in the study – (1) newly annotated training data (n=900, “new training data”), (2) newly annotated test data (n=485, “new test data”), (3) old data from 2016 used as training data (n=1000, “old training data”), and (4) a combination of (1) and (3) also used as training data (n=1900, “augmented training data”). The new training data were used to train newly developed models. The augmented training data were used to re-train the baseline model and train newly developed models.
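
For illustration, a minimal sketch of producing these splits with scikit-learn and pandas is shown below; the toy data frames and variable names are assumptions, not the study’s code.

# Hypothetical sketch: split newly annotated reports into training and test
# sets and build the augmented training set by combining old and new data.
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the annotated corpora (report text + binary follow-up label).
new_reports = pd.DataFrame({
    "text": ["Recommend chest CT in 3 months.", "No acute findings.",
             "Suggest short interval follow-up.", "Clinical correlation recommended."],
    "label": [1, 0, 1, 0],
})
old_reports = pd.DataFrame({
    "text": ["Recommend ultrasound follow-up.", "Normal study."],
    "label": [1, 0],
})

# 65:35 split of the new data (the study used 900 training and 485 test reports).
new_train, new_test = train_test_split(new_reports, test_size=0.35, random_state=0)

# Augmented training data = old training data plus new training data.
augmented_train = pd.concat([old_reports, new_train], ignore_index=True)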

Feature Extraction

Text from radiology reports was used as features in developing the machine learning algorithms. We have learned that follow-up recommendations are almost always in the Impression section of radiology reports, even when they are also mentioned in the Findings section. Therefore, we restricted our text features to text from the Impression section to reduce computational complexity. Text data were converted into a bag-of-words representation using the scikit-learn CountVectorizer utility and scaled to frequencies using the TfidfTransformer utility.(24) CountVectorizer converts text into a matrix of term (e.g., word) occurrence counts, and TfidfTransformer scales each term using term frequency multiplied by the inverse of the document frequency (i.e., the number of reports containing the term). CountVectorizer can also treat sequences of words as single units (e.g., pairs of successive words are 2-grams), applying n-grams to preserve word-order information. For all machine learning algorithms, we included the n-gram range as a tuned hyperparameter.
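
A minimal scikit-learn sketch of this feature extraction step is shown below; the example Impression text and the n-gram range are illustrative assumptions rather than the study’s actual configuration.

# Sketch: bag-of-words counts scaled to TF-IDF weights, with n-grams.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

impressions = [
    "Recommend chest CT in 3 months for pulmonary nodule.",
    "No acute findings. Clinical correlation recommended.",
]

# Count term (and n-gram) occurrences; ngram_range=(1, 3) keeps 1- to 3-grams.
vectorizer = CountVectorizer(ngram_range=(1, 3))
counts = vectorizer.fit_transform(impressions)

# Rescale raw counts by term frequency times inverse document frequency.
tfidf = TfidfTransformer()
features = tfidf.fit_transform(counts)
print(features.shape)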

Machine Learning Re-Training and New Models

The baseline model uses a support vector machine (SVM) algorithm that was previously trained and validated. We used the baseline model, as previously developed, and tested it on the new test data without changing any hyperparameter values. In addition, we re-trained the baseline SVM algorithm with augmented training data and tested it on the new test data.

We then developed new SVM and random forest (RF) models using the scikit-learn Python application programming interface. These models were trained using the new training data as well as the augmented training data. All models were first evaluated with default hyperparameters using 10-fold cross-validation. Hyperparameters for all models were tuned using scikit-learn’s grid search (GridSearchCV utility), with optimized models re-evaluated using 10-fold cross-validation. Finally, the best-performing models from cross-validation (one each for the new training data and the augmented training data) were trained on the entirety of the training data and tested for generalizability on the new test data.
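
A hedged sketch of this tuning workflow is shown below: a scikit-learn pipeline whose hyperparameters are searched with GridSearchCV under 10-fold cross-validation, then refit on the full training data. The parameter grid values and the train_texts/train_labels variables are assumptions for illustration, not the study’s exact search space.

# Sketch: hyperparameter tuning with grid search and 10-fold cross-validation.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SVC(kernel="poly")),
])

param_grid = {
    "counts__ngram_range": [(1, 1), (1, 3), (1, 6)],  # increasing n-grams
    "clf__C": [1, 10, 100],                           # misclassification cost
    "clf__degree": [3, 5, 7],                         # polynomial degree
}

search = GridSearchCV(pipeline, param_grid, cv=10)
# search.fit(train_texts, train_labels)   # assumed training corpus and labels
# best_model = search.best_estimator_     # refit on all training data by default
# predictions = best_model.predict(test_texts)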

SVM is a classification algorithm that represents samples as data points in high-dimensional space. The goal of the classification task in SVM is to identify hyperplanes with wide margins that separate these data points into categories. Data points close to the dividing plane are called support vectors. Hyperparameters include the kernel, which maps the data into a feature space in which, ideally, the observations are more easily separable. Standard kernels include a linear kernel, a polynomial kernel, and a radial basis function kernel. For the polynomial kernel, additional hyperparameters include the degree of the polynomial (i.e., degree) and the coefficient (i.e., “coef”), which controls how much the model is influenced by high-degree polynomials. Finally, the cost hyperparameter (i.e., “c”) represents the cost assigned by the algorithm to misclassifying data points when building the model.
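
As an illustration of how these hyperparameters map onto scikit-learn’s SVC arguments (an assumption about the implementation), the values reported for the baseline model in Table 2 would translate roughly as follows.

# Sketch: the baseline model's reported SVM hyperparameters expressed as SVC arguments.
from sklearn.svm import SVC

baseline_like_svm = SVC(
    kernel="poly",  # polynomial kernel ("poly")
    degree=7,       # polynomial degree
    coef0=10,       # "coef": influence of high-degree polynomial terms
    C=100,          # "c": cost assigned to misclassification
)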

The RF classification algorithm relies on an ensemble of decision trees, averaging their results to make a decision. Each decision tree learns to make predictions based on a random subset of input features, splitting data points through hierarchical nodes (e.g., by minimizing entropy) until reaching a classification. Hyperparameters include the number of decision trees used in the random forest (i.e., estimators). We also tuned the minimum number of samples required at a leaf node (i.e., min_samples_leaf) and the number of features to consider when looking for the best split (i.e., max_features), which can be the square root or logarithm of the number of available features, or all of them (i.e., “none”, no transformation). We used Gini impurity as the splitting criterion.
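
A comparable illustrative mapping for the random forest hyperparameters, using the values reported for the augmented-data RF model in Table 2, might look as follows; treat it as a sketch rather than the authors’ code.

# Sketch: random forest hyperparameters expressed as RandomForestClassifier arguments.
from sklearn.ensemble import RandomForestClassifier

rf_like_model = RandomForestClassifier(
    n_estimators=50,      # number of decision trees ("estimators")
    min_samples_leaf=5,   # minimum samples required at a leaf node
    max_features=None,    # "none": consider all available features at each split
    criterion="gini",     # Gini impurity as the splitting criterion
)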

Validation Framework

All final model performances were evaluated on the test data (n=485) using precision, recall, and F1 score. Precision, also known as positive predictive value, is the fraction of true positive instances among the retrieved instances. Recall, also known as sensitivity, is the fraction of true positive instances among all actual positive instances. F1 score is the harmonic mean of precision and recall. As opposed to the arithmetic mean, the harmonic mean is used for rates because it minimizes the effect of extreme values. It is computed as the reciprocal of the arithmetic mean of the reciprocals of precision and recall.
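
A small sketch relating these measures, with toy predictions rather than study data, is shown below.

# Sketch: precision, recall, and F1 on toy binary labels
# (1 = report contains a follow-up recommendation).
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
assert abs(f1 - 2 * precision * recall / (precision + recall)) < 1e-9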

We computed 95% confidence intervals around precision and recall. We compared the recall of all models to the baseline model using McNemar’s test.
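
For reference, a paired McNemar comparison of two models’ predictions on the same cases can be computed with statsmodels; the contingency counts below are made up for illustration.

# Sketch: McNemar's test on paired model predictions for the same test reports
# (here restricted to reports that truly contain a follow-up recommendation,
# so the comparison reflects recall).
from statsmodels.stats.contingency_tables import mcnemar

# Rows: baseline correct / baseline missed; columns: new model correct / new model missed.
table = [[30, 4],    # both detected | only baseline detected
         [12, 10]]   # only new model detected | both missed
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(result.statistic, result.pvalue)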

Performance of Baseline Model over Time

We measured the impact of model drift on our baseline model using test data from successive years (2018, 2019, and 2020). In addition, we measured the follow-up recommendation rate annually.

Results

Report Cohorts

The training set consisted of reports from 295 X-ray, 217 CT, 121 MR, 121 US, and 146 nuclear medicine examinations. The test set included 148 X-ray, 119 CT, 60 MR, 74 US, and 84 nuclear medicine examinations. The characteristics of the training and test sets are shown in Table 1.

Table 1:

Characteristics of Training and Test Data Sets

Characteristic Training [n=900], n (%) Test [n=485], n (%)
Patient Age (y), mean [sd] 59.2 [17.4] 58.7 [17.2]
Patient Sex
 Female 522 (58.0) 281 (57.9)
 Male 378 (42.0) 204 (42.1)
Care Setting
 Outpatient 635 (70.6) 353 (72.8)
 Inpatient 163 (18.1) 91 (18.8)
 Emergency Department 102 (11.3) 41 (8.4)
Modality
 X-ray 295 (32.8) 148 (30.5)
 CT 217 (24.1) 119 (24.5)
 MR 121 (13.4) 60 (12.4)
 US 121 (13.4) 74 (15.3)
 Nuclear Medicine 146 (16.2) 84 (17.3)
Follow-up recommendations 101 (11.2) 56 (11.5)

sd: standard deviation

The sixty reports annotated by two annotators had a percentage agreement of 98.3% and a kappa statistic of 0.97 (95% confidence interval: 0.90–1.00).

Model Performances

The baseline model’s performance on the new test data showed precision of 0.86 and recall of 0.66, with an F1 score of 0.75. When the baseline model was retrained with augmented data that included the more recently annotated training data, precision and recall were 0.83 and 0.54, respectively. Both decreased, although the decreases were not statistically significant. Among the newly trained RF and SVM algorithms, only one showed significantly improved performance compared with the baseline model: the RF model trained using augmented data (Table 2). This model had 50 estimators and was optimized with a 1- to 6-gram bag-of-words representation and five minimum samples per leaf node.

Table 2:

Results Comparing Models on Test Data (n=485)a

Model # Reports for Training Algorithm Hyperparameters Precision (n/N, [95% Confidence Interval]) Recall (n/N, [95% Confidence Interval]) p-value** F1
Baseline 1000b SVM c=100, coef=10, poly, degree=7, n-gram (1,3) 37/43 (0.86, [0.73–0.93]) 37/56 (0.66, [0.52–0.78]) -- 0.75
Baseline 1900c SVM c=100, coef=10, poly, degree=7, n-gram (1,3) 30/36 (0.83, [0.68–0.92]) 30/56 (0.54, [0.40–0.67]) 0.07 0.65
New 900a SVM c=100, coef=10, poly, degree=3, n-gram (1,1) 27/32 (0.84, [0.68–0.93]) 27/56 (0.48, [0.35–0.62]) 0.02# 0.61
New 1900c SVM c=1, coef=10, poly, degree=7, n-gram (1,6) 34/53 (0.64, [0.52–0.73]) 34/56 (0.61, [0.47–0.74]) 0.61 0.62
New 900a RF max_features=none, min_samples_leaf=5, estimators=10, n-gram (2,2) 32/39 (0.82, [0.68–0.91]) 32/56 (0.57, [0.43–0.70]) 0.38 0.67
New 1900c RF max_features=none, min_samples_leaf=5, estimators=50, n-gram (1,6) 45/50 (0.90, [0.79–0.96]) 45/56 (0.80, [0.68–0.90]) 0.04# 0.85
a: new data (1/1/2019 to 12/31/2020)

b: old data (1/1/2016–12/31/2016)

c: augmented data (combined old data and new data)

**: McNemar’s test

#: statistically significant

c: cost for misclassification

coef: polynomial coefficient

poly: polynomial kernel

degree: polynomial degree

max_features: number of features to consider at each split

min_samples_leaf: minimum number of samples at a leaf node

estimators: number of decision trees used in the random forest

Compared with the baseline model (precision=0.86), the RF model had precision of 0.90 (p=0.75) and better recall at 0.80 (baseline recall=0.66, p=0.04). The F1 score was 0.85.

Baseline Model Performance over Time

Precision and recall of the baseline machine learning model over time are illustrated in Figure 1. Precision steadily decreased, although changes in recall were more pronounced.

Figure 1: Baseline Model Performance over Time.

Rate of Follow-up Recommendations

Manual annotation revealed that 157 of 1,385 reports (11.3%) contained follow-up recommendations. The follow-up rate was 11.2% in the training set and 11.5% in the test set. Annual follow-up rates since 2016 are shown in Table 3. They fluctuated between 9.0% and 15.5% from 2018 to 2020, although none differed significantly from the 12.7% follow-up rate in 2016.

Table 3:

Baseline Model Performance over Time

Year Precision [95% Confidence Interval] Recall [95% Confidence Interval] Follow-up % p-value
2016* 0.88 [0.66–0.97] 0.82 [0.57–0.96] 12.7% Reference
2018 0.75 [0.47–0.91] 0.50 [0.26–0.74] 9.0% 0.14
2019 0.70 [0.48–0.86] 0.60 [0.36–0.81] 10.0% 0.29
2020 0.68 [0.52–0.80] 0.68 [0.49–0.83] 15.5% 0.29

*: previously reported (32)

Discussion

A machine learning model for predicting diagnostic imaging follow-up degraded over time, resulting in lower precision and recall. A newly developed RF model trained with augmented data achieved significantly better recall with comparable precision. This model restored performance similar to that of the original model when first developed, which is desirable when using artificial intelligence (AI) to support and monitor quality initiatives in real time.

Monitoring follow-up recommendations is informative for evaluating guideline adherence,(25) provider agreement,(8, 18) follow-up compliance,(26, 27) and ultimately, outcomes of timely diagnosis and provider/patient satisfaction.(28, 29) Use of machine learning is critical in automatic assessment of radiology report follow-up recommendations, given the large number of imaging tests performed in the United States each year.(30) Seamless integration into electronic health record systems can reduce unwarranted variation in radiologists’ follow-up recommendations, and thus reduce varied follow-up imaging, additional physician consultations, and increased health care costs.(31)

When using machine learning models to monitor follow-up recommendations, it is important to monitor for model drift and to address the drift when it exists. In a previous study, the authors recommended only modest model recalibration when there are case mix shifts but recommended intercept correction or completely refitting the model when there are shifts in predictor-outcome associations or outcome rates.(10) In our case, the outcome rate did not change; follow-up recommendations were present in 11.3% of radiology reports, unchanged from the 12.7% reported previously in 2016.(32) Although we attempted to retrain our previous model using more recent data, this did not improve the model’s diminished performance. Even with no change in outcome rates, we had to train a completely new model.

A new RF algorithm performed better than a retrained SVM algorithm for identifying follow-up recommendations. As demonstrated previously, machine learning models are known to vary in performance across different tasks.(32–35) This highlights the need to comprehensively and methodically try different approaches, including recalibrating the same model and refitting the model with new data. Data analytic or machine learning expertise is necessary to perform model updating (i.e., refitting or recalibration). In general, when only small amounts of data are available for updating, modest recalibration is recommended. When the data are large enough (i.e., the updating data set is larger than the original training data), it is best to refit the model. Finally, retraining a new model may be necessary when clinically acceptable performance is not achieved with refitting or recalibration. Retraining a completely new model is more labor- and time-intensive and requires greater expertise than updating.

Ideally, systems for oversight and maintenance should be in place to monitor machine learning model performance over time.(31) However, this is not always possible given competing needs for informatics infrastructure in healthcare.(36–38) Therefore, we suggest that machine learning models be reviewed and revalidated whenever there are: 1) changes in the population distribution, 2) changes in the data surrounding the model (e.g., a new evaluative test prior to an imaging examination),(39) 3) data quality issues (e.g., standardizing expression of follow-up uncertainty),(40) or 4) changes in the distribution or rate of the target outcome. In addition, previous studies demonstrate model degradation over periods ranging from 3–12 months(41, 42) (depending on the data used for training and on model complexity) to up to 3 years.(10, 43) Drift detection systems can be programmed to monitor model drift between scheduled maintenance points for models used in clinical practice.(9) Thus, monitoring processes and time intervals need to be explicitly defined whenever new algorithms are developed or acquired for clinical practice or quality initiatives so that model drift can be addressed.
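
As a minimal sketch of such a scheduled drift check, assuming a small batch of recently annotated reports is available at each maintenance point, current performance could be compared against a predefined threshold; the threshold and function names below are illustrative assumptions.

# Sketch: flag a deployed model for review when recall on recent labeled data
# drops below a clinically acceptable floor.
from sklearn.metrics import precision_score, recall_score

RECALL_THRESHOLD = 0.70  # hypothetical acceptable floor; set per local requirements

def check_for_drift(model, recent_texts, recent_labels):
    """Return current precision/recall and whether the model needs review."""
    predictions = model.predict(recent_texts)
    recall = recall_score(recent_labels, predictions)
    precision = precision_score(recent_labels, predictions)
    return {
        "precision": precision,
        "recall": recall,
        "needs_review": recall < RECALL_THRESHOLD,
    }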

Retraining machine learning models requires manual and time-consuming review of reports. However, model drift may be addressed by periodically collecting outcomes from models that are implemented operationally over time. Facilitating prospective annotation of these predicted outcomes can provide model oversight and expedite updating of these models. Data annotation has historically been the bottleneck in AI development,(41, 44) and ensuring availability of labeled outcomes data will enhance our capability to closely monitor AI performance. In addition to the lack of labeled data, another impediment to robust algorithms is unstructured and unreliable data from medical records, including radiology reports. Ensuring that AI algorithms can use continuous learning approaches by self-selecting features from unstructured data may mitigate the need for frequent model refitting or recalibration.(41) Instead, continuous learning AI algorithms can be trained with larger data sets from longitudinally acquired data, utilizing unstructured features and readily available labeled outcomes. Monitoring and model enhancements can be accomplished with built-in anomaly and out-of-distribution detectors for model features and outcomes.(41)

Anomaly detection and out-of-distribution detection algorithms have been used with various AI algorithms. Out-of-distribution detection algorithms note changes in data distributions, while anomaly detection flags outcomes or features that deviate significantly from the norm. In our case, we did not note out-of-distribution or anomalous outcomes. Rather, we noted significant degradation in accuracy. We monitored precision and recall of the machine learning models, which are standard accuracy measures in information extraction.(45)

Limitations of this study include model development and evaluation at a single academic medical center, which may limit generalizability. In addition, our model was trained only to identify follow-up recommendations when they are documented in reports. The actual clinical finding being followed, the follow-up time frame, and the suggested examination modality are not identified, although these would be needed for future quality initiatives targeting follow-up timeliness and diagnostic delays. Identifying the clinical findings being followed and other follow-up details could affect model accuracy and potentially contribute to model drift. Finally, we did not train models with other machine learning algorithms (e.g., neural networks) or newer deep learning algorithms,(46, 47) or with increasing amounts of training data, any of which may further enhance predictive performance.

Take-Home Points.

  • Machine learning models being utilized to monitor follow-up recommendations in radiology reports suffer from model drift over time.

  • Machine learning models require regular oversight, and over time these models must be updated using more recent data.

  • Although recalibrating or refitting models with new data may be sufficient in some cases, training and validating new machine learning algorithms may be necessary in others to address model drift and ensure predictive model performance in identifying outcomes such as documentation of follow-up recommendations in radiology reports.

REFERENCES

1. IOM. National Academies of Medicine. Improving Diagnosis in Health Care. http://www.nationalacademies.org/hmd/~/media/Files/Report%20Files/2015/Improving-Diagnosis/DiagnosticError_ReportBrief.pdf. Last accessed April 2019. 2015.
2. Meth MJ, Maibach HI. Current understanding of contrast media reactions and implications for clinical management. Drug Saf. 2006;29(2):133–41.
3. Zou Z, Zhang HL, Roditi GH, Leiner T, Kucharczyk W, Prince MR. Nephrogenic systemic fibrosis: review of 370 biopsy-confirmed cases. JACC Cardiovasc Imaging. 2011;4(11):1206–16.
4. Sodickson A, Baeyens PF, Andriole KP, Prevedello LM, Nawfel RD, Hanson R, et al. Recurrent CT, cumulative radiation exposure, and associated radiation-induced cancer risks from CT of adults. Radiology. 2009;251(1):175–84.
5. Griffey RT, Sodickson A. Cumulative radiation exposure and cancer risk estimates in emergency department patients undergoing repeat or multiple CT. AJR Am J Roentgenol. 2009;192(4):887–92.
6. Varada S, Lacson R, Raja AS, Ip IK, Schneider L, Osterbur D, et al. Characteristics of knowledge content in a curated online evidence library. J Am Med Inform Assoc. 2018;25(5):507–14.
7. Lacson R, Raja AS, Osterbur D, Ip I, Schneider L, Bain P, et al. Assessing Strength of Evidence of Appropriate Use Criteria for Diagnostic Imaging Examinations. J Am Med Inform Assoc. 2016;23(3):649–53.
8. Cochon LR, Kapoor N, Carrodeguas E, Ip IK, Lacson R, Boland G, et al. Variation in Follow-up Imaging Recommendations in Radiology Reports: Patient, Modality, and Radiologist Predictors. Radiology. 2019:182826.
9. Davis SE, Greevy RA, Lasko TA, Walsh CG, Matheny ME. Detection of calibration drift in clinical prediction models to inform model updating. J Biomed Inform. 2020;112:103611.
10. Davis SE, Greevy RA, Fonnesbeck C, Lasko TA, Walsh CG, Matheny ME. A nonparametric updating method to correct clinical prediction model drift. J Am Med Inform Assoc. 2019;26(12):1448–57.
11. Liu A, Lu J, Zhang G. Concept Drift Detection via Equal Intensity k-Means Space Partitioning. IEEE Trans Cybern. 2021;51(6):3198–211.
12. Tripathi S, Muhr D, Brunner M, Jodlbauer H, Dehmer M, Emmert-Streib F. Ensuring the Robustness and Reliability of Data-Driven Knowledge Discovery Models in Production and Manufacturing. Front Artif Intell. 2021;4:576892.
13. Jenkins DA, Sperrin M, Martin GP, Peek N. Dynamic models to predict health outcomes: current status and methodological challenges. Diagn Progn Res. 2018;2:23.
14. Wang K, Lu J, Liu A, Zhang G, Xiong L. Evolving Gradient Boost: A Pruning Scheme Based on Loss Improvement Ratio for Learning Under Concept Drift. IEEE Trans Cybern. 2021;PP.
15. Ross MK, Wei W, Ohno-Machado L. “Big data” and the electronic health record. Yearb Med Inform. 2014;9(1):97–104.
16. Hammer MM, Kapoor N, Desai SP, Sivashanker KS, Lacson R, Demers JP, et al. Adoption of a Closed-Loop Communication Tool to Establish and Execute a Collaborative Follow-Up Plan for Incidental Pulmonary Nodules. AJR Am J Roentgenol. 2019:1–5.
17. Uhlig K, Berns JS, Carville S, Chan W, Cheung M, Guyatt GH, et al. Recommendations for kidney disease guideline updating: a report by the KDIGO Methods Committee. Kidney Int. 2016;89(4):753–60.
18. Nair A, Bartlett EC, Walsh SLF, Wells AU, Navani N, Hardavella G, et al. Variable radiological lung nodule evaluation leads to divergent management recommendations. Eur Respir J. 2018;52(6).
19. Hammer MM, Palazzo LL, Kong CY, Hunsaker AR. Cancer Risk in Subsolid Nodules in the National Lung Screening Trial. Radiology. 2019;293(2):441–8.
20. Toll DB, Janssen KJ, Vergouwe Y, Moons KG. Validation, updating and impact of clinical prediction rules: a review. J Clin Epidemiol. 2008;61(11):1085–94.
21. AlQabbany AO, Azmi AM. Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams. Entropy (Basel). 2021;23(7).
22. Jameel SM, Hashmani MA, Rehman M, Budiman A. An Adaptive Deep Learning Framework for Dynamic Image Classification in the Internet of Things Environment. Sensors (Basel). 2020;20(20).
23. Guo H, Zhang S, Wang W. Selective ensemble-based online adaptive deep neural networks for streaming data with concept drift. Neural Netw. 2021;142:437–56.
24. Pedregosa F, Weiss R, Brucher M. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
25. Lacson R, Prevedello LM, Andriole KP, Gill R, Lenoci-Edwards J, Roy C, et al. Factors associated with radiologists’ adherence to Fleischner Society guidelines for management of pulmonary nodules. J Am Coll Radiol. 2012;9(7):468–73.
26. Desai SP, Jajoo K, Taber K, Chukwu A, Emani S, Neville BA, et al. A Quality Improvement Intervention Leveraging a Safety Net Model for Surveillance Colonoscopy Completion. Am J Med Qual. 2021.
27. Desai S, Kapoor N, Hammer MM, Levie A, Sivashanker K, Lacson R, et al. RADAR: A Closed-Loop Quality Improvement Initiative Leveraging A Safety Net Model for Incidental Pulmonary Nodule Management. Jt Comm J Qual Patient Saf. 2021;47(5):275–81.
28. Kripalani S, LeFevre F, Phillips CO, Williams MV, Basaviah P, Baker DW. Deficits in communication and information transfer between hospital-based and primary care physicians: implications for patient safety and continuity of care. JAMA. 2007;297(8):831–41.
29. Singal AG, Gupta S, Skinner CS, Ahn C, Santini NO, Agrawal D, et al. Effect of Colonoscopy Outreach vs Fecal Immunochemical Test Outreach on Colorectal Cancer Screening Completion: A Randomized Clinical Trial. JAMA. 2017;318(9):806–15.
30. Smith-Bindman R, Miglioretti DL, Johnson E, Lee C, Feigelson HS, Flynn M, et al. Use of diagnostic imaging studies and associated radiation exposure for patients enrolled in large integrated health care systems, 1996–2010. JAMA. 2012;307(22):2400–9.
31. Kapoor N, Lacson R, Khorasani R. Workflow Applications of Artificial Intelligence in Radiology and an Overview of Available Tools. J Am Coll Radiol. 2020;17(11):1363–70.
32. Carrodeguas E, Lacson R, Swanson W, Khorasani R. Use of Machine Learning to Identify Follow-Up Recommendations in Radiology Reports. J Am Coll Radiol. 2019;16(3):336–43.
33. Hu C, Anjur V, Saboo K, Reddy KR, O’Leary J, Tandon P, et al. Low Predictability of Readmissions and Death Using Machine Learning in Cirrhosis. Am J Gastroenterol. 2021;116(2):336–46.
34. Arvind V, London DA, Cirino C, Keswani A, Cagle PJ. Comparison of machine learning techniques to predict unplanned readmission following total shoulder arthroplasty. J Shoulder Elbow Surg. 2021;30(2):e50–e9.
35. Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak. 2019;19(1):281.
36. Institute of Medicine (US). Digital Infrastructure for the Learning Health System: The Foundation for Continuous Improvement in Health and Health Care: Workshop Series Summary. 2011.
37. Mollura DJ, Soroosh G, Culp MP, Group R-ACW. 2016 RAD-AID Conference on International Radiology for Developing Countries: Gaps, Growth, and United Nations Sustainable Development Goals. J Am Coll Radiol. 2017;14(6):841–7.
38. Rubin DL. Informatics in radiology: Measuring and improving quality in radiology: meeting the challenge with informatics. Radiographics. 2011;31(6):1511–27.
39. Wells PS, Anderson DR, Rodger M, Ginsberg JS, Kearon C, Gent M, et al. Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: increasing the models utility with the SimpliRED D-dimer. Thromb Haemost. 2000;83(3):416–20.
40. Shinagare AB, Alper DP, Hashemi SR, Chai JL, Hammer MM, Boland GW, et al. Early Adoption of a Certainty Scale to Improve Diagnostic Certainty Communication. J Am Coll Radiol. 2020;17(10):1276–84.
41. Pianykh OS, Langs G, Dewey M, Enzmann DR, Herold CJ, Schoenberg SO, et al. Continuous Learning AI in Radiology: Implementation Principles and Early Applications. Radiology. 2020;297(1):6–14.
42. Jin R, Furnary AP, Fine SC, Blackstone EH, Grunkemeier GL. Using Society of Thoracic Surgeons risk models for risk-adjusting cardiac surgery results. Ann Thorac Surg. 2010;89(3):677–82.
43. Siregar S, Nieboer D, Vergouwe Y, Versteegh MI, Noyez L, Vonk AB, et al. Improved Prediction by Dynamic Modeling: An Exploratory Study in the Adult Cardiac Surgery Database of the Netherlands Association for Cardio-Thoracic Surgery. Circ Cardiovasc Qual Outcomes. 2016;9(2):171–81.
44. Spasic I, Nenadic G. Clinical Text Data in Machine Learning: Systematic Review. JMIR Med Inform. 2020;8(3):e17984.
45. Hersh W. Evaluation of biomedical text-mining systems: lessons learned from information retrieval. Brief Bioinform. 2005;6(4):344–56.
46. Korngiebel DM, Mooney SD. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. NPJ Digit Med. 2021;4(1):93.
47. Yang X, Bian J, Hogan WR, Wu Y. Clinical concept extraction using transformers. J Am Med Inform Assoc. 2020;27(12):1935–42.
