Abstract
Delirium is an acute neurocognitive disorder that is difficult to identify and predict. Using GEMINI (footnote 1), Canada’s largest hospital data and analytics study, we obtained a labeled sample of around 4,000 cases, approximately 25% of which were labeled as having delirium. Based on this labeled data, we developed machine learning (ML) models and interacted with physicians to interpret the ML models and their predictions. We developed a preliminary Explainable Artificial Intelligence (XAI) framework for physician experience design (PXD) to improve the transparency of model results, thereby increasing physician trust in models as well as the uptake of model results for clinical decision making. We first developed our PXD approach through a Conceptual Investigation to collect and extract physicians’ feedback on ML models and their evaluation requirements. We then carried out a case study, working closely with the physicians in a participatory design process to develop a dashboard (footnote 2) that presents ML delirium identification results interactively based on physician selections and inputs. In this approach, a physician-preferred ML model for clinical decision making is selected through PXD evaluation.
1. Introduction
Our discussion of physician experience design (PXD) will be framed around a delirium prediction case study. Delirium is described as “acute brain failure” and is considered both a “medical emergency” and a “quiet epidemic” [1, 2]. It is the most common neuropsychiatric condition in medically ill and hospitalized patients [3] and is recognized as a quality of care indicator in Canada and the United States (U.S.) [4, 5]. Symptoms of delirium can be severe and distressing for both patients and caregivers and result from a complex interaction between predisposing and precipitating factors [1, 2]. Delirium affects up to 50% of older hospital patients, and those with delirium are more than twice as likely to die in hospital or require nursing home placement [6, 7]. The long-term effects of delirium are serious, as it is associated with worsening cognitive impairment and incident dementia [7, 8]. Patients with delirium have longer hospitalizations, increased readmission rates, and more than double the healthcare costs. More recent estimates suggest that it accounts for $183 billion of annual healthcare expenditures in the U.S. [8]. With the growing availability of electronic clinical data repositories such as the one used in this study, methods such as data mining and machine learning can support clinical decision making. However, in order to be useful, machine learning (ML) predictions need to be incorporated into clinical workflow. In practice, this means that physicians should find predictions, and the manner in which they are presented, to be useful and usable. While there has been considerable discussion of how to improve Human-AI interaction [9] using methods such as explainable AI (XAI), there has been little if any discussion of how to improve the user experience for physicians when they interact with ML models.
There is, however, relevant prior research concerning the physician user experience when retrieving and viewing clinically relevant information on a tablet or handheld device. For instance, the authors in [10] found that physicians preferred the use of white space, alternating shading of rows, and tabular formats for information presentation. The authors in [11] found that the preferred form factor for viewing relevant clinical evidence depended on physician roles. For instance, family physicians preferred tablets, while residents preferred a more mobile form factor that could fit easily into a pocket.
Additional motivation for PXD is provided by examples from human factors engineering. For instance, there were many crashes in World War II pilot training due to pilots misreading the height displayed by the altimeter. In the same way that altimeters need to be designed so that they are interpretable by the pilots who use them, ML prediction results need to be presented and framed in a way that will make them understandable and usable for physicians. Another key lesson from human factors in aviation is that, short of full automation where the pilot is no longer needed, the pilot as supervisory controller needs to remain “in the loop” so that the human operator can take over when the automation fails, or when manual input is needed [9]. This problem of designing appropriate user interfaces for highly, but not fully, automated systems is also one that threatens the safety of highly automated vehicles [12].
Clinical decision making is more like a supervisory control task than a continuous vehicular control task. Clinical decision making also has a feature where there is a division of roles between nurses who continuously monitor the patient and provide planned inputs, and physicians who consult with the patient and receive reports from time to time, ordering tests or interventions as needed. If ML prediction is to be more than an academic exercise, then prediction needs to be effectively integrated into clinical decision making and physician workflow. Since ML prediction is not now, and probably never will be, 100% correct, physicians need information about the quality and uncertainty of prediction being provided to them.
Unfortunately, automated machine learning (aML) methods such as deep learning often have a hidden decision making process and lack transparency. Inspired by the benefits of putting the human in the loop, XAI was proposed to help human users understand how ML models reach their outputs, trust the results, and effectively manage the decision making process [13].
As we worked with the physicians in our research study we found that their informational needs went well beyond traditional evaluation metrics such as accuracy, sensitivity and specificity, F1, or Area under the ROC curve. In this paper we focus on the questions and concerns raised by the physicians while reviewing the output of ML models for identifying delirium. Inspired by the user experience design (UXD) approach we derive a preliminary model of PXD, within XAI, that is built around the key information that physicians need in their clinical decision making.
This work was carried out as part of a project to efficiently identify delirium cases during hospitalization using all data available from admission to discharge. We initially focused on the methodological goal of demonstrating the value of incorporating physician expertise in evaluating the delirium identification performance of ML models.
Figure 1 summarizes our workflow for incorporating physicians in ML model evaluation. It has two major parts: Conceptual Investigation (CI), for studying the physicians’ needs when evaluating ML models, and Technical Investigation (TI), for implementing a platform (a dashboard in our case) that presents the physician-selected results based on information extracted from the CI. Instead of presenting physicians with the results that ML experts prefer, we show the delirium identification results based on physician preferences. The PXD approach enables physicians to evaluate model performance in an interactive way and thus makes ML models, with corresponding results, interpretable and usable. In the remainder of this paper we report on the lessons learned during this work with respect to the needs that physicians have for an appropriate PXD when incorporating ML models for delirium identification into their clinical decision making. The main contributions of this paper are as follows:
- A physician experience design (PXD) approach is proposed within the XAI framework for making ML prediction models interpretable and usable in the context of clinical decision making.
- Methods are proposed for collecting and extracting data to satisfy physicians’ requirements for model evaluation.
- Methods are proposed for reporting model evaluation metrics within the PXD framework.
- Methods are proposed for reporting the calibration and bias of models.
- Methods are proposed for assessing the stability of models, and evaluation metrics, over time.
- A PXD dashboard is developed to facilitate review of key aspects of model prediction during clinical decision making, using radar, calibration, and temporal stability charts that are selected through dropdown menus.
Figure 1:
Scheme of making machine learning models more usable integrating physician expertise with physician experience design (PXD).
2. Materials & Methods
2.1. GEMINI Study
GEMINI is a unique big data collaborative supporting cutting-edge quality improvement and research projects [14, 15, 16, 17]. It includes 370,000+ patient admissions from 207,000+ unique patients and 20+ hospitals across Ontario, Canada and it contains over a billion data points. A rigorous internal validation process has demonstrated 98-100% accuracy across key data types [18]. In this study we focus on GEMINI data from six large hospitals (St. Michael’s Hospital, Toronto General Hospital, Toronto Western Hospital, Trillium Credit Valley Hospital, Trillium Mississauga Hospital and Sunnybrook Hospital) containing data on all hospitalizations to general internal medicine (N= 240,000), in those six hospitals, from 2010-2017.
2.1.1. Research Ethical Review Board (REB) Approval
The Research Ethics Review Board (REB) at the Toronto Academic Health Science Network approved the GEMINI study on 08/31/2019 with REB reference number 15-087. The extension of the REB approval was issued on 08/17/2020 by the Unity Health Toronto REB under the same reference number 15-087.
Our paper is also part of the GEMINI sub-study, named “Using artificial intelligence to identify and predict delirium among hospitalized medical patients”, which was approved by University of Toronto (UofT) REB on 10/15/2019 with RIS Protocol Number 38377. The UofT REB approved the renewal of this sub-study on 09/01/2020 under the same reference number 38377.
2.1.2. GEMINI Data Set
In GEMINI, administrative health data are linked with clinical data extracted from hospital information systems at the individual patient level (Figure 2).
Administrative Data: Patient-level characteristics are collected from hospitals as reported to the Canadian Institute for Health Information Discharge Abstract Database (CIHI-DAD) and the National Ambulatory Care Reporting System (NACRS). Diagnosis data and interventions are coded using the enhanced Canadian International Statistical Classification of Diseases and Related Health Problems (ICD-10-CA) and the Canadian Classification of Health Interventions.
Clinical Data: Data from the electronic information systems in GEMINI include: laboratory test results (biochemistry, hematology, microbiology), blood transfusions, in-hospital medications, vital signs, room transfers, and routine clinical monitoring. The quality of key elements of these data has been ensured through statistical quality control processes and direct data validation [18]. GEMINI data extraction methods allow access to a wealth of data ideal for text processing methods, including radiologist reports of diagnostic imaging.
Figure 2:
Data Contained in GEMINI project.
The delirium cases were identified through manual medical record review by trained medical professionals using a validated method [19] in GEMINI. Developed by [20], this method has good sensitivity (74%) and specificity (83%) compared to clinical assessment and is considered a suitable gold-standard for identification of delirium for research and quality improvement [20]. Inter-rater reliability was assessed by having 5% of the charts blindly reviewed by a second abstractor, achieving 90% inter-rater reliability.
We used 11 data files from a subset of the GEMINI data set, which contained 3,862 hospital admissions that were labelled with delirium status (positive or negative). The data files include clinical and administrative data as described below.
2.2. Machine Learning (ML) Models Construction and Training for Delirium Status Identification
The scheme shown in Figure 1 was run with 12 supervised classification algorithms with the task of predicting delirium status. The 12 machine learning (ML) algorithms used, covering most categories of ML models used, were:
Ensemble ML models: Gradient Boosting Classifier (GBC); AdaBoost Classifier (ABC); Random Forest (RF); Voting Classifier Soft (VC-S).
Non-parametric ML models: k Nearest Neighbors (kNN); Decision Tree (DT).
Linear-parametric ML models: Logistic Regression (LR); Linear Support Vector Machine (LSVM); Linear Discriminant Analysis (LDA).
Non-linear parametric ML models: Quadratic Discriminant Analysis (QDA); Neural Network (NNW): Multi-layer Perceptron classifier in deep learning.
Bayesian-based ML models: Gaussian Naïve Bayes (GNB).
For the modeling, we split our integrated complete data into two parts, a training set and a testing set. As shown in Figure 3, the data extended over a five-year period, from 04/01/2010 to 03/31/2015. We divided this period into 10 six-month segments. We treated the first 9 segments, i.e., 04/01/2010 to 09/30/2014, as the training set. The last six-month period, i.e., 10/01/2014 to 03/31/2015, was used as hold-out data (i.e., the testing set) to test our models’ performance in making a prospective prediction. Cycling through different hold-out sets over time allowed us to assess whether there was any non-stationarity in the data, which would affect our ability to predict delirium in the future based on models developed on currently available data. In the training set, we used 5-fold cross validation to tune the model parameters for each of the 12 machine learning algorithms. We then used the tuned parameters from the 5-fold cross validation to identify the delirium status of each admission in the testing/hold-out set.
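The segmentation described above can be sketched as follows. This is a minimal illustration rather than the study's code: the helper names `six_month_segments` and `segment_index` and the sample admission dates are hypothetical, and segment end dates are treated as exclusive.

```python
from datetime import date

def six_month_segments(start_year, start_month, n_segments):
    """Return (start, end) pairs for consecutive six-month segments (end exclusive)."""
    segments = []
    year, month = start_year, start_month
    for _ in range(n_segments):
        seg_start = date(year, month, 1)
        month += 6
        if month > 12:  # roll over into the next calendar year
            month -= 12
            year += 1
        segments.append((seg_start, date(year, month, 1)))
    return segments

def segment_index(admit_date, segments):
    """Return the index of the segment containing admit_date, or None if outside."""
    for i, (start, end) in enumerate(segments):
        if start <= admit_date < end:
            return i
    return None

# Ten segments covering 04/01/2010 through 03/31/2015.
segments = six_month_segments(2010, 4, 10)

# Hypothetical admission dates, used only to show the train/hold-out assignment.
admissions = [date(2010, 4, 1), date(2012, 7, 15), date(2014, 9, 30),
              date(2014, 10, 1), date(2015, 3, 31)]
train = [d for d in admissions if segment_index(d, segments) < 9]   # first 9 segments
test = [d for d in admissions if segment_index(d, segments) == 9]   # last segment
```

Assigning each admission a segment index once makes it straightforward to reuse the same segments later when a different segment is held out.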
Figure 3:
Data splits for models training and testing on a rolling basis.
2.3. Commonly Used ML Model Evaluation Metrics
We then tested the model performance on the hold-out testing set and calculated six evaluation metrics, i.e., accuracy, precision, recall/sensitivity, specificity, Area Under the Receiver Operating Characteristic Curve (ROC-AUC), F1 score.
Accuracy was calculated as the proportion of predicted labels that matched the corresponding ground truth labels. $\text{Precision} = \frac{TP}{TP + FP}$, where TP denotes the number of true positives and FP the number of false positives. $\text{Recall} = \frac{TP}{TP + FN}$, where FN represents the number of false negatives. $\text{Specificity} = \frac{TN}{TN + FP}$, where TN denotes the number of true negatives. The ROC curve was plotted using the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. ROC-AUC was obtained as the probability that our binary classifier ranked a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’). The F1 score is the harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.
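As an illustration of these definitions, the six metrics can be computed directly from labels and scores. The sketch below uses plain Python and a small made-up example; the function names, the 0.5 decision threshold, and the toy data are assumptions for illustration and are not taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = delirium)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def basic_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 3 positive and 5 negative admissions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

counts = confusion_counts(y_true, y_pred)
metrics = basic_metrics(*counts)
auc = roc_auc(y_true, scores)
```

Note that `basic_metrics` would divide by zero if a model predicted no positives at all; a production version would need to handle that case explicitly.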
2.4. Physician Experience Design (PXD) Evaluation
2.4.1. Physician Team
We interacted with medical delirium experts who address delirium in both their research and clinical work. Our physician team had four members (also co-authors on this paper): Dr. Fahad Razak is an internist and epidemiologist, with specialization in observational methods, ‘big data’ projects, and global health at St. Michael’s Hospital, Toronto, CA. Dr. Kathleen Sheehan is a staff psychiatrist with the medical psychiatry program at the Centre for Mental Health-University Health Network. Dr. Amol Verma is a staff physician in General Internal Medicine at St. Michael’s Hospital and a scientist in the Li Ka Shing Knowledge Institute. Dr. Andrew Pinto is a Public Health and Preventive Medicine specialist and family physician at St. Michael’s Hospital, and an Assistant Professor at the University of Toronto.
2.4.2. Conceptual Investigation (CI)
Working with the physicians we developed a questionnaire over several iterations. Questions of interest related to the types of evaluation metric that they found most useful, and how they would prefer different evaluation metrics, or combinations of evaluation metric, to be visualized. Sample questions are shown in Figure 4.
Figure 4:
Illustration figure of partial questionnaire in Conceptual Investigation (CI).
We collected physician responses and summarize their recommendations as follows: 1). Use all six evaluation metrics but also allow a way for the physician to focus on an evaluation metric of particular interest; 2). False alarm and miss rates should also be available for viewing (e.g., in a confusion matrix); 3). Measures of temporal stability of prediction should be shown both as plots of prediction performance over time and also as plots or tables of standard deviations of performance measures; 4). Calibration is an important issue and a visualization of calibration results should be available; 5). Radar charts (i.e., plots showing the values of all evaluation metrics in a single plot) can help physicians and other stakeholders see all the values together and compare them visually; 6). The F1 score (i.e., a balanced score of precision and recall/sensitivity) is the most useful single measure, but recall (sensitivity) is also of particular interest.
2.4.3. Technical Investigation (TI)
PXD Evaluations
Based on the CI, we added two metrics to the results table: 1). False positive rate, $\text{FPR} = \frac{FP}{FP + TN}$, also called the false alarm rate, which answers the question: of all the people without delirium, how many did we incorrectly predict to have it? 2). False negative rate, $\text{FNR} = \frac{FN}{FN + TP}$, also called the miss rate, which answers the question: of all the people with delirium, how many did we fail to identify?
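A short sketch of the two added rates, computed from confusion-matrix counts, is given here. The function name `error_rates` and the counts are hypothetical; the final lines simply check the standard identities FPR = 1 - specificity and FNR = 1 - recall.

```python
def error_rates(tp, fp, tn, fn):
    """False positive rate (false alarms) and false negative rate (misses)."""
    fpr = fp / (fp + tn)  # share of delirium-negative cases flagged as positive
    fnr = fn / (fn + tp)  # share of delirium-positive cases that were missed
    return fpr, fnr

# Hypothetical confusion-matrix counts for illustration only.
tp, fp, tn, fn = 2, 1, 4, 1
fpr, fnr = error_rates(tp, fp, tn, fn)

# The two rates are complements of specificity and recall, respectively.
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
```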
We also developed a stability measure integrating the aforementioned eight evaluation metrics, i.e., accuracy, precision, recall/sensitivity, specificity, ROC-AUC, F1 score, FPR and FNR. We calculated the standard deviation (SD) and range over the 10 time segments for each of the eight metrics, using their variation to show the stability of performance. We used the harmonic mean to show average performance across the metrics, i.e., $H = \frac{n}{\sum_{i=1}^{n} \frac{1}{m_i}}$, where $n$ is the total number of metrics and $m_i$ is the $i$th metric value.
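The stability summary can be sketched in a few lines. The paper does not state whether the sample or population SD was used, so the sample SD (`statistics.stdev`) is an assumption here, as are the function names and the per-segment F1 values, which are invented purely for illustration.

```python
import statistics

def stability_summary(values_by_segment):
    """SD and range of one metric across time segments (sample SD assumed)."""
    return {
        "sd": statistics.stdev(values_by_segment),
        "range": max(values_by_segment) - min(values_by_segment),
    }

def harmonic_mean(metric_values):
    """Harmonic mean H = n / sum(1 / m_i); defined only for positive values."""
    if any(m <= 0 for m in metric_values):
        raise ValueError("harmonic mean requires strictly positive values")
    return len(metric_values) / sum(1.0 / m for m in metric_values)

# Hypothetical F1 scores for the 10 six-month segments.
f1_by_segment = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.67, 0.72, 0.71]
spread = stability_summary(f1_by_segment)

# Harmonic mean of two illustrative metric values.
h = harmonic_mean([0.5, 1.0])
```

Because the harmonic mean is dominated by the smallest value, a metric that is close to zero pulls the summary down sharply, which is worth keeping in mind when interpreting it.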
We showed how well calibrated the predictions were by using reliability diagrams (also called calibration curves), with one example shown in Figure 7.(b). The dashed line is the perfectly calibrated line, and the curve around it is the calibration performance of the ML model (e.g., the blue curve in Figure 7.(b) is the calibration curve of the Gradient Boosting Classifier, abbreviated as GBC). When the curve falls below the diagonal, the ML model has over-forecast (estimating the delirium probability to be higher than it actually is), while a curve above the diagonal indicates under-forecasting. Thus the calibration plot indicates the amount of bias towards positive or negative labelling at different levels of probability for the positive label.
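The points of a reliability diagram can be obtained by binning predicted probabilities and comparing the mean prediction in each bin with the observed fraction of positives. The following is a minimal sketch with equal-width bins; the function names, the choice of five bins, and the toy data are assumptions and do not reproduce the dashboard's implementation.

```python
def reliability_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed positive fraction) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((t, p))
    points = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(t for t, _ in members) / len(members)
            points.append((mean_pred, frac_pos))
    return points

def forecast_bias(mean_pred, frac_pos):
    """Below the diagonal means over-forecasting; above means under-forecasting."""
    if frac_pos < mean_pred:
        return "over-forecast"
    if frac_pos > mean_pred:
        return "under-forecast"
    return "calibrated"

# Toy labels and predicted probabilities (hypothetical).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
points = reliability_points(y_true, y_prob)
labels = [forecast_bias(m, f) for m, f in points]
```

Plotting `points` against the diagonal from (0, 0) to (1, 1) gives the reliability diagram; scikit-learn's `calibration_curve` provides comparable binning for real data.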
Figure 7:
Performance of the best-performing ML model, Gradient Boosting Classifier (GBC), presented in the PXD dashboard.
The Radar chart is another visualization that we added in the TI stage, where multiple evaluation metrics are compared in the form of a 2D plot with values plotted on a set of axes radiating from a common origin. An example radar plot is shown in Figure 7.(c), displaying several evaluation metrics for the GBC model in a single plot.
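To make the geometry of a radar chart concrete, the sketch below places each metric on its own axis, spaced at equal angles around the origin, and returns the polygon vertices that a plotting library would connect. The function name and the metric values are hypothetical and are not results from the study.

```python
import math

def radar_vertices(values):
    """Map metric values to (x, y) points on equally spaced axes from a common origin."""
    n = len(values)
    vertices = []
    for i, r in enumerate(values):
        angle = 2 * math.pi * i / n  # axis i is rotated i/n of a full turn
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices

# Hypothetical metric values for a single model.
metrics = {
    "accuracy": 0.80,
    "precision": 0.65,
    "recall": 0.70,
    "specificity": 0.85,
    "roc_auc": 0.82,
    "f1": 0.67,
}
vertices = radar_vertices(list(metrics.values()))
```

Since every metric lies between 0 and 1, all axes share the same scale, which is what allows the polygon's shape to be compared visually across models.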
PXD Dashboard
For better interaction with physicians, we developed a PXD dashboard to present the ML model results. Based on physicians’ feedback from the CI and TI, we implemented three types of dashboard built into one overall interface, which can be accessed at https://pxd-dashboard.herokuapp.com/.
Figure 5 shows partial views of our PXD dashboard, where three dropdown menus allow users to customize their view of the results. The first dropdown allows a selection from different types of dashboard. The first type is PXD selected machine learning (ML) models. Only four ML models’ results were included in this type, based on the physicians’ feedback from CI. The second type of dashboard is a comparison of 12 ML Models. The third type shows results for all 12 models in a single view. Note that the performance ranking (PR) is created based on the F1 score consistent with the preferences of our physicians. The next dropdown chooses the model. Note that users can choose a particular ML model from the list of 12 models and can then choose the form of evaluation feedback from the following options: Confusion matrix, Calibration plot, Original Result Table, Radar chart, Temporal Performance and Temporal Stability Measure. Aside from the last option for temporal evaluation, which reviews 10 time segments (TS), the other evaluations only present results from the last TS/last-6-month (prospective hold-out) period. The Radar Chart, Calibration Plot and Result Table are presented with 12 results reflecting the 12 models used in our case study.
Figure 5:
Illustration figure of partial Physician Experience Design (PXD) dashboard in Technical Investigation (TI). Note that, “Help” in blue provides brief explanations for dashboard functionality while “Background Info” in green provides more information on the evaluation metrics.
3. Results
3.1. Experimental Setup
We built 12 machine learning (ML) models for identifying delirium status. The sample size of the input data is presented in Table 1. For more details on the features, please refer to [14, 15, 16, 17].
Table 1:
The sample size of input data.
| | Number of admissions/rows | Number of features |
| 1. iML-Delirium scheme | 3862 | 324 |
The 12 ML models, along with hyperparameter tuning and cross validation, were implemented in the Python package Scikit-learn [21]. Hyperparameter tuning was conducted using the RandomizedSearchCV and GridSearchCV functions. Cross validation was employed via the cross_val_score, cross_validate and cross_val_predict functions. The gradient boosting classifier was trained via the GradientBoostingClassifier function. The AdaBoost classifier used the AdaBoostClassifier function. The neural network classifier was implemented by the MLPClassifier function. The decision tree classifier was employed via the DecisionTreeClassifier function. k nearest neighbors (kNN) was trained via the KNeighborsClassifier function. The logistic regression classifier was conducted using the LogisticRegression function. The random forest classifier was implemented via the RandomForestClassifier function. The linear support vector machine (SVM) was employed via the svm function with kernel = 'linear'. Gaussian Naïve Bayes was implemented via the GaussianNB function. The linear discriminant analysis classifier was trained via the LinearDiscriminantAnalysis function. The quadratic discriminant analysis classifier was employed via the QuadraticDiscriminantAnalysis function. The voting classifier with soft setting was implemented by the VotingClassifier function.
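A minimal sketch of how one of these models might be tuned with 5-fold cross validation in Scikit-learn is shown here. The synthetic data from `make_classification`, the parameter grid, the F1 scoring choice, and the simple index-based split are all assumptions for illustration; they stand in for the GEMINI data and the study's actual search space, which are not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with roughly 25% positives (not the GEMINI data).
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_train, y_train = X[:150], y[:150]   # stand-in for the training segments
X_test, y_test = X[150:], y[150:]     # stand-in for the hold-out segment

# Hypothetical search space; the study's grid is not reported.
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,            # 5-fold cross validation on the training set
    scoring="f1",    # assumed here because physicians favoured the F1 score
)
search.fit(X_train, y_train)

# The refitted best model is then applied to the hold-out data.
pred = search.predict(X_test)
proba = search.predict_proba(X_test)[:, 1]
```

`RandomizedSearchCV` can be substituted for `GridSearchCV` when the search space is too large to enumerate.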
3.2. Experimental Results
We trained these models with hyperparameter tuning and 5-fold cross validation across 9 time segments. We then tested the tuned model on the tenth time segment, on a rolling basis, so that each of the 10 time segments was used as the holdout set on one occasion. The 12 ML models’ results can be accessed at https://pxd-dashboard.herokuapp.com/ in the third type of dashboard. Due to page limitations in this paper, we only compare the performance of the 12 models in Figure 6 and show the best-performing one (i.e., GBC), selected by physicians through PXD, in Figure 7.
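The rolling evaluation described above, in which each time segment serves once as the holdout set, can be sketched as a generator of index splits. The function name and the toy segment labels are hypothetical; scikit-learn's `LeaveOneGroupOut` offers a similar splitter when segment labels are passed as groups.

```python
def rolling_holdout_splits(segment_ids):
    """Yield (held_out_segment, train_indices, test_indices), holding out each segment once."""
    for held_out in sorted(set(segment_ids)):
        train_idx = [i for i, s in enumerate(segment_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(segment_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# Toy example: 20 admissions, two per segment, segments numbered 0 to 9.
segment_ids = [i // 2 for i in range(20)]
splits = list(rolling_holdout_splits(segment_ids))
```

Within each split, hyperparameters would be tuned on the nine training segments only, so that the held-out segment never influences model selection.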
Figure 6:
12 ML models’ performance comparison presented in PXD dashboard.
4. Discussion
Our proposed physician experience design approach, within an XAI framework, uses a dashboard that enables effective interaction with physicians. The dashboard presents information to physicians efficiently, so that the ML models and their corresponding results can be interpreted in a way that physicians understand. The dashboard presented here only includes PXD-selected ML models based on the CI stage, so that ML model performance can be quickly displayed to physicians. We implemented a screening process so that physicians would not be overwhelmed with information. The diverse visualization of results should give physicians a more intuitive sense of ML model performance. To accommodate the needs of other users, we also developed two other types of dashboard comparing performance across the 12 models.
In contrast to ML model selection based only on the values of evaluation metrics, physicians selected models based on more complex evaluation criteria such as the interpretability/explainability of the algorithm, good calibration, stable performance over time without failure in any time period, and higher sensitivity. As a result, GBC was selected by physicians because it is a tree-based model with good interpretability [22, 23], it is well calibrated, its performance was relatively stable over time, and it had generally higher sensitivity than the other models.
Currently, communication between the development team and physicians takes place via emails, meetings and questionnaires, which introduces delays and limits efficiency. To address this limitation, we plan to develop a compatible physician-centered interactive ML system that includes both ML model development and evaluation, to boost the efficiency of the PXD process.
5. Conclusion
Delirium is a highly prevalent, preventable and treatable neurocognitive disorder, which is associated with very poor outcomes when untreated. It is characterized by acute onset of fluctuating mental status, psychomotor disturbance and hallucinations and thus is difficult to spot, creating an opportunity for higher quality care through automated identification of delirium, or of delirium risk. In the research reported in this paper, we have presented a PXD approach within an XAI framework using a dashboard that physicians can interact with. Allowing physicians to explore models and their predictions through the dashboard should improve trust in ML models and their corresponding results as well as an understanding of likely defects in model prediction so that the physicians can use ML more effectively in clinical decision making for delirium. Developing PXD for physicians is a delicate balance between providing a rich set of information (able to dive into details) and providing a concise overview (just what the physician needs to know). The physicians working with us in this study were comparatively knowledgeable about data science and ML prediction. Further research is needed to determine how much training different types of physician will need to have in order to be able to use ML prediction results effectively in their clinical decision making, and to determine the extent to which a PXD approach can reduce those training requirements.
Acknowledgment
The authors would like to thank the Canadian Institutes of Health Research Foundation and the Natural Sciences and Engineering Research Council (NSERC) for funding this work with a Collaborative Health Research Projects Grant (Application Number 415033). This research was also supported by an NSERC Discovery Grant to the second author (RGPIN-2018-06591).
Footnotes
1. https://www.geminimedicine.ca/
2. https://pxd-dashboard.herokuapp.com/
References
- [1]. Maldonado José R. Acute brain failure: pathophysiology, diagnosis, management, and sequelae of delirium. Critical care clinics. 2017;33(3):461–519. doi: 10.1016/j.ccc.2017.03.013.
- [2]. Han Jin H, Wilson Amanda, Wesley Ely E. Delirium in the older emergency department patient: a quiet epidemic. Emergency Medicine Clinics. 2010;28(3):611–631. doi: 10.1016/j.emc.2010.03.005.
- [3]. Maldonado José R. Delirium pathophysiology: An updated hypothesis of the etiology of acute brain failure. International journal of geriatric psychiatry. 2018;33(11):1428–1457. doi: 10.1002/gps.4823.
- [4]. Ontario Health Provincial Geriatrics Leadership. Delirium care for adults quality standards. https://www.hqontario.ca/Portals/0/documents/evidence/quality-standards/qs-delirium-quality-standard-en.pdf.
- [5]. Gage L, Hogan DB. CCSMH guideline update: The assessment and treatment of delirium. 2014.
- [6]. Inouye Sharon K. Delirium in older persons. New England journal of medicine. 2006;354(11):1157–1165. doi: 10.1056/NEJMra052321.
- [7]. Yaffe Kristine, Weston Andrea, Graff-Radford Neill R, Satterfield Suzanne, Simonsick Eleanor M, Younkin Steven G, Younkin Linda H, Kuller Lewis, Ayonayon Hilsa N, Ding Jingzhong, et al. Association of plasma β-amyloid level and cognitive reserve with subsequent cognitive decline. Jama. 2011;305(3):261–266. doi: 10.1001/jama.2010.1995.
- [8]. Hshieh Tammy T, Yang Tinghan, Gartaganis Sarah L, Yue Jirong, Inouye Sharon K. Hospital elder life program: systematic review and meta-analysis of effectiveness. The American Journal of Geriatric Psychiatry. 2018;26(10):1015–1033. doi: 10.1016/j.jagp.2018.06.007.
- [9]. Amershi Saleema, Weld Dan, Vorvoreanu Mihaela, Fourney Adam, Nushi Besmira, Collisson Penny, Suh Jina, Iqbal Shamsi, Bennett Paul N, Inkpen Kori, et al. Guidelines for human-ai interaction. Proceedings of the 2019 chi conference on human factors in computing systems. 2019. pp. 1–13.
- [10]. Takeshita Harumi, Davis Dianne, Straus Sharon E. Clinical evidence at the point of care in acute medicine: a handheld usability case study. Proceedings of the Human Factors and Ergonomics Society Annual Meeting. Volume 46. Los Angeles, CA: SAGE Publications; 2002. pp. 1409–1413.
- [11]. Lottridge Danielle M, Chignell Mark, Danicic-Mizdrak Romana, Pavlovic Nada J, Kushniruk Andre, Straus Sharon E. Group differences in physician responses to handheld presentation of clinical evidence: a verbal protocol analysis. BMC medical informatics and decision making. 2007;7(1):1–12. doi: 10.1186/1472-6947-7-22.
- [12]. Hancock Peter A, Kajaks Tara, Caird Jeff K, Chignell Mark H, Mizobuchi Sachi, Burns Peter C, Feng Jing, Fernie Geoff R, Lavallière Martin, Noy Ian Y, et al. Challenges to human drivers in increasingly automated vehicles. Human factors. 2020;62(2):310–328. doi: 10.1177/0018720819900402.
- [13]. Linardatos Pantelis, Papastefanopoulos Vasilis, Kotsiantis Sotiris. Explainable ai: A review of machine learning interpretability methods. Entropy. 2021;23(1):18. doi: 10.3390/e23010018.
- [14]. Verma Amol A, Guo Yishan, Kwan Janice L, Lapointe-Shaw Lauren, Rawal Shail, Tang Terence, Weinerman Adina, Cram Peter, Dhalla Irfan A, Hwang Stephen W, et al. Patient characteristics, resource use and outcomes associated with general internal medicine hospital care: the general medicine inpatient initiative (gemini) retrospective cohort study. CMAJ open. 2017;5(4):E842. doi: 10.9778/cmajo.20170097.
- [15]. Verma Amol A, Masoom Hassan, Rawal Shail, Guo Yishan, Razak Fahad. Pulmonary embolism and deep venous thrombosis in patients hospitalized with syncope: a multicenter cross-sectional study in toronto, ontario, canada. JAMA internal medicine. 2017;177(7):1046–1048. doi: 10.1001/jamainternmed.2017.1246.
- [16]. Verma Amol A, Guo Yishan, Kwan Janice L, Lapointe-Shaw Lauren, Rawal Shail, Tang Terence, Weinerman Adina, Razak Fahad. Prevalence and costs of discharge diagnoses in inpatient general internal medicine: a multi-center cross-sectional study. Journal of general internal medicine. 2018;33(11):1899–1904. doi: 10.1007/s11606-018-4591-7.
- [17]. Rawal Shail, Kwan Janice L, Razak Fahad, Detsky Allan S, Guo Yishan, Lapointe-Shaw Lauren, Tang Terence, Weinerman Adina, Laupacis Andreas, Subramanian SV, et al. Association of the trauma of hospitalization with 30-day readmission or emergency department visit. JAMA internal medicine. 2019;179(1):38–45. doi: 10.1001/jamainternmed.2018.5100.
- [18]. Verma Amol A, Pasricha Sachin V, Jung Hae Young, Kushnir Vladyslav, Mak Denise YF, Koppula Radha, Guo Yishan, Kwan Janice L, Lapointe-Shaw Lauren, Rawal Shail, et al. Assessing the quality of clinical and administrative data extracted from hospitals: the general medicine inpatient initiative (gemini) experience. Journal of the American Medical Informatics Association. 2021;28(3):578–587. doi: 10.1093/jamia/ocaa225.
- [19]. Raghupathi Wullianallur, Raghupathi Viju. Big data analytics in healthcare: promise and potential. Health information science and systems. 2014;2(1):1–10. doi: 10.1186/2047-2501-2-3.
- [20]. Inouye Sharon K, Leo-Summers Linda, Zhang Ying, Bogardus Sidney T, Jr, Leslie Douglas L, Agostini Joseph V. A chart-based method for identification of delirium: validation compared with interviewer ratings using the confusion assessment method. Journal of the American Geriatrics Society. 2005;53(2):312–318. doi: 10.1111/j.1532-5415.2005.53120.x.
- [21]. Pedregosa Fabian, Varoquaux Gaël, Gramfort Alexandre, Michel Vincent, Thirion Bertrand, Grisel Olivier, Blondel Mathieu, Prettenhofer Peter, Weiss Ron, Dubourg Vincent, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research. 2011;12:2825–2830.
- [22]. Gunning David, Stefik Mark, Choi Jaesik, Miller Timothy, Stumpf Simone, Yang Guang-Zhong. Xai—explainable artificial intelligence. Science Robotics. 2019;4(37) doi: 10.1126/scirobotics.aay7120.
- [23]. Arrieta Alejandro Barredo, Díaz-Rodríguez Natalia, Ser Javier Del, Bennetot Adrien, Tabik Siham, Barbado Alberto, García Salvador, Gil-López Sergio, Molina Daniel, Benjamins Richard, et al. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion. 2020;58:82–115.













