Abstract
There is growing interest in using AI-based algorithms to support clinician decision-making. An important consideration is how transparent complex algorithms can be for predictions, particularly with respect to imminent mortality in a hospital environment. Understanding the basis of predictions, the process used to generate models and recommendations, how to generalize models based on one patient population to another, and the role of oversight organizations such as the Food and Drug Administration are important topics. In this paper, we debate opposing positions regarding whether these algorithms are ‘ready yet’ for use today in clinical settings for physicians, patients and caregivers. We report voting results for each of these positions from audience members who attended the conference debate, obtained in real time from a smartphone-based platform.
INTRODUCTION
Artificial intelligence (AI) is actively being pursued by many to aid physician decision-making in a large variety of applications. The ability to inspect how recommendations were formed is challenged by: 1) the complexity of algorithms, 2) the need for companies to protect their intellectual property, and 3) the numeric literacy of physicians for interpreting probabilistic suggestions that may or may not be based on patients with similar characteristics as their current patient. Therefore, in parallel with increased activity improving complicated predictive algorithms, there is increased activity in exploring different approaches to inspecting the basis for predictions, which typically is conducted under the label of Explainable Artificial Intelligence (XAI; Core and colleagues, 2006).
Generally, the process of Artificial Intelligence for prediction includes 6 steps:
Data collection: Define the prediction problem and collect the data we need. The quality and quantity of data that we gather will directly determine how good our predictive model can be.
Data preprocessing: Clean the data and prepare it for training. This step often includes data cleaning, data transformation, and data reduction.
Feature selection and feature engineering: Use domain knowledge to select and extract new variables from the preprocessed data.
Model building: Train a machine learning algorithm to make predictions and validate it on holdout data.
Model evaluation: Select the performance measures and carry out the evaluation. This step helps to find the model that best represents the data and to estimate how well the chosen model will work in the future.
Model tuning: Tune the hyperparameters of the model to make the best predictions. This step helps the model generate the most accurate outcomes and the most effective decision-making suggestions for human users.
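
The six steps above can be sketched end to end in a few lines. The following is a minimal illustration under stated assumptions, not the method used in any study cited here: the records are synthetic, the single derived feature (heart rate divided by systolic blood pressure) and the nearest-neighbour model are hypothetical choices, and all function names are invented for the example.

```python
import random


def collect(n=200, seed=0):
    """Step 1: synthetic stand-in for data collection (all values hypothetical)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        died = rng.random() < 0.3
        hr = rng.gauss(105 if died else 80, 12)
        sbp = rng.gauss(95 if died else 120, 12)
        # Simulate a missing heart-rate measurement in roughly 5% of records.
        if rng.random() < 0.05:
            hr = None
        rows.append({"hr": hr, "sbp": sbp, "died": died})
    return rows


def preprocess(rows):
    """Step 2: drop records with missing or implausible values."""
    return [r for r in rows if r["hr"] is not None and r["sbp"] > 0]


def engineer(rows):
    """Step 3: derive one feature (heart rate / systolic pressure) per record."""
    return [(r["hr"] / r["sbp"], r["died"]) for r in rows]


def knn_predict(train, x, k):
    """Step 4: majority vote among the k training records nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(label for _, label in nearest) * 2 > k


def accuracy(train, data, k):
    """Step 5: fraction of records in `data` that are classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


def run_pipeline():
    data = engineer(preprocess(collect()))
    n = len(data)
    train = data[: n // 2]
    valid = data[n // 2 : 3 * n // 4]
    holdout = data[3 * n // 4 :]
    # Step 6: tune the hyperparameter k on the validation split only.
    best_k = max((1, 3, 5, 7, 9), key=lambda k: accuracy(train, valid, k))
    # The holdout split is touched once, after tuning is finished.
    return best_k, accuracy(train, holdout, best_k)


best_k, holdout_acc = run_pipeline()
print(f"chosen k = {best_k}, holdout accuracy = {holdout_acc:.2f}")
```

Keeping the holdout split out of the tuning step is what allows the final accuracy to serve as an estimate of how the chosen model might work on future patients.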
In the human factors literature, Klein, Hoffman, Woods and colleagues (2004) identified the need for AI-based systems to give transparent bases for judgments, recommendations, actions taken independently by systems, and the parameters under which AI is allowed to take independent actions. In addition, they stated that AI-based systems need to be able to identify when they are struggling to accomplish an ordered task early enough to allow a transition of responsibility to the human. On the other hand, Meyer and Sheridan (2017) have found that human operators tend to respond more strongly to false positive indications than is desirable when setting alarm thresholds, and so users may need to be cautioned not to over-react to warnings that a system is anticipated to fail by ‘throwing away’ the capability.
In a recent literature review (Hansen, Allen, Bitan, and Patterson, 2019, in press), we identified the following examples of AI-based algorithms (a non-exhaustive list based upon convenience sampling of targeted modeling and statistical methods):
neural nets to predict cardiac mortality (Song et al., 2014) and skin cancer (Dorj et al., 2018),
multi-dimension numeric models with a support vector machine (SVM) algorithm to predict ICU post-discharge mortality (Luo & Rumshisky, 2016),
tree algorithms using a random forest algorithm to predict sepsis (Taylor et al., 2016),
unsupervised topic models using a Latent Dirichlet Allocation (LDA) statistical model to predict patient mortality (Hassanpour & Langlotz, 2016), and
an augmented unsupervised topic model that incorporates both clustered and signal-based trends derived from continuous signal data as time series data to predict clinical decompensation (Ren et al., 2018).
METHODS
A debate was conducted. Three examples motivated the discussions:
Specialized physicians screening for neurological conditions using deep learning neural nets
Bedside registered nurses responding to auditory alarms indicating cardiac event predictions based on a proprietary algorithm based on telemetry data from hospital patients
Predicting respiratory compromise after surgery based upon population-based analyses of respiratory events and in-hospital mortality as outcomes and diagnostic related codes and demographic variables as inputs
Three debaters, Emily Patterson, Ted Allen, and C.J. Hansen, defended the following positions for these questions:
- Should we use this now for clinical care?
- Emily: No
- Ted: Yes
- C.J.: Yes
- Should we use this now for caregivers in hospitals?
- Ted: Yes
- Emily: No
- C.J.: Yes only for caregivers who opt-in
- Should individual patient probability ranges be required to be displayed for: A) clinicians, and B) patients?
- Ted: Yes (A)/Yes (B)
- Emily: No (A)/No (B)
- C.J.: Yes (A)/No (B)
- Should we require disclosure of algorithm details to oversight bodies such as the Food and Drug Administration or the Office of the National Coordinator?
- Ted: No
- Emily: Yes
- C.J.: Some disclosure
DEBATE POSITIONS AND VOTES
For Round 1 of the debate, the following information was provided to give context to the discussion. The application was specialized physicians screening for neurological conditions using deep learning neural nets. The primary data inputs were from a black and white dot matrix diagram representing a time series of pressure plate data. There was a small sample size of patients, but each patient had millions of data points. In general, the model performed well compared to other approaches in the published literature. Also, the signature of the condition was fairly clear visually.
As a preliminary step, Table 1 shows an overview of the process for generating the analysis. At most stages, there were both more and less thorough options. Using experimental design to pick a varied set of infants might make the models more trustworthy, but that is more expensive. Also, making up or “augmenting” data could be used, e.g., adding fake infants by scaling or rotating the data from the current infants. Some augmentations might even lead to safer modeling, such as “adversarial” or multiple fidelity experimental design (Huang & Allen, 2005). From model fitting through predicting in new cases, choices were fairly standard for “deep learning” in that the training data are perfectly predicted, leading to “zero residuals” and no traditional ability to estimate p-values or prediction intervals. Models are updated on different timescales for different applications, ranging from sepsis predictors in hospitals that are updated every three weeks to outpatient models that predict three-, six-, and twelve-month deterioration for patients with asthma.
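
The “augmenting” option described above can be illustrated with a small sketch. It assumes that each recording is stored as a two-dimensional grid of pressure values and that rotating or rescaling a grid does not change its label; both are assumptions made for illustration, and the function names are hypothetical.

```python
def rotate90(grid):
    """Rotate a 2D grid of values a quarter turn clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def scale(grid, factor):
    """Multiply every pressure value in the grid by a constant factor."""
    return [[value * factor for value in row] for row in grid]


def augment(grid, factors=(0.9, 1.1)):
    """Return synthetic variants of one recording: three rotations and two rescalings."""
    variants = []
    rotated = grid
    for _ in range(3):
        rotated = rotate90(rotated)
        variants.append(rotated)
    variants.extend(scale(grid, f) for f in factors)
    return variants


# A toy 2x3 "pressure plate" frame (values are made up).
frame = [[0, 1, 2],
         [3, 4, 5]]
for variant in augment(frame):
    print(variant)
```

Each original recording yields five additional training examples here. Whether such variants are clinically plausible is a separate question that the modeler would need to justify.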
Table 1.
The stages of modeling and modeling options showing a spectrum of options at each stage.
| Data Gathering or Generation | Model Fitting | Validation & Verification | New Cases | Decisions |
|---|---|---|---|---|
| Use Experimental Design (Spread Out Inputs); Just Regular Data; “Augment” Data (Rotations or Imputations); Either Arbitrary or Optimal Adversarial Examples | Traditional Method with Nonzero Residuals; Deep Learning Method With Zero Residuals (More Parameters Than Data) | p-values and Hypothesis Testing; Area Under The Curve on a Cross-Validation-Based ROC; Precision-Recall/Confusion Matrix | Traditional Intervals on Probabilities; Thresholds Based on Probability or Resampling | Leads to More Testing; Leads to Immediate Treatment |
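
Two of the validation options listed in Table 1, the confusion matrix and the area under the ROC curve, can be computed directly from predicted scores. The sketch below is illustrative only: the labels and scores are invented, the 0.5 threshold is an arbitrary assumption, and the AUC is computed on a single data set rather than through cross-validation.

```python
def confusion_matrix(labels, scores, threshold):
    """Count (tp, fp, tn, fn) when scores at or above the threshold are flagged."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y:
            tp += 1
        elif flagged:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def auc(labels, scores):
    """AUC as the chance that a random positive case outscores a random negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 4 patients with the condition (1) and 6 without (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.3, 0.2, 0.1, 0.1]

tp, fp, tn, fn = confusion_matrix(labels, scores, threshold=0.5)
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("AUC:", auc(labels, scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason the two are usually reported together.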
For this situation, trained healthcare personnel could probably do this task well in a few minutes, but possibly not as well as the algorithm when a similar patient population is used as was used for model development.
Round 1
In response to the question: “Should we use this now for clinical care?” the following positions were taken:
1). Emily: No
We are not anywhere near ready yet to use this for clinical care based on this study with a small number of patients with unique characteristics. Why would we substitute an expensive, complicated, difficult-to-maintain algorithm for a trained clinician who can reliably do the job well already? It may sound like a simple screening decision, and thus having low risks if the infant does not have the condition, but there is a ‘slippery slope’ argument that other more complex, higher risk decisions will then be pursued. Even for this condition, if screening is positive, then the infant will take on the risks associated with having anesthetic medications administered in order to get an MRI image. Due to the young ages and weights of these patients, there are also potential risks of having the wrong dose of the medication administered. In the end, there is no clear reason to bring these kinds of approaches, where we don’t really understand how the model was constructed, how the algorithms can fail, how to inspect the code for mistakes, or how much to trust the recommendations, into a situation where diagnosis is already working reasonably well with available data and approaches and we can ask the trained experts what they are basing their interpretations upon and how certain they are.
2). Ted: Yes
The algorithm we designed has been shown to have high sensitivity and high specificity on both our training data and actual data. In addition, it has been shown to have a high Area Under the Curve (AUC) of 98% for real data. Our training set may be small, but we have been able to effectively narrow down the exact movement that a baby with the neurological condition makes, and our algorithm can identify it. This is simply a screening mechanism for a lifelong condition that can be easily detected through our algorithm. There are no life-or-death stakes on the line, and the early information can be greatly beneficial to new parents as soon as they can get it. If we have shown that our model can identify the condition better than even an expensively trained human (and such experts are not easily available in many settings), why should we not use it? At this point, it would be immoral not to allow its use, particularly since the findings are validated by further screening with an MRI image before a diagnosis is made by a clinician.
3). C.J.: Yes
While I believe this diagnostic tool is a successful pursuit of algorithms in medicine, we can’t forget the increased complexity that comes with introducing a machine learning algorithm into diagnosis. Algorithms that are trained on a small sample run the risk of overfitting their model and making poorly-based decisions. I think it is wise to be cautious when having to create data in order to make a successful algorithm.
We must also be aware of factors beyond the scope of the hospital environment, such as corporate ethics. When medical device companies design and sell these algorithms, what responsibilities do they take on? Will companies be obliged to update their algorithms if they are found to contain a mistake? It was recently announced that software updates to Apple iPhones included code that made programs run slower on older phones, incentivizing people to upgrade to newer models. Could a similar situation happen when lives are at risk, with device companies incentivizing hospitals to update to their expensive new versions of algorithms? These problems must be considered when rolling out and regulating new algorithms that are purported to “solve” all issues.
I believe that machine learning algorithms can and should be implemented in hospitals, but only with a full consideration of additional risk that hospitals are undertaking. Algorithms should be implemented if they are significantly better than their human counterparts, and only so in non-high-risk scenarios such as this infant diagnosis scenario. Otherwise, humans should be left in the driver’s seat.
Round 1 results of voting (out of approximately 86 attendees) were:
Ted: 15/37 (40.5%)
Emily: 14/37 (37.8%)
C.J.: 6/37 (16.2%)
3-way tie: 2/37 (5.4%)
Round 2
Next, we considered a case in which the modeler was not willing to share details about the process. The modeler is simply stated to be reputable, to be predicting mortality, and to be sharing predictions with clinicians and patients. In response to the question: “Should we use this now for caregivers in hospitals?” the following positions were taken:
1). Ted: Yes
The current problems that are known about false alarms with medical alarms are not necessarily an indictment of machine learning modeling. In fact, modeling methods might create much more useful alarms. Also, I don’t like when people are assumed to be irresponsible and unable to handle information. Caregivers are not stupid. They can be given useful information on the people they are looking over including probabilities and intervals on these probabilities.
2). Emily: No
I currently serve as the Principal Investigator on an AHRQ-funded project to reduce alarm overload with telemetry alarms for bedside nurses in acute care settings in hospitals. Our findings clearly show that patients’ rooms are already inundated with alarms, most of which are false or non-actionable. As one caregiver stated in a recent survey with respect to existing auditory alarms in the hospital room: “They are anxiety provoking and do not allow the patient to rest. If there is a way to design the alarm system as to alert the nurse only and they can determine if the situation is urgent and necessary, I think it would serve everyone best. This would allow for maximum rest and recovery for the patient.”
Adding yet another layer of alarms for nurses on top of alarms from bedside monitors, central monitoring stations, and escalated alarms to hospital-provided cellphones or pagers, is highly unlikely to reduce avoidable patient mortality in the hospital. In the end, predictive models currently on the market are using the same data as inputs and are not calibrating the risk to prior probabilities based upon patient histories or data indicating high-risk cohorts from electronic health record data or operating room respiratory patterns.
When nurses do not immediately respond to auditory alarms, caregivers are concerned and surprised. One caregiver was asked on a survey this question: “During past hospital stays, either as a patient, family member, or visitor, did anything make you angry or surprise you regarding alarm sounds? Please explain”. Her response was: “Yes, when my identical twin sister was dying of ovarian cancer and the nurses, who were otherwise wonderful, did not respond instantly. I was frightened when her alarms went off and no one responded immediately.”
As part of our project, we are exploring how to increase compliance with hospital policies and procedures for patient cohorts identified for decades by the American Hospital Association that should not be placed on telemetry monitoring. From the family and caregiver perspective, patients who have elected to have the status of Do Not Resuscitate-Comfort Care (DNR-CC) do not benefit in any way from telemetry monitoring. The unnecessary noise, activity, and additional care activities resulting from monitored cardiac function in a patient with increasingly compromised function is best avoided. In particular, an important risk of monitored cardiac function is that the officially stated preference of forgoing resuscitation in the case of cardiac arrest is accidentally overridden during the required response to a ‘no signal’ (asystole) alarm.
When we start sharing the outputs of predictive models with caregivers, it is much different than sharing them with clinicians. Clinicians have access to all of the data from electronic health records and sensor data when they receive information, as well as the clinical knowledge to place that information in context with respect to other patients. In some cases, the output from the predictive model might be the only information that is provided to the caregivers. As such, it could be inferred to be coming from a clinical team with a higher level of certainty than is warranted. In response to alarms, caregivers have few options. They can call their nurse using a call button or ask people who enter the hospital room to explain the output to them. They can talk with family members to understand what is likely to be happening. They can go into the hallway and call out asking for immediate help. The most powerful action that they can take is pulling the cord for ‘code blue’, which is meant to be pulled only by clinical staff and which summons a Rapid Response Team to arrive as soon as possible. In the event that they pull this cord, the emergency system will predictably have more false alarms, and thus eventually be less effective for patients who really need that support. If clinical staff do not respond quickly in that situation, in part because they are aware that the cord was pulled by a caregiver, the patient satisfaction score (HCAHPS, for Hospital Consumer Assessment of Healthcare Providers and Systems), which now is included in reimbursement payments, is likely to be lower. Notably, the HCAHPS survey includes the question: “During this hospital stay, after you pressed the call button, how often did you get help as soon as you wanted it?”
Overall, providing uncertain information to caregivers in a way that is not contextualized by the prior risk for a patient’s experiencing a cardiac event will predictably increase stress without any clear benefit for patient safety. There is no benefit to providing information unless it is actionable and has a high degree of certainty, particularly when it relates to predicting potentially fatal cardiac events. The auditory alarm landscape in hospitals is already broken, and doing this would make the situation worse, not better.
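
The role of prior risk raised in this position can be illustrated with Bayes' rule. The sketch below is not drawn from any product or study mentioned here; the 90% sensitivity and specificity and the three prior risks are hypothetical numbers chosen only to show how strongly the meaning of an alarm depends on the patient's prior probability of an event.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Probability of a true event given an alarm, P(event | alarm), by Bayes' rule."""
    true_alarms = sensitivity * prior
    false_alarms = (1.0 - specificity) * (1.0 - prior)
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical alarm with 90% sensitivity and 90% specificity,
# applied to patients with different prior risks of a cardiac event.
for prior in (0.01, 0.05, 0.20):
    ppv = positive_predictive_value(0.90, 0.90, prior)
    print(f"prior risk {prior:.0%}: P(event | alarm) = {ppv:.1%}")
```

Under these assumed numbers, an alarm for a patient with a 1% prior risk reflects a true event about 8% of the time, compared with about 69% for a patient with a 20% prior risk, even though the alarm itself is identical.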
3). C.J.: Yes, only for caregivers who opt-in
Willing caregivers should have the choice to opt into learning this tool. Caregivers are on the front lines of assisting patients in need. If they are well educated, their ability to enhance patient wellness can be brought to bear. A recent study found that caregivers are frequently not engaged by medical personnel during outpatient visits, which is concerning considering the positive effect that caregivers have on patient-important outcomes (Boehmer et al., 2014). Giving caregivers access to this kind of system can greatly enhance their ability to assist in patient outcomes.
In order to give caregivers the full ability to consent to opting in to such a system, we must support fully informing them. This includes providing information about the data on which predictive algorithms are based. For example, recently companies have begun to use a patient’s ZIP+4 code to determine which patients are at “rising risk” and thus need an alarm. On the basis of the ZIP code, inferences can be made about educational level, digital fluency, nutritional choices, social support in the home, and transportation capabilities, primarily based on inferences made about income levels. Similarly, data from social media posts and online buying patterns can aid these inferences. Caregivers must be informed about metrics such as these in order to establish trust and rapport.
With this in mind, it is equally important to remember that caregivers are not trained as physicians. By placing this kind of AI system in the hands of people that don’t want to use it, we run the risk that they will over-assume the abilities of the algorithm. If this system is to be used, a caregiver must be briefed by a professional on what exactly the outputs mean, and what specific abilities they are given by the information. Caregivers should have the option to not be trained. However, eager caregivers that are willing and able to learn about the AI with which they are working can be of great benefit to patient wellness.
Round 2 results of voting (out of approximately 90 attendees) were:
Emily: 28/45 (62.2%)
C.J.: 13/45 (28.9%)
Ted: 4/45 (8.9%)
3-way tie: 0/45 (0%)
Round 3
In response to the question: “Should individual patient probability ranges be required to be displayed for: A) clinicians, and B) patients?” the following positions were taken:
1). Ted: Yes (A)/Yes (B)
Again, the more useful information that can be given to doctors and patients, the better. Probability ranges allow for better prediction and inference of problems. The algorithm is merely letting you know what it is seeing. There isn’t any harm in warnings that are put out by an early detection system, and, if anything, this information will help to inform and prevent critical events from happening to a patient.
2). Emily: No (A)/No (B)
It is irresponsible to be providing this kind of probabilistic information to doctors and patients without a true understanding of the diagnosis of a patient. In this same space, there are already talks of using a similar kind of algorithm that operates in prenatal screening for chronic disorders. Imagine a mother that may unnecessarily choose an abortion because of a faulty output from an algorithm. We should not allow doctors and patients to place blind trust in machine probabilities. Significant biases occur, and patients and doctors are bound to misinterpret the information they receive with dire consequences.
3). C.J.: Yes (A)/No (B)
Probability ranges should be provided for doctors but not for patients. Machine learning algorithms are bound to become integrated with medical work and doctors will have to have the ability to interpret results from these algorithms. Patients on the other hand are not trained as doctors, and there’s no reason to bring them into the diagnostic process. As Ted has told me in the past, “Whether you like it or not, artificial intelligence is coming. You can’t stop it.” I do believe that this is true. Healthcare is no different and will continue to integrate algorithms and automation further into the field. Doctors will have to adapt to these changes and should be required to understand the outputs of these kinds of machines. Therefore, they should absolutely be given access to individual patient probability ranges and be expected to interpret the results as they see fit. The same burden does not fall on patients’ shoulders, and they should not be given access to complex outputs that could be easily misinterpreted.
In response to the question: “Should we require disclosure of algorithm details to oversight bodies such as the Food and Drug Administration or the Office of the National Coordinator?” the following positions were taken:
1). Ted: No
Knowing the details of the modeling is different from knowing the details of the validation and verification, which I strongly support. Disclosing the “special sauce” can harm incentives for innovation, and model developers must eat too. Also, disclosure might not help since the methods are opaque. How would this kind of oversight look? Smart people understand their algorithms and work to improve them on a rigorously tested basis. Model simplicity, transparency, and disclosure are desirable but not always critical.
2). Emily: Yes
I do think that the FDA should provide oversight for these algorithms, although I do not think that the Office of the National Coordinator needs to include these when certifying Electronic Health Records. The risks from these algorithms are similar to risks from medical devices. It is irresponsible to let them operate and be introduced to a working hospital without knowing how well they work or what verification and validation process was employed, and without any post-market surveillance of harm or risks of harm. If we throw caution to the wind and let these companies operate unchecked, we have no idea what kinds of harm can happen to real patients. In order for the FDA to provide adequate oversight, they need to know details about how the proprietary algorithms work and need to include someone with technical expertise in the review process, specifically statistical and informatics expertise.
3). C.J.: Some disclosure
I fully support the power of innovation and believe that companies deserve to control their own intellectual property. The field of healthcare technology and pharmaceuticals could not have advanced as far as it has today without the power of competition. However, there must be some form of regulation by a governing body in the development of these kinds of algorithms. It is dangerous to only let companies tell you what their products do. A recent example of this danger can be seen in the documentary titled The Inventor: Out for Blood in Silicon Valley. This documentary describes a Silicon Valley blood testing company that claimed to be able to provide customers with a plethora of direct-to-consumer health information from a single drop of blood. In reality, the technology they claimed to have created did not exist, and the company was intentionally diluting blood samples and running them on commercially available Siemens blood testing machines, sending back incorrect results to consumers in Arizona. This was patient fraud of the highest degree. When a company is allowed to operate as a complete black box, patients are the ones who must suffer as a result. Oversight and regulation should be required for companies developing algorithms without forcing them to make their code open-source.
Round 3 results of voting, which also indicated the winner for the entire debate (out of approximately 90 attendees) were:
C.J.: 16/41 (39.0%)
Emily: 13/41 (31.7%)
Ted: 10/41 (24.4%)
3-way tie: 2/41 (4.9%)
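
The percentages reported for each round follow directly from the vote counts. The short sketch below recomputes them; only the counts come from the results above, and the data structure and function name are illustrative.

```python
# Vote counts as reported for each round (option -> number of votes).
rounds = {
    "Round 1": {"Ted": 15, "Emily": 14, "C.J.": 6, "3-way tie": 2},
    "Round 2": {"Emily": 28, "C.J.": 13, "Ted": 4, "3-way tie": 0},
    "Round 3": {"C.J.": 16, "Emily": 13, "Ted": 10, "3-way tie": 2},
}


def vote_shares(counts):
    """Convert raw vote counts into percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for label, counts in rounds.items():
    total = sum(counts.values())
    plurality = max(counts, key=counts.get)
    print(f"{label}: {total} votes, shares {vote_shares(counts)}, plurality: {plurality}")
```

Because fewer than half of the approximate number of attendees voted in Round 1, and about half in Rounds 2 and 3, the shares describe the voters rather than the full audience.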
DISCUSSION
Overall, this debate highlighted some issues to consider with respect to oversight, use, and maintenance of AI-based predictive algorithms for clinicians, patients, and caregivers. The voting results indicated: 1) a willingness to allow the use of these kinds of algorithms, even when their recommendations are not transparent, for real-time clinical use with low-stakes decisions such as initial screening for conditions, 2) increased concerns about disseminating auditory alarms based on these algorithms to caregivers in a hospital setting in a way that increases the noise burden from false and non-actionable alarms, and 3) an interest in some level of oversight and caution from the FDA as well as some level of disclosure from companies, which, unless forced to disclose, would likely be incentivized to over-claim the accuracy of their products or hide how algorithms work.
ACKNOWLEDGMENTS
This project was supported by the Institute for the Design of Environments Aligned for Patient Safety (IDEA4PS) at The Ohio State University, which is sponsored by the Agency for Healthcare Research & Quality (P30HS024379). The authors’ views do not necessarily represent the views of AHRQ. We thank Dr. Nathalie Maitre at Nationwide Children’s Hospital for her contributions with funding a related project and providing intellectual leadership, and Dr. Rajiv Ramnath for contributions in understanding appropriate deep learning modeling methods and for mentoring Qiwei Yang.
REFERENCES
- Boehmer KR, Egginton JS, Branda ME, Kryworuchko J, Bodde A, Montori VM, & LeBlanc A (2014). Missed opportunity? Caregiver participation in the clinical encounter. A videographic analysis. Patient Education and Counseling, 96(3), 302–307.
- Core MG, Lane HC, Van Lent M, Gomboc D, Solomon S, & Rosenberg M (2006, July). Building explainable artificial intelligence systems. In AAAI (pp. 1766–1773).
- Dorj UO, Lee KK, Choi JY, & Lee M (2018). The skin cancer classification using deep convolutional neural network. Multimedia Tools and Applications, 1–16.
- Hansen CJ, Allen TT, Bitan Y, & Patterson ES (2019, in press). Algorithmic Transparency for Predicting In-Hospital Mortality: A Targeted Literature Review. In Smart Thinking: Proceedings of the 14th International Naturalistic Decision Making Conference.
- Huang D, & Allen TT (2005). Design and analysis of variable fidelity experimentation applied to engine valve heat treatment process design. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(2), 443–463.
- Meyer J, & Sheridan TB (2017). The intricacies of user adjustments of alerting thresholds. Human Factors, 59(6), 901–910.
- Patterson ES, Hritz C, & Moffatt-Bruce SD (2019, in press). Reducing alert fatigue for comfort care and palliative care hospital patients. In Proceedings of the International Symposium on Human Factors and Ergonomics in Health Care. Sage India: New Delhi, India: SAGE Publications.
- Patterson ES, Hritz C, Gebru L, Patel K, Yamokoski T, & Moffatt-Bruce SD (2018, June). Use Preferences for Continuous Cardiac and Respiratory Monitoring Systems in Hospitals: A Survey of Patients and Family Caregivers. In Proceedings of the International Symposium on Human Factors and Ergonomics in Health Care (Vol. 7, No. 1, pp. 123–128). Sage India: New Delhi, India: SAGE Publications.
- Song T, Qu XF, Zhang YT, Cao W, Han BH, Li Y, … & Da Cheng H (2014). Usefulness of the heart-rate variability complex for predicting cardiac mortality after acute myocardial infarction. BMC Cardiovascular Disorders, 14(1), 59.
