Abstract
Use of artificial intelligence in healthcare, such as machine learning-based predictive algorithms, holds promise for improving health outcomes, but few systems are used in routine clinical practice. Trust has been cited as an important challenge to meaningful use of artificial intelligence in clinical practice. Artificial intelligence systems often automate cognitively challenging tasks, so the previous literature on trust in automation may hold important lessons for artificial intelligence applications in healthcare. In this perspective, we argue that informatics should draw on the literature on trust in automation, with the goal of fostering appropriate trust in artificial intelligence based on the purpose of the tool, its process for making recommendations, and its performance in the given context. We adapt a conceptual model to support this argument and present recommendations for future work.
Keywords: artificial intelligence, trust, algorithms, machine learning
INTRODUCTION
Artificial Intelligence (AI) has been defined as a “machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions.”1 Big data and increased computing power have allowed for more predictive yet complex AI approaches, such as machine learning.2,3 AI algorithms show promise in improving healthcare, but few of the algorithms developed are actively used in clinical practice.2–8
Trust in AI is an important but challenging problem.9 AI in healthcare often automates cognitively challenging tasks, such as predicting which patients are at risk for poor outcomes. Therefore, the previous literature on trust in automation holds lessons for AI in healthcare. Lee and See’s 2004 review, specifically, describes why trust becomes increasingly important as automated systems become more complex: “Trust guides reliance when complexity…make[s] a complete understanding of the automation impractical…By guiding reliance, trust helps to overcome the cognitive complexity people face in managing increasingly sophisticated automation.”10 A 2015 update to Lee and See’s review included 127 articles, largely from other settings (eg, military, transportation), suggesting that important literature on trust in automation has yet to be applied in healthcare.11
We argue informaticists should conceptualize trust in AI as a complex, dynamic entity with the goal of fostering appropriate trust based on the purpose of the AI, its process for making recommendations, and its performance in the given context. This perspective aims to contextualize Lee and See’s model for an informatics audience and advocate for systems supporting appropriate reliance on AI in combination with clinical expertise, with the ultimate goal of improving health outcomes.
A MODEL FOR APPROPRIATE RELIANCE ON AI
Figure 1 presents a conceptual model for developing appropriate reliance on AI. The model focuses on trust by the end user, the person(s) intended to take some action based on the AI (typically, the health professional). In this model, response to AI involves 4 stages. First, people learn about the system and form beliefs. Next, they develop trust, which is a necessary step toward intention to use the AI. Finally, intention precedes action (ie, using the system or not). Trust evolution is also affected by the individual, organizational, cultural, and environmental context. We can then evaluate the appropriateness of trust based on how well users’ reliance on the AI matches the AI’s capabilities.
Figure 1.
Lee and See’s model for fostering appropriate trust in AI in medicine.
THE AI SYSTEM ATTRIBUTES
Purpose, process, and performance are key attributes for forming appropriate trust in AI.12,13 Purpose explains why the automation was developed. Conveying the purpose of an AI system can help prevent misuse, which could occur, for example, if a health professional uses an AI system intended to predict sepsis in adults to predict sepsis in children.
Process describes how the AI operates. Process-related information may include data inputs, analysis procedures, and data outputs. Providing this information to the user is important to the initial establishment of trust.9,14 The World Health Organization and others have advocated for improved explainability, whereby the AI describes how algorithms produced their predictions.15–17 Previous studies have indicated that users tend to trust automation if its process can be understood.18 The concept of explainability poses a challenge with machine learning and deep learning methods, which do not produce explanations for their findings without post hoc interpretation.19 However, qualitative research by authors of this perspective has indicated that the issue is not making deep learning itself more understandable, but instead providing enough information to facilitate appropriate trust and decision-making. Specifically, users reported wanting information on the high-level drivers of algorithms, such as whether they included both clinical and social risk factors, and which factors were most important in driving a patient’s risk score. This information was considered critical to determining what resources or interventions might be most appropriate for the patient.9
Performance refers to what the AI does and how well the system supports users’ goals. AI performance is commonly measured by statistics such as the area under the curve (AUC). It is unclear, however, whether these performance statistics are conveyed to users. In addition, the AUC describes the system’s overall performance and thus only indirectly addresses whether a health professional should trust its recommendation for a specific patient. It would be more helpful to show health professionals how often a prediction was correct for a specific type of patient (eg, the PPV [positive predictive value]). Yet health professionals struggle to understand concepts such as PPV and NPV (negative predictive value), and patients have even greater difficulty.20,21 A more intuitive presentation, such as a clearly described record of true positives versus false positives for a specific patient group (ie, a likelihood ratio), might therefore be more meaningful.9
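As an illustration only, the sketch below shows how the confusion matrix behind an overall statistic such as the AUC can be translated into the patient-group-level measures discussed here (PPV, NPV, and a positive likelihood ratio). The counts are hypothetical and not from any real deployment.

```python
# Hypothetical counts for one patient subgroup; not from any real system.
tp, fp, fn, tn = 80, 40, 20, 860

sensitivity = tp / (tp + fn)                   # 0.80: flags 80% of true cases
specificity = tn / (tn + fp)                   # ~0.96
ppv = tp / (tp + fp)                           # ~0.67: a flagged patient is a true case about 2 times in 3
npv = tn / (tn + fn)                           # ~0.98: an unflagged patient is truly low risk ~98% of the time
positive_lr = sensitivity / (1 - specificity)  # ~18: flagged patients are ~18 times more likely to have the outcome

print(f"PPV={ppv:.2f}, NPV={npv:.2f}, LR+={positive_lr:.1f}")
```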
CONTEXT AND ITS IMPACT ON TRUST FORMATION AND EVOLUTION
Trust development is affected by individual, organizational, cultural, and environmental factors as described in Table 1.
Table 1.
Description of factors impacting trust formation
Factor affecting trust | Description | Example(s) |
---|---|---|
Individual | Individuals have varied predispositions to trust a system based on experiential, psychological, and knowledge-based differences. | |
Organizational | Interpersonal communication, organizational culture, leadership, and the trust in the person or group who developed the system can all impact trust. | |
Cultural | Individual cultural differences may impact trust based on country of origin, role/profession (eg, physician, nurse, patient), age, race, or ethnicity. | |
Environmental (physical and external)28 | Physical and external environmental factors impact a user’s ability to utilize AI, understand system performance, and ultimately develop trust. The physical environment (eg, hospital layout, location of devices, resource access) may be less relevant to healthcare-related AI systems than physical automated systems in other domains (eg, military, transportation, industrial settings) or other patient safety-related problems (eg, patient falls, medication administration errors). Other external drivers (eg, governmental policies, activities of third-party organizations), however, may have important influences on trust. | |
Trust is a dynamic process involving information assimilation, trust formation, intention formation, and actual reliance actions (Figure 1), so different contextual factors have greater impact at different points. We highlight these ever-evolving contextual factors to demonstrate that trust development is not one-size-fits-all.29
MEASURING TRUST APPROPRIATENESS
The appropriateness of trust in AI can be assessed through 2 distinct yet related constructs—calibration and resolution (Table 2).
Table 2.
Description of constructs related to measuring trust appropriateness
Measurement construct | Description | Example |
---|---|---|
Calibration | The extent to which a user’s trust matches the performance of the system | Trust should be higher for better performing systems. For example, if an AI system has a high AUC, people should trust it frequently. If the system has a poor AUC, people should trust it less frequently. |
Resolution | The users’ ability to adapt their trust based on changing functions and goals, or how the AI performance changes over time30 | Resolution can be measured through: (1) “functional specificity,” or how the user adapts their trust based on the problem context. For example, a natural disaster, such as a pandemic or weather event, may change the data so dramatically that an algorithm’s recommendations may not be relevant for future forecasts. In other scenarios, AI may not perform as well for certain populations for which there is biased or insufficient data, and user trust should shift accordingly. (2) “temporal specificity,” or how the users’ trust adapts over time as the decision support becomes more or less accurate. For AI that becomes more accurate over time as it gains more information about the patient, user trust should also adapt accordingly. Alternately, if the system becomes outdated, trust should decline with time. |
RECOMMENDATIONS AND FUTURE DIRECTIONS
Integration of the concepts presented here into AI systems will require a collaborative effort from informaticists, computer scientists, application developers, health organizations, and patient/family advisory groups. Below, we provide recommendations for conceptualizing trust as a complex, dynamic construct, with the goal of achieving appropriate reliance on AI in conjunction with clinical expertise to facilitate treatment decisions that improve health outcomes.
Allow the user to view the purpose, process, and performance of the AI system
Provision of this information should be dynamic and flexible, as different users may want different levels of detail or need different levels of explanation (eg, patients vs health professionals). Typically, this information should be provided within the AI system interface. It should also adhere to principles of user-centered and inclusive design, ensuring the information can be easily found and understood by the various users.31 Barda et al provide a framework for user-centered design of displays of machine learning predictions in healthcare.29 Patient-facing systems, for example, should use plain language and provide options for having the text read aloud for users with limited literacy or visual impairment. It is also important to ensure accessibility across different device types (eg, mobile device, tablet, laptop, or desktop).
Purpose descriptions may be most critical in the early implementation of an AI system and may be conveyed to first-time users. Safety-critical scenarios (ie, where misuse may result in patient harm) may require more engagement from users, such as answering questions to demonstrate understanding of the system’s purpose. At minimum, the purpose should briefly describe why the system was created, by whom, for whom, using what training data (ie, from the local site or another), and the appropriate and inappropriate scenarios for using the system. Ideally, the AI system would not display recommendations at all for inappropriate use cases. Following AI introduction, purpose information should remain available for new users and for review by existing users. If developers update the purpose post-deployment, it may be appropriate to display a notice about these updates the next time the user enters the system.
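As a hypothetical illustration of suppressing display for inappropriate use cases, the sketch below checks a patient against an assumed intended-use definition (an adult inpatient model) before showing a score. The field names, thresholds, and eligibility rules are our own illustrative assumptions, not those of any specific deployed system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IntendedUse:
    """Assumed intended-use definition for a hypothetical adult inpatient sepsis model."""
    min_age_years: int = 18
    settings: tuple = ("inpatient",)

def should_display_prediction(patient_age: float, care_setting: str,
                              use: IntendedUse = IntendedUse()) -> bool:
    # Show the score only when the patient falls inside the model's intended population.
    return patient_age >= use.min_age_years and care_setting in use.settings

print(should_display_prediction(9, "inpatient"))   # False: suppress for a pediatric patient
print(should_display_prediction(54, "inpatient"))  # True
```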
Process display guidance can be broken into system inputs (data) and outputs (recommendations). Data inputs should be reviewed early and iteratively throughout development. Review by end users (health professionals or patient advisory groups) and clinical informaticists may detect data inputs that are prone to missingness or quality issues, as well as information that may introduce bias. AI systems that rely on historical data can be systematically biased with respect to age, race, or gender, and recent guidance from the WHO stresses the need for inclusive, equitable AI design.15,32 Sharing inputs with stakeholders early can therefore help detect problematic inputs and mitigate potential biases in AI systems. One study, for example, gathered feedback from health professionals to determine whether model outputs at certain points were plausible or relevant.33 Reviewing inputs with experts prior to implementation may also foster trust with the larger group of end users once the system is deployed. Further, working with end users may help determine which inputs users find surprising, so that descriptions of these inputs can be provided and used to demonstrate the utility of the system above and beyond clinical gestalt. Post-deployment, it may also be helpful to provide a means for all users to view a list of data inputs.
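A minimal sketch of the kind of pre-implementation input review described above follows. It assumes a tabular extract of candidate inputs; the file path and column names (eg, race_ethnicity, social_risk_score) are illustrative, not drawn from any particular system.

```python
import pandas as pd

# Assumed extract of candidate model inputs; path and column names are illustrative.
df = pd.read_csv("model_training_inputs.csv")

# 1) Inputs prone to missingness: fraction of missing values per column.
print(df.isna().mean().sort_values(ascending=False).head(10))

# 2) Representation: does the training data cover the populations the model will serve?
print(df["race_ethnicity"].value_counts(normalize=True))

# 3) Differential missingness: a key input missing far more often for one group
#    than another is a candidate source of bias worth flagging to reviewers.
print(df.groupby("race_ethnicity")["social_risk_score"].apply(lambda s: s.isna().mean()))
```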
Displaying process outputs (decision support) involves explaining how the system arrived at its recommendation. One group, for example, developed an explainable AI system that doubled the correct identification of hypoxemia during surgery; the interface overlaid predictions with the model features that increased or decreased risk and provided the related values (eg, blood pressure, body mass index).34 Developments in computational methods, such as SHapley Additive exPlanations (SHAP values), which allow post hoc quantification of each predictor’s contribution to a model’s output, have made displaying the most important values more feasible.35
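To make this concrete, the sketch below uses the open-source shap library to compute per-patient feature attributions for a tree-based model. The data and model are synthetic stand-ins for illustration; a real interface would surface only the top few drivers alongside their clinical values.

```python
import numpy as np
import shap                                    # open-source SHAP library
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a tabular risk model; features and outcome are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=500) > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Post hoc, per-patient attribution: how much each feature pushed this
# patient's prediction up or down (output format varies by shap version).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])
print(shap_values)
```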
Performance metrics are essential to allowing users to trust systems appropriately. At minimum, displaying metrics such as PPV and NPV may be beneficial. As described, health professionals and patients can struggle to understand these concepts,20,21 so future research should continue to explore creative, understandable ways of conveying system performance. To the extent possible, performance information should be dynamic and tailored to the context at various levels, for example, the institution, the department or unit, and patient characteristics such as demographics.
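The sketch below illustrates one way context-tailored performance could be computed, reporting PPV separately per unit rather than as a single overall number. The toy dataframe and its columns (y_true, y_pred, unit) are illustrative assumptions.

```python
import pandas as pd

# Illustrative scored patients with observed outcomes; y_pred is the model's binary flag.
scored = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 1, 1, 0, 0, 1, 1, 0],
    "unit":   ["ICU", "ICU", "ICU", "ICU", "ED", "ED", "ED", "ED"],
})

def ppv(group: pd.DataFrame) -> float:
    """Within one context: when the model flags a patient, how often is it right?"""
    flagged = group[group["y_pred"] == 1]
    return float("nan") if flagged.empty else float((flagged["y_true"] == 1).mean())

print(scored.groupby("unit").apply(ppv))  # one PPV per unit rather than a single overall number
```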
Work with local sites to adapt purpose-, process-, and performance-related information display appropriately
AI systems often start with baseline models that need to be adapted to different institutions.33 The same is true for designing an AI interface that facilitates appropriate trust formation. Updating the AI interface design may be done in parallel with the aforementioned activities for determining how process inputs are displayed. Working with end users at local sites will also allow developers to understand and account for contextual differences.29 As described, organizational, cultural, and environmental factors impact the formation and evolution of trust. Incorporating diverse end users from local sites will help developers understand these contextual differences and how they may be accounted for in the AI interface. Utilizing information design concepts such as overview, filter, and details on demand will also allow different individuals to adaptively obtain the information needed to develop appropriate trust.36
Redefine how we measure success in AI
Measures of success for AI have hinged on statistics such as PPV, NPV, AUC, and the Akaike information criterion (AIC), which account only for model performance under controlled conditions and not for how people use the models in practice. Calculating and monitoring measures of trust calibration and of temporal and functional specificity (to assess resolution) will help developers understand whether end users appropriately trust AI systems. Assessing temporal specificity can also help monitor AI systems over time to detect whether performance improves or degrades, an improvement over current practice, in which performance statistics are often calculated only pre-deployment. It may also be helpful to investigate how these metrics can be productively displayed to end users to improve the appropriateness of their trust, as has been done with systems in other domains.37 Similar examples of displaying performance information can be seen in other aspects of healthcare, such as the ordering of expensive imaging studies.38 Others have proposed decision-based statistics such as the “net reclassification improvement,” which measures whether use of a model improves users’ ability to identify true positive cases.39
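As one sketch of post-deployment temporal monitoring (one facet of resolution), the example below recomputes a performance statistic per calendar month so that degradation after, say, a case-mix shift becomes visible. The log file and its columns (scored_at, outcome, risk_score) are illustrative assumptions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed post-deployment log of risk scores and observed outcomes.
log = pd.read_csv("prediction_log.csv", parse_dates=["scored_at"])

def monthly_auc(group: pd.DataFrame) -> float:
    # AUC is undefined if a month contains only one outcome class.
    if group["outcome"].nunique() < 2:
        return float("nan")
    return roc_auc_score(group["outcome"], group["risk_score"])

drift = (
    log.assign(month=log["scored_at"].dt.to_period("M"))
       .groupby("month")
       .apply(monthly_auc)
)
print(drift)  # a sustained drop signals retraining the model or recalibrating user trust
```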
CONCLUSION
AI approaches, such as machine learning-based prediction, hold immense promise for advancing the future of healthcare. However, AI must work collaboratively and seamlessly with the people embedded in our health systems. This requires that the means by which AI is developed, and the interfaces through which it is delivered, foster appropriate trust among the people in our health systems (both professionals and patients). Conceptualizing trust with the goal that people appropriately trust and collaboratively work with AI will help realize the potential of human-AI teams in advancing health outcomes.
AUTHOR CONTRIBUTIONS
NCB proposed the initial concept, which was further articulated with LLN, CR, and JSA. NCB drafted the manuscript with substantial input from all other authors.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
No new data were generated or analysed in support of this research.
References
- 1. Organisation for Economic Co-operation and Development. What are the OECD principles on AI? OECD Observer 2019; doi: 10.1787/6ff2a1c4-en.
- 2. Wang F, Casalino LP, Khullar D. Deep learning in medicine—promise, progress, and challenges. JAMA Intern Med 2019; 179 (3): 293–4.
- 3. Ching T, Himmelstein DS, Beaulieu-Jones BK, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15 (141): 20170387. doi: 10.1098/rsif.2017.0387.
- 4. Shah NH, Milstein A, Bagley SC. Making machine learning models clinically useful. JAMA 2019; 322 (14): 1351.
- 5. Amarasingham R, Patzer RE, Huesch M, Nguyen NQ, Xie B. Implementing electronic health care predictive analytics: considerations and challenges. Health Aff (Millwood) 2014; 33 (7): 1148–54.
- 6. Levy-Fix G, Kuperman GJ, Elhadad N. Machine learning and visualization in clinical decision support: current state and future directions. arXiv [csLG] 2019. http://arxiv.org/abs/1906.02664.
- 7. Girosi F, Mann S, Kareddy V. Narrative Review and Evidence Mapping: Artificial Intelligence in Clinical Care. Washington, DC: Patient-Centered Outcomes Research Institute; 2021.
- 8. Grossman Liu L, Rogers JR, Reeder R, et al. Published models that predict hospital readmission: a critical appraisal. BMJ Open 2021; 11 (8): e044964.
- 9. Benda NC, Das LT, Abramson EL, et al. “How did you get to this number?” Stakeholder needs for implementing predictive analytics: a pre-implementation qualitative study. J Am Med Inform Assoc 2020; 27 (5): 709–16.
- 10. Lee JD, See KA. Trust in automation: designing for appropriate reliance. Hum Factors 2004; 46 (1): 50–80.
- 11. Hoff KA, Bashir M. Trust in automation: integrating empirical evidence on factors that influence trust. Hum Factors 2015; 57 (3): 407–34.
- 12. Zuboff S. In the Age of the Smart Machine. New York, NY: Basic Books; 1988.
- 13. Lee J, Moray N. Trust, control strategies and allocation of function in human-machine systems. Ergonomics 1992; 35 (10): 1243–70.
- 14. Reale C, Novak LL, Robinson K, et al. User-centered design of a machine learning intervention for suicide risk prediction in a military setting. AMIA Annu Symp Proc 2020; 2020: 1050–8.
- 15. World Health Organization. Ethics and Governance of Artificial Intelligence for Health: WHO Guidance. Geneva: World Health Organization; 2021.
- 16. Gordon L, Grantcharov T, Rudzicz F. Explainable artificial intelligence for safe intraoperative decision support. JAMA Surg 2019; 154 (11): 1064–5.
- 17. Deeks A. The judicial demand for explainable artificial intelligence. Columbia Law Rev 2019; 119 (7): 1829–50.
- 18. Sheridan TB. Telerobotics, Automation, and Human Supervisory Control. Cambridge, MA: MIT Press; 1992.
- 19. Wang F, Kaushal R, Khullar D. Should health care demand interpretable artificial intelligence or accept “Black Box” medicine? Ann Intern Med 2020; 172 (1): 59–60.
- 20. Ferguson E, Starmer C. Incentives, expertise, and medical decisions: testing the robustness of natural frequency framing. Health Psychol 2013; 32 (9): 967–77.
- 21. Ottley A, Peck EM, Harrison LT, et al. Improving Bayesian reasoning: the effects of phrasing, visualization, and spatial ability. IEEE Trans Vis Comput Graph 2016; 22 (1): 529–38.
- 22. Zhang Z, Genc Y, Xing A, Wang D, Fan X, Citardi D. Lay individuals’ perceptions of artificial intelligence (AI)-empowered healthcare systems. Proc Assoc Inf Sci Technol 2020; 57 (1): e326. doi: 10.1002/pra2.326.
- 23. Greenhalgh T, Robert G, Macfarlane F, Bate P, Kyriakidou O. Diffusion of innovations in service organizations: systematic review and recommendations. Milbank Q 2004; 82 (4): 581–629.
- 24. Araujo T, Helberger N, Kruikemeier S, de Vreese CH. In AI we trust? Perceptions about automated decision-making by artificial intelligence. AI Soc 2020; 35 (3): 611–23.
- 25. Thurman N, Moeller J, Helberger N, Trilling D. My friends, editors, algorithms, and I. Digit J 2019; 7 (4): 447–69.
- 26. Smith A. Public Attitudes Toward Computer Algorithms. Pew Research Center; 2018. https://www.pewresearch.org/internet/2018/11/16/public-attitudes-toward-computer-algorithms/ Accessed July 13, 2021.
- 27. Karvonen K, Cardholm L, Karlsson S. Designing trust for a universal audience: a multicultural study on the formation of trust in the Internet in the Nordic Countries. In: International Conference on Universal Access in HCI; 2001: 1078–82; New Orleans, LA.
- 28. Carayon P, Wooldridge A, Hoonakker P, Hundt AS, Kelly MM. SEIPS 3.0: human-centered design of the patient journey for patient safety. Appl Ergon 2020; 84: 103033.
- 29. Barda AJ, Horvat CM, Hochheiser H. A qualitative research framework for the design of user-centered displays of explanations for machine learning model predictions in healthcare. BMC Med Inform Decis Mak 2020; 20 (1): 257.
- 30. Duez PP, Zuliani MJ, Jamieson GA. Trust by design: information requirements for appropriate trust in automation. In: Proceedings of the 2006 Conference of the Center for Advanced Studies on Collaborative Research. CASCON ’06. IBM Corp.; 2006: 9–es.
- 31. Keates S. BS 7000-6:2005 Design Management Systems. Managing Inclusive Design. Guide; 2005. http://gala.gre.ac.uk/id/eprint/12997/ Accessed September 25, 2020.
- 32. O’Neil C. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown; 2016.
- 33. Oh J, Makar M, Fusco C, et al. A generalizable, data-driven approach to predict daily risk of Clostridium difficile infection at two large academic health centers. Infect Control Hosp Epidemiol 2018; 39 (4): 425–33.
- 34. Lundberg SM, Nair B, Vavilala MS, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2018; 2 (10): 749–60.
- 35. Lubo-Robles D, Devegowda D, Jayaram V, Bedle H, Marfurt KJ, Pranter MJ. Machine learning model interpretability using SHAP values: application to a seismic facies classification task. In: SEG Technical Program Expanded Abstracts 2020. Society of Exploration Geophysicists; 2020. doi: 10.1190/segam2020-3428275.1.
- 36. Shneiderman B. The eyes have it: a task by data type taxonomy for information visualizations. In: Bederson BB, Shneiderman B, eds. The Craft of Information Visualization. San Francisco, CA: Morgan Kaufmann; 2003: 364–71.
- 37. Cring EA, Lenfestey AG. Architecting Human Operator Trust in Automation to Improve System Effectiveness in Multiple Unmanned Aerial Vehicles (UAV); 2009. https://scholar.afit.edu/etd/2516/ Accessed January 14, 2021.
- 38. Halpern DJ, Clark-Randall A, Woodall J, Anderson J, Shah K. Reducing imaging utilization in primary care through implementation of a peer comparison dashboard. J Gen Intern Med 2021; 36 (1): 108–13.
- 39. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology 2010; 21 (1): 128–38.