Abstract
Background
Electronic health records (EHRs) have contributed to increased workloads for clinicians. Ambient artificial intelligence (AI) tools offer potential solutions, aiming to streamline clinical documentation and alleviate cognitive strain on healthcare providers.
Objective
To assess the clinical utility of an ambient AI tool in enhancing consultation experience and the completion of clinical documentation.
Methods
Outpatient consultations were simulated with actors and clinicians, comparing the AI tool against standard EHR practices. Documentation was assessed using the Sheffield Assessment Instrument for Letters (SAIL). Clinician experience was measured through questionnaires and the NASA Task Load Index.
Results
AI-produced documentation achieved higher SAIL scores, and consultations were on average 26.3% shorter, without reducing patient interaction time. Clinicians reported an enhanced experience and reduced task load.
Conclusions
The AI tool significantly improved documentation quality and operational efficiency in simulated consultations. Clinicians recognised its potential to improve note-taking processes, indicating promise for integration into healthcare practices.
Keywords: Artificial intelligence, Ambient, Consultation, Clinical documentation, Electronic health records
Introduction
There is great interest in artificial intelligence (AI) to improve healthcare practices.1 Whilst the introduction of electronic health records (EHRs) promised numerous benefits, including improved clinical documentation,2 it has also been associated with adverse effects: increased documentation workload, greater cognitive load, and physician burnout.3 With the development of large language models (LLMs) such as GPT-4, the potential of these tools in healthcare is being explored.4,5,6 However, the challenge is to develop AI tools that can integrate with existing EHR architecture and clinical workflows, while fulfilling appropriate information governance and security requirements.7
This study aimed to evaluate the clinical utility of an end-to-end ambient AI tool in a consultation to capture clinical information and improve clinical documentation. This study is the second phase of a multiphase study (Fig. 1). We hypothesised that there would be an improvement in the quality of notes and letters produced from consultations using the AI tool compared to standard practice and that the use of an AI tool would improve the quality and reduce the length of the consultation.
Fig. 1.
This paper reports on findings from the second phase of a multiphase trial of an ambient AI tool. The results from this phase (Phase 2) have informed the protocol design for Phase 3 (currently ongoing); in turn, the results of Phase 3 will inform the Phase 4 protocol.
Methods
Following the validation of a minimum viable product in a non-clinical ‘sandbox’ environment in Phase 1, we used an AI tool specifically designed to ambiently capture real-world clinical consultation audio and summarise it into a clinic note and letter, using a pipeline of AI models including speech-to-text and LLMs.8
The transcript is converted into a medical note by an LLM (GPT-4 32K) guided by a ‘prompt’ (specifically a standardised note template and style guidance); an architecture diagram is shown in Fig. 2.
Fig. 2.
Speech is recorded by the AI tool on the user's device. Before the recording is transferred to a local-region cloud provider for transcription, the personal health information (PHI) relating to the patient is redacted. The PHI is reinstated and the transcript is used to generate a clinic note and letter on the user's device; these can then be copied and pasted into the EHR patient record.
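No implementation code is published with this description; purely as an illustration of the flow just described, the sketch below shows how a redact-generate-reinstate pipeline could be wired together. All names and prompts are hypothetical, and the redaction is simplified to string replacement over the transcript (in the described system, PHI is redacted before the recording leaves the device).

```python
# Illustrative sketch only: hypothetical names standing in for the tool's actual
# components, assuming PHI can be swapped for placeholder tokens. Not the
# vendor's implementation.

def redact_phi(text: str, phi: dict[str, str]) -> str:
    """Replace patient-identifying values with placeholder tokens before upload."""
    for token, value in phi.items():
        text = text.replace(value, token)
    return text

def reinstate_phi(text: str, phi: dict[str, str]) -> str:
    """Restore the original PHI locally, on the user's device."""
    for token, value in phi.items():
        text = text.replace(token, value)
    return text

def consultation_to_documents(audio_path: str, phi: dict[str, str],
                              transcribe, llm) -> tuple[str, str]:
    """Transcribe (local-region cloud), generate note and letter, reinstate PHI."""
    transcript = redact_phi(transcribe(audio_path), phi)
    note_prompt = ("Summarise this consultation into a clinic note, following "
                   "the standardised template and style guidance.\n\n" + transcript)
    letter_prompt = "Draft a clinic letter from this consultation.\n\n" + transcript
    note = reinstate_phi(llm(note_prompt), phi)      # clinician reviews and edits
    letter = reinstate_phi(llm(letter_prompt), phi)  # before pasting into the EHR
    return note, letter
```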
Over a single day, we evaluated the AI tool against current standard practice, using a proof-of-concept (POC) environment of the EHR used by the study's clinicians for note and letter writing. We used dummy patient data, with professional medical actors playing the role of patients and parents alongside real clinicians from Great Ormond Street Hospital (GOSH), in a simulated clinic environment mimicking real-world outpatient consultations. This allowed direct comparison of consultations with and without the use of the AI tool. We assessed the scored quality of clinical documentation (quantitative and qualitative assessment), clinician and parent/carer (actor) experience, and operational efficiency, as well as applicability to multiple contributors (such as parent and child).
Eight experienced clinicians, independently selected by the study team for their breadth of specialty and use of technology, carried out simulated consultations. The pool included five medical consultants and three allied health professionals, all highly experienced with weekly outpatient clinics. Each clinician was assigned a station corresponding to their specialty and used standard hospital laptops operating the EHR with mock patient records. Actors role-playing patient/parent pairings rotated between stations. The sessions consisted of a ‘control’ rotation, in which clinicians conducted three consultations using the EHR as they would in practice, and an ‘intervention’ rotation, in which clinicians conducted three consultations with the AI tool and transferred final notes and clinic letters into the EHR (Fig. 3). The study team provided 10-min training sessions on the AI tool before the intervention rotations. Clinicians were allocated 20-min sessions to complete each of their consultations and produce a note and letter.
Fig. 3.
A visual representation of the rotation study structure for the morning and afternoon sessions (n = 4 clinicians in each session) and the schedule used for the study (‘NO’ referring to the ‘Station break’). Patient/parent actor pairs rotated clockwise around the stations, moving at the conclusion of each 20-min consultation window until they had visited all four stations. Clinicians and their corresponding human timers remained stationary for the full rotation and had one break during the rotation.
Occupational therapy, oncology, cardiology, speech and language therapy (cleft), respiratory, urology, dermatology, and speech and language therapy (feeding) specialties were represented. Fig. 3 provides a visual representation of the rotation study structure. At each station, an observer was responsible for recording timed variables. Three pairs of actors participated in each session. Prior to participation, actors received scripts for the assigned patients (created on the POC EHR). Forty-eight dummy patient stories (clinical backgrounds) were created and matched for age and complexity of clinical condition across both the EHR and the AI tool groups. Clinicians were blinded and provided with a simulated standard referral letter summarising the reason for clinical review.
The study assessed the quality of the documentation produced during the consultations. Each consultation required two completed documents, one clinic letter and one clinic note, within the allocated 20-min appointment; clinicians were alerted 15 min into the consultation, enabling them to finish and finalise a note and letter. For the AI tool, clinicians were instructed to review the documentation and make edits before transferring to the EHR. Quantitative assessment of documentation quality was completed by scoring notes and letters using a previously validated tool, the Sheffield Assessment Instrument for Letters (SAIL).9 SAIL comprises 20 questions, each scored 1 or 0, and a final question designed to elicit a global rating of 1 to 10, such that documents receive a score between 0 and 30, with 30 being the best possible score. Study participants were not made aware that SAIL would be used to assess their documentation. Each document was assessed using SAIL by two experienced clinicians, neither of whom was present on the simulation day, providing independent assurance. The assessors worked to reach a consensus, creating a multi-professional appraisal of each document. Documents were blinded: assessors were unaware of whether they were reviewing an EHR or an AI tool document. Once complete, a total score for each assessment was calculated (see Figs. 4-6 in appendix). To improve data visualisation, total scores were categorised as: ≤18 very poor; 19–21 poor; 22–24 fair; 25–27 good; 28–30 very good.
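To make the scoring arithmetic concrete, a minimal sketch follows, assuming the 20 binary items and the 1–10 global rating are simply summed and using the visualisation bands above; this is illustrative only, not the assessors' actual tooling.

```python
def sail_score(binary_items: list[int], global_rating: int) -> int:
    """Sum 20 items scored 0/1 plus a global rating of 1-10 (maximum total 30)."""
    assert len(binary_items) == 20 and all(i in (0, 1) for i in binary_items)
    assert 1 <= global_rating <= 10
    return sum(binary_items) + global_rating

def sail_category(total: int) -> str:
    """Map a total score onto the bands used for visualisation in this study."""
    if total <= 18:
        return "Very poor"
    if total <= 21:
        return "Poor"
    if total <= 24:
        return "Fair"
    if total <= 27:
        return "Good"
    return "Very good"

# Example: 17 items met plus a global rating of 9 gives 26, i.e. 'Good'.
print(sail_category(sail_score([1] * 17 + [0] * 3, 9)))
```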
Clinician experience was evaluated using an eight-item bespoke questionnaire, with each question rated on a five-point Likert scale (strongly agree to strongly disagree) (Fig. 7 in appendix). Questions were asked about the use and experience of both the EHR alone and the AI tool. Clinicians were also asked to complete, for both the EHR and the AI tool, the NASA Task Load Index, a six-item, subjective, multi-dimensional tool that rates perceived workload (Fig. 8 in appendix). Questions were rated on a 10-point scale, with 1 representing the lowest and 10 the highest task load.10 Parent-actors were also asked to rate their experience in a bespoke series of Likert-style questions (Fig. 9 in appendix). Questionnaire results were analysed using descriptive statistics (frequencies), and comparisons between the EHR and the AI tool were made using non-parametric tests. For quantitative performance evaluation, we recorded two key time variables: in-room patient consultation time (min:sec), defined as the time from the moment the patient enters the room to the moment the patient leaves, and time spent by the clinician conversing with the patient/family (min:sec). For each clinician, mean times for the recorded variables were calculated across three consultations with the EHR alone and three consultations with the AI tool.
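The specific non-parametric tests are not named in the text; one plausible choice for paired ordinal data is the Wilcoxon signed-rank test, sketched below with invented ratings (the study's actual analysis may have differed).

```python
from scipy import stats

# Hypothetical paired Likert responses (1 = strongly disagree ... 5 = strongly
# agree) from the same eight clinicians under the two conditions.
ehr_ratings = [3, 4, 2, 3, 4, 3, 2, 4]
ai_ratings  = [5, 5, 4, 5, 5, 4, 4, 5]

# Wilcoxon signed-rank test: a paired, non-parametric comparison suitable for
# ordinal questionnaire data.
print(stats.wilcoxon(ehr_ratings, ai_ratings))
```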
A post-session focus group was conducted with each set of clinicians to collect feedback about their experience, their perceived confidence in the technology, whether they thought the technology influenced their interaction with the ‘patient and family’, and willingness to use AI tools in real-world consultations. Two facilitators were present for each focus group. Participants wrote down three words that came to mind when thinking about using the AI tool. Focus groups were audio-recorded with verbal consent and the recordings were thematically analysed, with word clouds used to display the results. Ethical approval was not required since this was a service development evaluation that did not involve patients or staff in their clinical duties.11 The service evaluation was registered formally at GOSH.
Results
Forty-seven consultations were performed during the simulation: 23 EHR-alone consultations (one incomplete) and 24 using the AI tool (Table 1 in appendix). In six standard EHR consults, letters were not completed because clinicians ran out of time; however, notes were available for all EHR consults. A total of five AI tool-assisted documents were also not completed (one letter, four notes). One note/letter pair was missing because of a technical failure in one consultation (accounting for one note and one letter), and three notes were lost to review through human error: the notes were not saved to the EHR system appropriately.
The SAIL scores for the AI tool-assisted notes and letters were higher than those created using the standard EHR, for both letters (proportion scoring >25; 70% with AI tool versus 29% without; z = 2.5, p = 0.012) and notes (proportion scoring >25; 100% with AI tool versus 43% without; z = 2.9, p = 0.004; Figs. 4-6), indicating improved completeness and content within the allocated time of the study.
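The reported z-statistics are consistent with a standard pooled two-proportion z-test. The sketch below reproduces the letters comparison using counts inferred from the reported percentages and the numbers of completed documents; the exact counts are an assumption, so the example is illustrative only.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test: returns z statistic and two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                   # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    return (p1 - p2) / se, 2 * norm.sf(abs((p1 - p2) / se))

# Inferred counts for letters scoring >25 (16/23 ~ 70% AI; 5/17 ~ 29% EHR).
z, p = two_proportion_z(x1=16, n1=23, x2=5, n2=17)
print(f"z = {z:.2f}, p = {p:.3f}")  # ~ z = 2.51, p = 0.012, matching the text
```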
Consultation length averages were calculated from n = 42 consultations conducted by n = 7 clinicians (Table 2 in appendix); one clinician was omitted because technical difficulties prevented the collection of timing data, so the six consults from that clinician were not available. Consultations using the AI tool were significantly shorter overall than those with the EHR alone (difference −193 s; t = 2.247, p = 0.03), equating to a 26.3% time-saving. Mean total time spent talking with patients/family was not significantly different (difference −28 s; t = 1.084, p = 0.32).
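As a check on the arithmetic: a 193 s mean saving equating to 26.3% implies a mean EHR-alone consultation length of roughly 734 s (just over 12 min). A minimal sketch of the paired comparison follows, with hypothetical per-clinician mean times chosen only to reproduce the reported means; the resulting t-statistic depends on the invented spread and will not match the reported value.

```python
from scipy.stats import ttest_rel

# Hypothetical per-clinician mean consultation lengths in seconds (n = 7;
# illustrative values whose means mirror the reported 193 s / 26.3% saving).
ehr = [760, 700, 745, 690, 735, 750, 758]   # mean 734 s
ai  = [560, 515, 548, 512, 540, 555, 556]   # mean 541 s

saving = 1 - sum(ai) / sum(ehr)
print(f"time saving: {saving:.1%}")  # 26.3% with these illustrative values
print(ttest_rel(ehr, ai))            # paired t-test across clinicians
```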
The AI tool functioned well in scenarios with multiple speakers (all consultations had patient/parent actor pairings), with no participant-reported issues attributable to confusing speakers. Whilst all consultations were in English, some speakers were chosen to have non-native and regional accents (native Danish, French, and North American-English speakers), and conversations were carried out in loud environments (the simulated consultations were conducted in an open room with three consultations running simultaneously at any given time). The AI tool selects only information pertinent to the medical note. Clinicians flagged this as a useful function, since it allowed them to have engaging, rapport-building conversations with patients without clinically irrelevant information being included in the note or letter (an example being a discussion regarding superheroes, which was not included in the documentation).

Feedback from the questionnaires demonstrated improved clinician experience and reduced computer disruption with the AI tool compared with the EHR. For instance, 100% of clinicians reported being able to give patients their full attention whilst using the AI tool versus 66% using the EHR alone, and 94% agreed that they had a positive experience in clinic versus 71% using the EHR alone (Fig. 7 in appendix). On the NASA Task Load Index, there was a perceived improvement in five of the six metrics (Fig. 8 in appendix). The greatest improvement was in the metric assessing whether clinicians felt the pace of the clinic was hurried or rushed: in general, they felt less hurried or rushed when using the AI tool than with the EHR alone. Parent-actors also reported some improvement in their experience in the AI scenario: 87% strongly agreed that the clinician gave their full attention to the patient and parent, compared with 75% in the EHR scenario (Fig. 9 in appendix).
Finally, the focus groups suggested positive experiences using the AI tool. Participants identified positive attributes of the AI, linked to ease of use, a simple user interface, time efficiency, structured formatting of the clinic note, and the aid to note-taking. The word cloud highlighted ‘potential’, ‘quick’ and ‘amazing’ as the most commonly selected words (Fig. 10 in appendix). Concerns focused on the accuracy of information generated by the AI tool, specifically occasional additions or omissions of information and resulting inaccuracies in the notes and letters. Some participants commented on the formality of the letters and a loss of narrative through information ordering. Customisation to individual or departmental needs was identified as important for future use, including in relation to the appropriate handling of sensitive information. Clinicians reported having trust issues with the technology initially, but their confidence improved over time.
Discussion
The results of our Phase 2 simulation study demonstrate that the use of ambient technology has the potential to significantly improve letter and note quality beyond standard EHR capability. In addition to improved documentation scores, there were associated time savings of 26.3% across all consultations without a significant reduction in patient-facing time. The potential benefits could multiply with daily use across multiple clinics, suggesting that greater patient throughput could be achieved without increasing staff requirements, while also improving the patient and clinician experience. Clinician feedback was very positive, with participants articulating a willingness to use the technology in future consultations and observing a potential to enhance their note-taking process and improve efficiency. Parent-actor feedback indicated some improvement in experience with the AI tool.
In addition to positive experiential feedback, SAIL assessment of note and letter documentation suggests that the AI tool also positively influences clinic documentation quality, with a greater than two-fold increase in quality seen for both notes and letters with the introduction of the AI tool, within the confines of the study. This reflects a previous study in which the introduction of structured EHR documentation increased overall documentation quality.12 Minimising variation in documentation quality can lower the risk of patient harm through misinterpretation or missed information; by standardising documentation quality, the AI tool has the potential to reduce this risk. Assessors did not have access to the AI tool transcripts during the SAIL assessment; this is a limitation, to be corrected in future phases, as it could not be determined whether missing information was omitted by the AI tool or was never elicited by the clinician. In future studies within clinical practice, we could improve our methodology by including an additional evaluation step: capturing the AI tool-produced letter concurrently and directly comparing its content with the finalised letter, assessing for missing or incorrect information that required the clinician to edit the AI tool documentation.
Task load was reduced with the AI tool compared with the EHR alone; however, clinicians also emphasised the need for refinement and customisation to align with their specific needs and preferences. Complex EHR user interfaces displaying large volumes of information increase cognitive load among clinicians, which can lead to burnout.13 Coupling the simple navigation and user interface of the AI tool with medical specialty customisation would further enhance the clinician's experience.
Whilst no clinically significant erroneous content was identified in this study, hallucination rates were evaluated during the pre-study development phase: 19 dummy SOAP (Subjective, Objective, Assessment and Plan) note examples, generated as a testing cohort outside the study from a published primary care dataset (PriMock), were reviewed by a human clinician for content produced in the note that was not present in the transcript (so-called ‘hallucinations’). Hallucinations were classified by a clinician as either minor or major. Major hallucinations were defined as content that would materially alter the diagnosis, treatment or prognostication of the patient if left uncorrected; minor hallucinations were content that would have no such impact. Across all examples, five notes contained at least one minor (clinically non-significant) hallucination, and two contained at least one major (clinically significant) hallucination. With an average of ∼50 clinical entities per note, this suggests a possible minor hallucination rate in this context of 0.5% and a major hallucination rate of 0.21%. Post-study development work identified several key areas to reduce hallucinations, and a repeat test reduced the major hallucination rate to 0% and the minor rate to 0.21%. This highlights that it remains the clinician's professional responsibility to check documentation prior to final approval, and this must be reinforced in all training for subsequent studies and future deployment of such tools; this is current practice when using a human or AI scribe to document consultations. A limitation of this testing was the lack of a diverse and generalisable dataset for the cohort studied (a primary care dataset applied to a quaternary paediatric care setting). Future research should encompass consultations across varying patient populations, providing a range of accents, clinical scenarios and people present for testing the AI tool. This can be achieved by conducting the study at different clinical sites across a multicultural region.
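To make the rate arithmetic explicit (assuming, as the reported figures imply, one counted hallucination per affected note): 19 notes at ∼50 clinical entities each gives ∼950 entities, so 5/950 ≈ 0.5% minor and 2/950 ≈ 0.21% major. A trivial sketch:

```python
notes, entities_per_note = 19, 50           # testing cohort and ~entities/note
total_entities = notes * entities_per_note  # ~950 clinical entities overall

# Assumes one counted hallucination per affected note, which reproduces the
# reported rates; the underlying counting method is not stated in the text.
minor, major = 5, 2
print(f"minor: {minor / total_entities:.2%}")  # ~0.53%, reported as ~0.5%
print(f"major: {major / total_entities:.2%}")  # ~0.21%
```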
The study also highlighted that, despite AI assistance, human error may result in a failure to achieve full benefit, as with any new software system. Before further evaluation or deployment of such tools, clinicians must be fully trained and competent in the required procedures. In addition, observers anecdotally noted that without the AI tool, clinicians frequently interchanged between speaking with the patient/parent and typing during the consultation. Reduction in task switching appeared to be a major contributor to time saved during the consultations, an area to evaluate in future studies. Finally, it should be noted that whilst the patient scenarios were realistic and spanned multiple clinical specialties, they may not represent highly complex consultations with large pre-existing volumes of medical record information; the accuracy of notes and letters should be further evaluated across a wide range of consultation types and complexities.
The use of speech recognition for dictating clinical documentation is well established, particularly for radiology and pathology reporting, and is now becoming more widespread across healthcare, including general clinical tasks and nursing.14,15 However, traditional speech recognition requires composition by the clinician (task switching as with typing on a computer keyboard) and does not in itself improve the quality of documentation or reporting.16 The development of ambient AI technologies beyond simple speech-to-text allows such tools to work unobtrusively in the background, whilst identifying and structuring relevant information and minimising subsequent requirements for document editing.
It should be noted that, in this context, AI is not being used to formulate diagnoses or to recommend treatments and management strategies. Although AI has the potential to support and enhance these elements of a clinical consultation, it was agreed that this functionality should not be incorporated into clinical practice without extensive scrutiny and detailed evaluation for accuracy, relevance and avoidance of bias. A framework to evaluate the efficacy of such tools is being developed.
Based on these findings, further studies should evaluate the use of such tools in real-world consultations versus current standard practice in a live EHR environment for clinical documentation across multiple disciplines and settings.
Declaration of competing interest
The authors have no conflicts of interest.
Footnotes
This article reflects the opinions of the author(s) and should not be taken to represent the policy of the Royal College of Physicians unless specifically stated.
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.fhj.2024.100157.
Appendix. Supplementary materials
References
- 1. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28:31–38. doi:10.1038/s41591-021-01614-0.
- 2. What are the advantages of electronic health records? HealthIT.gov. https://www.healthit.gov/faq/what-are-advantages-electronic-health-records (accessed April 22, 2019).
- 3. Muhiyaddin R, Elfadl A, Mohamed E, et al. Electronic health records and physician burnout: a scoping review. Stud Health Technol Inform. 2022;289:481–484. doi:10.3233/SHTI210962.
- 4. Nedbal C, Naik N, Castellani D, Gahuar V, Geraghty R, Somani BK. ChatGPT in urology practice: revolutionizing efficiency and patient care with generative artificial intelligence. Curr Opin Urol. 2023; published online Nov 14. doi:10.1097/MOU.0000000000001151.
- 5. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. 2022; published online Dec 26. doi:10.48550/arXiv.2212.13138.
- 6. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2. doi:10.1371/journal.pdig.0000198.
- 7. Yu P, Xu H, Hu X, Deng C. Leveraging generative AI and large language models: a comprehensive roadmap for healthcare integration. Healthcare. 2023;11:2776. doi:10.3390/healthcare11202776.
- 8. AI for Every Doctor - Tortus. 2023; published online June 25. https://tortus.ai/ (accessed Nov 30, 2023).
- 9. Crossley JGM, Howe A, Newble D, Jolly B, Davies HA. Sheffield Assessment Instrument for Letters (SAIL): performance assessment using outpatient letters. Med Educ. 2001;35:1115–1124. doi:10.1046/j.1365-2923.2001.01065.x.
- 10. Hart S. NASA Task Load Index. NASA Ames Research Center, Moffett Field; 1986. https://ntrs.nasa.gov/api/citations/20000021487/downloads/20000021487.pdf.
- 11. Is my study research? https://www.hra-decisiontools.org.uk/research/ (accessed Nov 30, 2023).
- 12. Ebbers T, Kool R, Smeele L, et al. The impact of structured and standardized documentation on documentation quality; a multicenter, retrospective study. J Med Syst. 2022;46(7):46. doi:10.1007/s10916-022-01837-9.
- 13. Kroth PJ, Morioka-Douglas N, Veres S, et al. Association of electronic health record design and use factors with clinician stress and burnout. JAMA Netw Open. 2019;2(8). doi:10.1001/jamanetworkopen.2019.9609.
- 14. Aldosari B, Babsai R, Alanazi A, Aldosari H, Alanazi A. The progress of speech recognition in health care: surgery as an example. Stud Health Technol Inform. 2023;305:414–418. doi:10.3233/SHTI230519.
- 15. Joseph J, Moore ZEH, Patton D, O'Connor T, Nugent LE. The impact of implementing speech recognition technology on the accuracy and efficiency (time to complete) clinical documentation by nurses: a systematic review. J Clin Nurs. 2020;29:2125–2137. doi:10.1111/jocn.15261.
- 16. Blackley SV, Huynh J, Wang L, Korach Z, Zhou L. Speech recognition for clinical documentation from 1990 to 2018: a systematic review. J Am Med Inform Assoc. 2019;26:324–338. doi:10.1093/jamia/ocy179.