Plastic and Reconstructive Surgery Global Open. 2025 Jan 16;13(1):e6450. doi: 10.1097/GOX.0000000000006450

Artificial Intelligence Scribe and Large Language Model Technology in Healthcare Documentation: Advantages, Limitations, and Recommendations

Sarah A. Mess, Alison J. Mackey, David E. Yarowsky
PMCID: PMC11737491  PMID: 39823022

Summary:

Artificial intelligence (AI) scribe applications in the healthcare community are in the early adoption phase and offer unprecedented efficiency for medical documentation. They typically use an application programming interface with a large language model (LLM), for example, Generative Pretrained Transformer 4 (GPT-4). They apply automatic speech recognition to the physician–patient interaction and generate a full medical note for the encounter, together with a draft follow-up email for the patient and, often, recommendations, all within seconds or minutes. This gives physicians increased cognitive freedom during medical encounters because less time is needed interfacing with electronic medical records. However, careful proofreading of the AI-generated language by the physician signing the note is essential: insidious and potentially significant errors of omission, fabrication, or substitution may occur. The neural network algorithms of LLMs have unpredictable sensitivity to user input and inherent variability in their output, and LLMs are unconstrained by established medical knowledge or rules. As they gain increasing levels of access to large corpora of medical records, the resulting explosion of discovered knowledge comes with large potential risks, including risks to patient privacy and bias in algorithms. Medical AI developers should support robust regulatory oversight, adhere to ethical guidelines, correct bias in algorithms, and improve detection and correction of deviations from the intended output.


Takeaways

Question: How do artificial intelligence (AI) scribes work and what are the advantages and limitations of using them?

Findings: AI scribes harness large language models, such as Generative Pretrained Transformer 4, trained on large medical records corpora to create medical documentation. They provide benefits in terms of efficiency and more patient-focused encounters. They also pose risks including inaccuracies, biases, and potential breaches of privacy.

Meaning: The article provides an overview of advantages, limitations, and practical recommendations for physicians and medical AI developers to enhance the reliability and safety of AI-generated medical documentation.

WHAT IS ARTIFICIAL INTELLIGENCE SCRIBE TECHNOLOGY?

As has been extensively discussed in a wide range of forums, artificial intelligence (AI) represents a quantum leap for human efficiency. AI is currently being adapted for generating medical documentation, and early reports suggest these tools are being adopted rapidly. The Permanente Medical Group (TPMG), for example, enabled ambient AI scribe technology in 2023 and reported that 3442 TPMG physicians used the tool across 303,266 patient encounters in its first 10 weeks.1 There is clearly a need for evaluation of AI in a medical context. The authors of the current article are a physician who is an early AI adopter, a linguist focusing on human interaction, and a computer scientist with extensive experience with AI and large language models (LLMs). Our goal in this article is to share the early adopter’s experiences with an AI scribe application, focusing on its advantages and limitations, along with recommendations for medical AI developers.

HOW DOES AI SCRIBE WORK?

AI scribes are phone-, tablet-, or desktop-based applications (eg, Freed, https://www.getfreed.ai/; Heidi, https://www.heidihealth.com/) that listen to the physician–patient interaction and generate an impressively detailed subjective, objective, assessment, and plan (SOAP) note, together with an email summary intended for the patient, all typically generated within a minute of the end of the appointment. An AI scribe costs approximately $100 per month per user, versus an average human medical scribe salary of $2800 per month.2 How is it done (Fig. 1)? First, automatic speech recognition technology converts the audible interaction to text. Next, the LLM being used, for example, Generative Pretrained Transformer 4 (GPT-4) by OpenAI, completes the complex task of writing by using artificial neural network algorithms trained on potentially trillions of words of both general and medical text and applying them to the automatic speech recognition output of both the physician and patient statements during the encounter.3–5 AI scribe companies provide an application programming interface (API) with GPT-4 or other LLMs for producing documentation specific to healthcare.6 In this article, we report on the early adopter physician author’s use of Freed AI, cloud-based software that uploads securely to Microsoft’s Azure cloud, which is Health Insurance Portability and Accountability Act (HIPAA)–compliant. Freed claims that it does not learn from protected health information (PHI) and that it deletes notes after 30 days.7 Microsoft offers its own AI medical note-taking tool called Nuance DAX Copilot.8 Other AI scribe vendors use cloud service providers such as Google Cloud Platform or Amazon Web Services to remotely host the LLM infrastructure.9 Google’s LLM system used for some medical scribe applications is called Med-Gemini and uses Med-PaLM technology.10 Amazon Web Services HealthScribe is powered by Bedrock, an umbrella of foundation models including the LLMs Claude and Claude Instant.11 The commonly used publicly available version of GPT-4 overcomes being out of date by learning from (and often memorizing) the text that millions of people upload when they use ChatGPT.12 Thus, physicians need to be clear that they should not send PHI through these services, because ChatGPT (and others) essentially retain their input for training, and it may be possible to reidentify and expose the incorporated PHI.9 The Food and Drug Administration (FDA) has purview over the use of AI in healthcare, regulating software as a medical device.3,13 However, an AI scribe may not fit the definition of software as a medical device unless it is considered clinical decision support.14 As with many developments in AI, technology changes so quickly that it can be hard for government oversight and regulatory bodies to keep up.

Fig. 1. Diagram of how AI scribe works.
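To make this two-stage pipeline concrete, the following is a minimal sketch in Python, not any vendor’s actual implementation: transcribe_audio is a hypothetical placeholder for a HIPAA-compliant automatic speech recognition service, and the prompt wording and model name are illustrative assumptions.

```python
# Minimal sketch of an AI scribe pipeline: ASR output -> LLM -> SOAP note.
# transcribe_audio() is a hypothetical stand-in for a speech recognition
# service; the prompt and model name below are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def transcribe_audio(audio_path: str) -> str:
    """Placeholder for the automatic speech recognition step."""
    raise NotImplementedError("Substitute a HIPAA-compliant ASR service here.")

SOAP_PROMPT = (
    "You are a medical scribe. From the following physician-patient "
    "transcript, write a SOAP note (Subjective, Objective, Assessment, "
    "Plan) and a brief follow-up email for the patient."
)

def generate_soap_note(audio_path: str) -> str:
    transcript = transcribe_audio(audio_path)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SOAP_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```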

AI SCRIBE IMPROVES PHYSICIAN–PATIENT INTERACTIONS

Harnessing AI to scribe removes the computer screen as a focus point in a physician–patient appointment (Fig. 2). The scribe allows the physician to be a more involved and active listener and communicator through increased observation of patient gestures, sustained eye contact, body language, and nonverbal communication, all done more easily when facing patients rather than screens during the interaction. The AI takes on the role of extracting and collating details and making recommendations for diagnostic tests, office visits, referrals to other doctors, prescriptions, activity levels, and instructions for care. An important limitation, though, is that the findings of physical examinations need to be verbalized to be captured by the AI.15,16 AI may also capture a fuller set of clinically relevant information from dialogue with patients than a physician can alone, given that AI is immune to cognitive overload, distraction, fatigue,9 and time constraints. AI scribes can also document critical junctures between patients and physicians, such as informed consent or a change of surgical plan. AI also helps the physician close the communication loop by drafting an email to be sent to the patient summarizing the appointment. (See appendix, Supplemental Digital Content 1, which displays an email summary created for the patient, http://links.lww.com/PRSGO/D776.)

Fig. 2. Advantages of AI scribe.

AI SCRIBES AND RECORDINGS

It seems likely that (saved) AI transcripts will become a regular feature of the courtroom and may influence insurance payouts in the near future. There are obvious issues of legality where recordings are concerned, with different countries, and different US states, having different regulations about audio recording with and without prior knowledge and consent. Because voices and sounds are not audio-recorded but are transcribed in real time, technically, there is no recording. Currently, automatic transcripts seem to be generally considered a middle ground, because the software creating the consultation summary is akin to the already established and accepted practices of having a human medical scribe or another type of virtual assistant in the room. AI scribe companies should, of course, consider requirements for disclosure in different jurisdictions. As good practice, prior written informed consent should be obtained before using any type of AI scribe. Freed AI, for example, includes a recommended script for explaining its purpose and obtaining consent from patients. (See appendix, Supplemental Digital Content 2, which displays how to explain AI scribe to patients, by Freed AI,7 http://links.lww.com/PRSGO/D777.)

AI AND PRIMARY CARE

Current commercial AI scribe software is designed for physicians of all types, from primary care to specialized surgeons. Medical AI scribe developers aggregate the knowledge of many fields of medicine and use the best notes of the medical field as target learning material for note content and format. If medical conditions are mentioned during the review of the medical history, the AI may provide clinical decision support9 and recommend follow-up steps for conditions unrelated to the current consultation, such as “consultation with your gastroenterologist to effectively manage your acid reflux and IBS.” Although GPT-4 has trained on open-source medical research literature, websites, podcasts, and videos,12 it has historically not had access to electronic health records (EHRs) on private networks for training its general foundation model, necessitating additional clinical-document training by AI scribe vendors.6

AI DISTINGUISHES VOICES WITH LIMITATIONS

AI scribe software such as Freed AI usually distinguishes voices (eg, patient and physician) despite currently using only 1 microphone to create the SOAP note. There are some performance limitations, though (Fig. 3). For example, because current standard LLMs are limited to text-based interactions, they cannot currently assess gestures, facial expressions, or other body language. Likewise, conversational pauses, eye contact, back-channeling, and a range of other interactional strategies are not assessed.4 Often, a physician’s comments and explanations may be incorrectly attributed to the patient. Personal, unrelated conversations, part of the chit-chat of human interaction, do not always appear as part of the note, but sometimes they do, and when this happens, they need to be identified and deleted by the proofreader. When multiple interactions over an extended time are captured as a single session, a multiple-patient summary can be generated in error. (See appendix, Supplemental Digital Content 3, which displays a multiple-interaction summary that was added to the assessment/plan section. For context, the physician exited patient 0’s room, then entered another room and consulted with patient 1, then entered the corridor to talk to staff. Patients 2, 3, and 4 were related to answering staff questions about other patients. Information is deidentified and the names have been changed, http://links.lww.com/PRSGO/D778.)

Fig. 3. Limitations of AI scribe.

For these reasons, we recommend that medical AI developers incorporate an introductory phase in which the AI familiarizes itself with the distinct voices of the physician and office staff, along with an alert for when a new voice enters the encounter, so that the physician can either confirm that the encounter is continuing or tap the end-visit button. A sketch of this idea follows.
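The following is a minimal sketch of such a new-voice alert, under stated assumptions: it presumes a hypothetical upstream step (not shown) that maps each audio segment to a speaker-embedding vector, as real diarization toolkits provide, and the 0.75 similarity threshold and 192-dimensional vectors are illustrative choices, not validated values.

```python
# Sketch of the recommended "new voice" alert: enroll known voices, then
# flag segments whose speaker embedding matches no enrolled speaker.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class VoiceRegistry:
    """Enroll known voices (physician, staff) and flag unfamiliar speakers."""

    def __init__(self, threshold: float = 0.75):
        self.known: dict[str, np.ndarray] = {}
        self.threshold = threshold

    def enroll(self, name: str, embedding: np.ndarray) -> None:
        self.known[name] = embedding

    def identify(self, embedding: np.ndarray) -> str | None:
        """Return the best-matching enrolled speaker, or None if no match."""
        best_name, best_score = None, self.threshold
        for name, reference in self.known.items():
            score = cosine_similarity(embedding, reference)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name

def check_segment(registry: VoiceRegistry, embedding: np.ndarray) -> None:
    if registry.identify(embedding) is None:
        print("ALERT: unrecognized voice -- confirm the encounter is "
              "continuing, or end the visit.")

# Illustrative usage with made-up embedding vectors:
rng = np.random.default_rng(1)
registry = VoiceRegistry()
registry.enroll("physician", rng.normal(size=192))
registry.enroll("medical_assistant", rng.normal(size=192))
check_segment(registry, rng.normal(size=192))  # unrelated vector -> alert
```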

AI AND PROCEDURE NOTES

Freed AI offers a SOAP note but currently no procedure note. This could be intentional due to compliance issues, and/or such a capability may be forthcoming. When attempting to provide the AI with content that can be used for a procedure note, step-by-step details need to be verbalized. Unlike with dictation software, there are no predefined hot words that trigger a phrasal expansion for standard consent or procedure detail. Because procedure notes are important for documenting what was done and the major findings, it might be more productive to dictate procedure notes separately or to use templates with metaphrases.

AI AND NOTE VARIABILITY

LLMs often vary in their output despite attempts to standardize input.9,17 For example, verbally reviewing the FDA breast implant safety checklist should logically result in a standard list of risks in the AI-generated email to the patient. However, currently, the notes vary from including all the bullet points of the consent to only a select few. Medical AI developers should consider ways to achieve more uniform outcomes, perhaps by attending to emphasis cues, such as a statement being repeated or verbally stressed. In the AI field, LLMs are currently understood to be unpredictably sensitive to user prompts and input.18 It is useful to note that LLM applications can have their “temperature” configured higher for more creativity (and aggressive nonliteral summarization) or lower for more reproducible, literal, and deterministic output;9 the toy example below illustrates this effect.
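To show why a lower temperature yields more literal, reproducible output, here is a toy next-token sampler; the logit values are invented for illustration and do not come from any real model.

```python
# Toy demonstration of temperature in next-token sampling.
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng) -> int:
    """Sample a token index from temperature-scaled softmax probabilities."""
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([4.0, 3.5, 1.0])  # token 0 only slightly favored over token 1

for t in (0.2, 1.5):
    draws = [sample_token(logits, t, rng) for _ in range(1000)]
    share = 100 * draws.count(0) / len(draws)
    print(f"temperature={t}: token 0 sampled {share:.0f}% of the time")

# Low temperature concentrates probability on the top token (near-deterministic);
# high temperature spreads it out, so repeated runs diverge more.
```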

AI AND LLM DEVELOPMENT

Companies developing LLMs are known to keep a tight lid on their methodology,19 with many aspects of their programming considered closely guarded trade secrets. The general “black box” behavior of neural network algorithms needs to be considered in relation to variability. Output is based on complex, multilayered learned connections that are opaque20 and generally inscrutable even to the engineers.21 It is difficult to constrain neural networks with established medical knowledge or rules, and incorrect conclusions are often inferred during their unsupervised learning. Currently, the FDA has limited ability to regulate these adaptive algorithms and unsupervised learning for many applications due to both technological and financial constraints.3 “White box” auditing, in which testers have full access to internal code and design, has been proposed to give government regulators unrestricted access to LLMs,22 although this would do relatively little to reduce their inherently opaque and unpredictable behavior. Manual edits could also be utilized as “learn this style” feedback, along with queries to the company, to help the AI better understand a physician’s desired style and to point out errors for improved adaptive learning. Again, we recommend that medical AI developers incorporate an introductory phase to learn about the user, their dictation style, and compliance prerequisites.

AI HALLUCINATIONS: ERRORS OF FABRICATION AND SUBSTITUTION

AI hallucinations are usually not in the form of linguistic errors and thus are not mistakes of the kind we are used to seeing from human or automatic transcriptions. LLMs are very powerful at being fluent, grammatical, and coherent. However, they sometimes override what they hear and put in place what they think was intended (or rather, what would be most likely in a given context).20 Examples of problematic hallucinations in the physician author’s AI notes follow. In an email to a patient for implant removal without capsulectomy, the medical AI wrote, “Capsules removed during surgery will be sent to pathology for examination.” However, capsulectomy was explicitly not part of the surgical plan and was neither discussed nor implied. The AI replaced “Aveli for cellulite” with “Qwo for cellulite,” although Qwo was never mentioned; Qwo is also no longer on the market. LLM output often replaces something that is rare or unseen in the training data with something more common and likely in the given context or scenario.23

Filler brands are not recognized by the AI scribe either and are almost always replaced with “Botox,” which the AI seems to treat as a sort of generic word for filler. In another example, “Sciton BBL,” BroadBand Light for facial skin, was replaced with “Brazilian Butt Lift.” In other words, abbreviations and brand names currently require more guidance to the AI. These hallucinated LLM replacements were again probably based on training data that is a few years old.

There are various reasons why LLMs do not update, including freezing of the training epoch, paywalls to information, intellectual property issues, and piggybacking on older models. Retraining on, or overriding with, the current vernacular of brand names would trigger fewer corrections; because third-party plug-ins already have that ability, we recommend that medical AI developers incorporate this capability, along the lines sketched below.
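A minimal sketch of such a practice-supplied vocabulary override follows, using only the Python standard library. The term list and the 0.7 similarity cutoff are illustrative assumptions, and a production system would more plausibly work on the speech recognizer’s candidate hypotheses rather than on the finished note.

```python
# Sketch of a brand-name override: fuzzy-match words in the draft note
# against a practice-supplied vocabulary and flag likely substitutions.
import difflib
import re

CUSTOM_VOCABULARY = ["Aveli", "Sciton BBL", "procerus"]  # practice-specific terms

def flag_near_misses(note: str, vocabulary: list[str],
                     cutoff: float = 0.7) -> list[tuple[str, str]]:
    """Return (note_word, vocabulary_term) pairs that look like substitutions."""
    suspects = []
    for word in set(re.findall(r"[A-Za-z]+", note)):
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches and matches[0].lower() != word.lower():
            suspects.append((word, matches[0]))
    return suspects

# Example: a garbled "braecerus" in a draft note is flagged against "procerus".
# (Multiword terms such as "Sciton BBL" would need n-gram matching in practice.)
print(flag_near_misses("Injected the braecerus muscle.", CUSTOM_VOCABULARY))
```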

HOW TO FACILITATE PROOFREADING

With rare or unknown terminology, we recommend that the AI draw attention to uncertainty and low confidence. This does not yet seem to be a frequent feature. In 9 months of use of Freed AI by the first author, it has admitted to misunderstanding a word only twice, writing “braecerus” in quotes as an explicitly uncertain rendition of the intended “procerus,” and placing “bursty” in quotes as an explicitly uncertain rendition of the intended “tuberous” [breasts]. Although there are likely better ways to indicate uncertainty in either the speech recognition or the context-driven generation,20 inclusion of quotes as a marker of uncertainty is far preferable to no indication at all. We recommend that medical AI developers incorporate better ways (eg, highlighted text, confidence scores expressed as percentages) to draw physicians’ attention to where the AI has low confidence and to prompt the user and/or proofreader to pay attention and issue confirmations or corrections; a simple version is sketched below.
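The following is a minimal sketch of such low-confidence highlighting, assuming the speech recognition step returns per-word confidence scores (many ASR services expose these); the 0.80 threshold and the bracket-and-percentage marker are illustrative choices.

```python
# Sketch of low-confidence highlighting for the proofreading physician.
def mark_uncertain(words: list[tuple[str, float]], threshold: float = 0.80) -> str:
    """Wrap low-confidence words so the proofreader can spot them at a glance."""
    rendered = []
    for word, confidence in words:
        if confidence < threshold:
            rendered.append(f'[?"{word}" {confidence:.0%}]')
        else:
            rendered.append(word)
    return " ".join(rendered)

asr_output = [("injected", 0.97), ("the", 0.99), ("procerus", 0.62)]
print(mark_uncertain(asr_output))
# -> injected the [?"procerus" 62%]
```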

AI AND BIAS: RACE, SEX, AND SPONSORSHIP

As has been documented in the AI literature, LLMs may develop sexist and racist biases, which are often attributable to the fact that they learn in part from World Wide Web sources such as Reddit and social media. ChatGPT, for example, has been observed to attribute occupations requiring higher education to men, high-sentiment terms to “Asian” individuals, and low-sentiment terms to “Black” individuals.24 Biases arise when there is underrepresentation of race and gender in the training data, reflected in facts such as Black women being less likely to undergo genetic testing of their breast tumors.25 Underrepresentation could be a result of socioeconomic status, incomplete health records, or disparity of care. Biases could also occur with overemphasis on outdated medical practices.3 As an example of potential bias, during note generation for a labiaplasty consultation by the physician author, Freed AI changed “clitoral hood” to “prepuce,” a term that is either a male anatomical analogue or a gender-neutral word, and one that did not capture the physician’s intention to precisely convey a plan relating to the clitoral hood anatomy.

Sponsorship biases can also arise in healthcare LLMs, given the control exerted by the companies producing the technology and/or hosting these cloud applications, such as OpenAI, Microsoft, Google, and Amazon. The costs associated with developing and maintaining healthcare APIs could make smaller, third-party companies susceptible to sponsorship bias.9 Single-source bias occurs when the training data come from one source13 and could become an issue as control over AI centralizes.

AI AND THE IMPORTANCE OF CHECKING

As AI technologies make fewer errors, it is natural for users to trust them more and spend less time editing and proofreading. Physicians should nonetheless always look out for places where the AI might omit or fabricate information in a subtle, hard-to-detect, but still significantly false manner. For this reason, the first author spends more time proofreading the AI’s detailed notes and emails than she previously spent charting with templates and drop-down menus. The algorithms depend on pattern matching and frequency rather than abstract human thought and logic, and it is important to understand that their accuracy declines with rare, novel, or intricate concepts.26 AI trains on large datasets: GPT-4 was reportedly trained on 13 trillion tokens, whereas its predecessor GPT-3 had 175 billion parameters to learn from.24 AI searches for emerging patterns and normalizes toward the average.4 Physicians should not become complacent as the technology improves.

AN AI-AUGMENTED FUTURE?

We believe that the age-old medical pedagogy of learning how to write a SOAP note will change with increasing use of AI, as will all genres of writing in all fields. Some express concern that overreliance on automation will change, and potentially reduce, the skills of physicians.21 Multimodal LLMs based on image, video, and text are rapidly being developed,11 with, as noted, nonverbal cues likely becoming interpretable by future AI scribes. Teaching communication skills for capture by multimodal LLMs might become a component of continuing medical education. Coaching such skills might be profitably built into the AIs themselves by medical AI developers. The landscape of AI is changing so rapidly that next year or even next month will reveal new standards, norms, and ideas, and render some thoughts in the current article obsolete.

CONCLUSIONS

The medical field, like many others, has entered an unheralded era of AI-generated health documentation, data collection, and analysis. The repository of electronic medical records is growing by 50% annually, and AI is simultaneously processing such repositories for a growing range of applications, unlocking their huge potential for knowledge discovery.5,27 Before 2023, LLMs had not had access to large corpora of medical data,3 but now the floodgates have opened. Epic, America’s largest EHR company, with 29% of market share and 305 million patient charts, has partnered with Microsoft’s Azure OpenAI.6 Clinical information can be extracted and mined from handwritten notes as well.28 Natural language processing technologies are being harnessed to write clinic notes, respond to patients, provide point-of-care decision support,29 simulate patient encounters for training,4 improve workflow, automate the coding of claims, develop treatment protocols, predict complications, diagnose,5 make discoveries, and decrease expenditures.6 Concomitant with this potential explosion of knowledge from medical AIs, there is also the potential for exploitation of patient privacy, development of biased algorithms,16 and fabrication of unreliable content.6 Medical AI developers need to work within robust regulatory oversight to protect data privacy and security, adhere to ethical guidelines,3 de-bias their algorithms,30 test for accuracy,22 and detect and correct deviations from the intended output (Fig. 4). We recommend that physicians adopt a signature disclaimer, for example, “This note was drafted using AI and may potentially contain errors of artificial neural network omission, substitution, or fabrication.” Will the healthcare systems installing the software, the LLM developers, the EHR companies, and the makers of healthcare plug-ins to LLMs also share responsibility for errors? It seems unlikely. We believe physicians will likely be at the forefront of accountability for errors.12

Fig. 4. Recommendations for improvement of AI scribe.

DISCLOSURES

Dr. Mess is on the speaker’s bureau for Allergan and Sciton. She has consulted for Becton, Dickinson, and Company once in the past 2 years and Medical Mutual Liability Insurance Society of Maryland twice in the past year. The other authors have no financial interest to declare in relation to the content of this article.

Supplementary Material

gox-13-e6450-s001.pdf (2.4MB, pdf)
gox-13-e6450-s002.pdf (1.5MB, pdf)
gox-13-e6450-s003.pdf (1.1MB, pdf)

Footnotes

Published online 16 January 2025.

Presented on AI Scribe at PSTM24, September, 2024, San Diego.


Related Digital Media are available in the full-text version of the article on www.PRSGlobalOpen.com.


