Published in final edited form as: NEJM AI 2025 Nov 26;2(12). DOI: 10.1056/aioa2501000.

Ambient AI Scribes in Clinical Practice: A Randomized Trial

Paul J Lukac 1,2, William Turner 3, Sitaram Vangala 3, Aaron T Chin 2,4, Joshua Khalili 5, Ya-Chen Tina Shih 5,6,7,8, Catherine Sarkisian 3,9, Eric M Cheng 2,10, John N Mafi 3,11

Abstract

BACKGROUND

Ambient artificial intelligence (AI) scribes record patient encounters and rapidly generate visit notes, representing a promising solution to documentation burden and physician burnout. However, the scribes’ impacts have not been examined in randomized clinical trials.

METHODS

In this parallel three-group pragmatic randomized clinical trial, 238 outpatient physicians, representing 14 specialties, were assigned 1:1:1 via covariate-constrained randomization (balancing on time-in-note, baseline burnout score, and clinic days per week) to one of two AI scribe applications — Microsoft Dragon Ambient eXperience (DAX) Copilot or Nabla — or to a usual-care control group from November 4, 2024, to January 3, 2025. The primary outcome was the change from baseline in log-transformed time-in-note (time spent writing each note). Secondary end points, measured by surveys, included the Mini-Z 2.0, a four-item physician task load (PTL) instrument, and the Professional Fulfillment Index — Work Exhaustion (PFI-WE), which evaluate aspects of burnout, work environment, and stress, as well as targeted questions addressing safety, accuracy, and usability.

RESULTS

DAX was used in 33.5% of 24,696 visits; Nabla was used in 29.5% of 23,653 visits. Nabla users experienced a 9.5% (95% confidence interval [CI], −17.2% to −1.8%; P=0.02) decrease in time-in-note versus the control group, whereas DAX users exhibited no significant change versus the control group (−1.7%; 95% CI, −9.4% to +5.9%; P=0.66). Increases in total Mini-Z (scale 10–50; DAX +2.83 [95% CI, +1.28 to +4.37]; Nabla +2.69 [95% CI, +1.14 to +4.23]) and reductions in PTL (scale 0–400; DAX −39.9 [95% CI, −71.9 to −7.9]; Nabla −31.7 [95% CI, −63.8 to +0.4]) and PFI-WE (scale 0–4; DAX −0.32 [95% CI, −0.55 to −0.08]; Nabla −0.23 [95% CI, −0.46 to +0.01]) scores suggest improvement for users of either scribe versus the control. One grade 1 (mild) adverse event was reported, while clinically significant inaccuracies were noted “occasionally” on five-point Likert questions (DAX 2.7 [95% CI, 2.4 to 3.0]; Nabla 2.8 [95% CI, 2.6 to 3.0]).

CONCLUSIONS

Nabla reduced time-in-note versus the control. Both DAX and Nabla resulted in potential improvements in burnout, task load, and work exhaustion, but these secondary end point findings need confirmation in larger, multicenter trials. Clinicians reported that performance was similar across the two distinct platforms, and occasional inaccuracies observed in either scribe require ongoing vigilance. (Funded by the University of California, Los Angeles, Department of Medicine and others; ClinicalTrials.gov number, NCT06792890.)

Introduction

Burnout afflicts nearly half of physicians in the United States, approaching endemic levels and fueling a workforce exodus that amplifies an already critical physician shortage.1–5 Pernicious symptoms of exhaustion, depersonalization, and a diminished sense of personal accomplishment have far-reaching consequences. The effects manifest across the health care ecosystem, jeopardizing care access, doubling the risk of patient-safety events, incurring billions of dollars in undue costs, and endangering physician well-being.6–9 While causes of burnout are complex and multifactorial, electronic health records (EHRs) and related documentation burden are frequently cited culprits.10–14 This excessive EHR charting forces physicians to spend as much as 2 hours documenting for every hour of direct care.15,16 A systematic review detailing EHR characteristics associated with burnout identified “insufficient time for documentation” as a top contributor.17

To address this problem, human scribes, both in person and virtual, have been deployed with some success, though these options present challenges around cost and accessibility, particularly among less well-resourced specialties and settings.18–21 Digital scribes, which incorporate artificial intelligence (AI) to generate drafts that still require human editing, have shown variable impact, with little to no efficiency gain or burnout improvement, but generally positive user reception.22–24 Fully autonomous ambient AI scribes, which leverage large language models (LLMs), have been received with great enthusiasm by both industry and providers and are less expensive than human scribes, increasing scalability and the potential for widespread adoption.25–28 At present, over 50 ambient AI companies offer serviceable products, though few studies have rigorously evaluated their impact.2,29–31 Two small, nonrandomized pilot studies reported that Dragon Ambient eXperience (DAX) Copilot usage decreases documentation time, EHR time, physician task load (PTL), and aspects of burnout.32–34 However, a larger, nonrandomized study of DAX showed no significant changes in financial and EHR-use metrics, while, discordantly, survey respondents in the intervention group subjectively reported decreased EHR time.35,36 Separately, in a large pilot study of Nabla, researchers reported that Nabla usage resulted in small reductions in charting time and favorable physician reception.37

Despite the many questions surrounding the use of AI in health care as well as the novel challenges of generative AI (genAI), there is a dearth of randomized clinical trials evaluating its effectiveness and safety in real-world settings.29,38–40 This trial, evaluating two AI scribes, aimed to investigate their effects on documentation time and physician psychometrics, as well as their usability, accuracy, and safety. In addition, with many AI options available to health care providers, this study sought to compare two leading vendors with a control group to inform future investment in this nascent technology.

Methods

TRIAL DESIGN

Enrolled physicians (N=238) were randomly assigned 1:1:1 to one of two intervention groups, one provisioned with Microsoft DAX Copilot (N=79) and one provisioned with Nabla (N=79), or to a contemporaneous control group (N=80). To achieve cohort balance, the study statistician performed covariate-constrained random assignment on physician baseline time-in-note (an Epic Systems, Inc. [Verona, WI] Signal metric), a single-item burnout score,41 and the number of self-reported clinic days per week. Further details on covariate-constrained randomization can be found in the Supplementary Appendix. A mandatory prestudy survey was sent in September 2024, and a poststudy survey was sent on January 4, 2025. The trial was implemented from November 4, 2024, to January 3, 2025.
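Purely to illustrate the general technique (not the study’s actual implementation; all names and parameters below are hypothetical), covariate-constrained randomization can be sketched as follows: generate many candidate 1:1:1 allocations, score each for balance on the three covariates, and draw the final allocation at random from the best-balanced subset.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)

def balance_score(covariates: np.ndarray, groups: np.ndarray) -> float:
    """Sum of squared deviations of each arm's covariate means from the
    overall mean. covariates is (n_physicians, 3), holding z-scored
    baseline time-in-note, single-item burnout, and clinic days per week."""
    arm_means = np.stack([covariates[groups == g].mean(axis=0) for g in range(3)])
    return float(((arm_means - covariates.mean(axis=0)) ** 2).sum())

def constrained_randomize(covariates: np.ndarray,
                          n_candidates: int = 10_000,
                          keep_frac: float = 0.01) -> np.ndarray:
    """Return one 1:1:1 allocation drawn at random from the best-balanced
    candidate allocations (smallest balance scores)."""
    n = covariates.shape[0]
    base = np.arange(n) % 3                   # near-equal arms (79/79/80)
    scored = []
    for _ in range(n_candidates):
        groups = rng.permutation(base)
        scored.append((balance_score(covariates, groups), groups))
    scored.sort(key=lambda pair: pair[0])     # best-balanced first
    kept = scored[: max(1, int(n_candidates * keep_frac))]
    return kept[rng.integers(len(kept))][1]
```

Drawing randomly from the best-balanced subset, rather than always taking the single best allocation, preserves an element of chance in the assignment while still constraining imbalance.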

The study protocol was developed in accordance with Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence (SPIRIT-AI) and registered on ClinicalTrials.gov (NCT06792890).42 The study results reported in this article follow the Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI).43 The University of California, Los Angeles (UCLA) Institutional Review Board (IRB-24–5425) determined that the study did not constitute human subjects research. Institutional urgency to rapidly test the tools during a brief contractual trial period precluded the study’s timely preregistration. The finalized trial registration was submitted on December 9, 2024, and published on January 27, 2025; the unpublished protocol manuscript can be found in the Supplementary Appendix.

PARTICIPANTS

Physicians were recruited via department-wide emails and nominations from department leaders. Physicians were required to hold at least one half-day of clinic per week, and those in intervention groups who used human scribes were obligated to forgo this assistance during the study. Usage was restricted to English-only visits due to a lack of internal validation of translation capabilities. Figure 1 summarizes the trial recruitment.

Figure 1. CONSORT-AI Flow Diagram

Flow diagram illustrating the recruitment, randomization, and allocation of study participants. CONSORT-AI denotes Consolidated Standards of Reporting Trials–Artificial Intelligence.

INTERVENTION

Microsoft DAX Copilot version 2.0 and Nabla version 1.5 were used. Both products were integrated into the EHR, allowing AI scribe–produced text to populate Epic’s native notes. Participants received 1 hour of virtual training from physician informaticists and vendor representatives 1 to 2 weeks before the study’s onset. Training was standardized for length, vendor participation, and format of internally generated educational material. Two weeks after study onset, vendors delivered a second presentation highlighting advanced features and customization. Support included an internal messaging channel and listserv, as well as vendor assistance.

In compliance with California law requiring two-party consent for audio recording, all physicians were instructed to obtain verbal consent from all relevant parties and document it in their notes.

MEASURES

A prestudy Qualtrics survey characterized physician demographics and clinical time data. Total usage reflects all activity during the study period (November 4, 2024–January 3, 2025) to capture overall engagement.

The prespecified primary outcome was documentation time, specifically the amount of time spent writing each note, routinely measured by an Epic Signal EHR utilization metric (time-in-note). On a provider level, this metric represents the total number of minutes a physician spends writing notes in a week, divided by the total number of notes. For each provider, these results were averaged by month. For a baseline comparison, time-in-note data were collected for the 6 months preceding the study’s onset, while study data were collected from November 4, 2024, to January 3, 2025. We allowed 1 month’s lead time for participants to reach proficiency and thus only compared the second intervention month (December 2, 2024, to January 3, 2025) with the baseline for all participants, aggregated within each study group. Notably, the time-in-note metric does not account for time spent editing the AI scribe–generated draft within the vendor platforms.
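As a concrete sketch of how this metric aggregates (made-up numbers for one hypothetical provider; this mirrors the definition above, not Epic’s internal computation):

```python
import math

# Hypothetical weekly Signal values for one provider during one month:
# (total minutes spent writing notes that week, number of notes that week).
weekly = [(180, 40), (150, 36), (200, 44), (160, 38)]

# Weekly time-in-note = minutes writing notes / number of notes that week.
weekly_tin = [minutes / notes for minutes, notes in weekly]

# Provider-month value = average of the weekly values.
monthly_tin = sum(weekly_tin) / len(weekly_tin)

# The primary analysis models the log of this quantity (see Statistical Methods).
print(f"time-in-note: {monthly_tin:.2f} min/note; log: {math.log(monthly_tin):.3f}")
```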

Prespecified secondary outcomes included the following validated survey instruments (full descriptions in Table S1; a schematic scoring sketch follows the list):

  1. Mini-Z 2.0 to assess burnout, work environment, work pace, and EHR stress (10–50 scale, where lower scores=worse burnout, more stressful workplace).2,44,45 This 10-question survey contains a single-item burnout question that has been validated against a historical benchmark, the Maslach Burnout Inventory (MBI),44,46 whereas the remaining questions show convergent validity with MBI subscales.47 Mini-Z 2.0 is often reported as a single-item burnout measure along with the totals of its two five-question subscales — Supportive Work Environment and Work Pace and EMR Stress. However, the unidimensional composite score can be used to “portray an overall ‘joy score’,”44 representative of the single-item burnout question and the two subscales, and has been reported in the context of digital-scribe use.22

  2. Four-item PTL to assess cognitive load related to stress from EHR documentation (0–400 scale, where lower scores=less cognitive load).48,49 PTL has been demonstrated as a possible mediator between EHR usability and MBI-defined burnout, wherein better usability is associated with a lower PTL and lower odds of burnout.48 As a continuous effect, it has shown direct correlation with burnout, where each 40-point (10%) decrease in PTL correlated with 33% lower odds of burnout in one national study.49

  3. Professional Fulfillment Index — Work Exhaustion (PFI-WE; 0–4 scale, where lower scores=less exhaustion).50,51 PFI-WE has demonstrated convergent validity with the MBI — Emotional Exhaustion subscale.50
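The sketch below illustrates how the stated scale bounds and the dichotomization used in Table 2 arise. Per-item details not given in the text (e.g., that PTL items run 0 to 100 and that PFI-WE items are averaged) are assumptions for illustration, not the instruments’ official scoring manuals.

```python
def mini_z_total(items: list[int]) -> int:
    # Ten questions, each assumed scored 1-5; total 10 (worst) to 50 (best).
    assert len(items) == 10 and all(1 <= x <= 5 for x in items)
    return sum(items)

def ptl_total(items: list[float]) -> float:
    # Four items, each assumed scored 0-100; total 0 (least) to 400 (most load).
    assert len(items) == 4 and all(0 <= x <= 100 for x in items)
    return sum(items)

def pfi_we_score(items: list[int]) -> float:
    # Items scored 0-4 and assumed averaged; 0 (least) to 4 (most exhaustion).
    assert items and all(0 <= x <= 4 for x in items)
    return sum(items) / len(items)

def burnt_out(single_item_response: int) -> bool:
    # Per the Table 2 footnote: responses of 1-3 count as "burnt out";
    # responses above 3 count as "not burnt out."
    return single_item_response <= 3
```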

The poststudy survey addressed other prespecified topics via five-point Likert questions, such as usability, occurrence of inaccuracies and biases, and perceived risks to safety.

Two additional Epic Signal metrics — time in the EHR on unscheduled days, which reflects the average number of minutes a physician spends in the EHR on days with no scheduled patients, and time in EHR outside scheduled hours — were collected; baseline comparisons were conducted as with time-in-note.

SAMPLE SIZE JUSTIFICATION

Sample size was constrained by exclusion criteria and a contractual limitation stipulating a maximum of 100 concurrent users of each tool. A sample size of at least 79 physicians per condition (the size of our smallest condition) provides 80% power to detect effect sizes as small as 0.50 standard deviations, assuming a two-sample t-test and a two-sided 0.025 significance level (twofold Bonferroni correction for the comparison of each tool with the control condition) for the primary outcome.

Using prestudy data from a comparable time frame, the estimated log-scale standard deviation of change for the time-in-note metric was approximately 0.31. This design provided sufficient power to detect a 15.5% relative improvement in the primary outcome. Given a baseline-year geometric mean time-in-note of approximately 4 minutes 43 seconds, this corresponds to an absolute difference of roughly 44 seconds. This effect size falls within the range of values reported.32,34,37
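These numbers can be checked with a normal-approximation power formula (a back-of-the-envelope sketch; the study’s exact calculation, e.g., via a noncentral t distribution, may differ slightly):

```python
import math
from scipy.stats import norm

n_per_group = 79
alpha = 0.025            # two-sided, Bonferroni-adjusted for two comparisons
power = 0.80
sd_log_change = 0.31     # estimated SD of change in log time-in-note

# Detectable standardized effect for a two-sample comparison:
# d = (z_{1-alpha/2} + z_{power}) * sqrt(2 / n).
d = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * math.sqrt(2 / n_per_group)
print(f"detectable effect: {d:.2f} SD")      # ~0.49, rounded to 0.50 in the text

delta_log = 0.50 * sd_log_change             # 0.155 log units, i.e., ~15.5%
baseline_sec = 4 * 60 + 43                   # geometric mean baseline: 283 s
print(f"absolute difference: ~{baseline_sec * delta_log:.0f} s")   # ~44 seconds
```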

STATISTICAL METHODS

The primary analyses — comparing DAX versus the control and Nabla versus the control — were conducted according to the intention-to-treat principle. Time-in-note was log-transformed to address nonnormal distribution and compared with providers’ prior 6-month baseline. A linear mixed-effects model, which included a study group effect, a period effect (second month vs. first month), and the interaction of these terms, was utilized to model change in log-time-in-note, accounting for repeated measurements over time. Linear contrasts were used to evaluate the effect of each tool versus the control in the second month (primary hypothesis). Survey-derived (Mini-Z 2.0, PTL, PFI-WE) quantitative outcomes were analyzed using unadjusted linear regression models of the absolute change from baseline to postintervention. Binary outcomes (e.g., dichotomized burnout) were analyzed using logistic regression models of postintervention responses, adjusting for baseline response.
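Analyses were run in R (see below); purely to illustrate the model structure, an analogous mixed-model specification in Python’s statsmodels might look like the following, where the file and column names are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per provider-period, with hypothetical columns: provider_id,
# group ("control"/"dax"/"nabla"), period (baseline vs. study months),
# and time_in_note in minutes per note.
df = pd.read_csv("time_in_note.csv")
df["log_tin"] = np.log(df["time_in_note"])

# Group effect, period effect, and their interaction, with a random
# intercept per provider to account for repeated measurements over time.
model = smf.mixedlm("log_tin ~ C(group) * C(period)",
                    data=df, groups=df["provider_id"])
result = model.fit()
print(result.summary())

# The primary contrasts (each tool vs. control in the second month) are
# linear combinations of the interaction terms; exponentiating an estimate
# gives the relative change, e.g., exp(-0.10) - 1 ≈ -9.5% (cf. Table 2).
```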

We used a multiplicity-adjusted significance level of 0.025 for primary outcome hypothesis testing. For all other outcomes, we did not perform hypothesis testing and only report estimated differences and 95% confidence intervals. Results were collected on January 15, 2025, and all analyses were performed from January 16, 2025, to May 19, 2025, using R version 4.4.2 (https://www.r-project.org/).

Post hoc exploratory analyses, examining the comparisons of Nabla versus DAX and any scribe use versus the control, and the association of scribe usage rates with outcomes, can be found in the Supplementary Appendix (Table S2, Fig. S1).

Results

PARTICIPANTS

In total, 238 physicians representing 14 specialties were enrolled (Table 1). Females were overrepresented (60.5%), and nearly half of the providers (46.6%) reported being 35–44 years old. At baseline, median time-in-note was 4 minutes 23 seconds for the DAX group, 4 minutes 47 seconds for the Nabla group, and 5 minutes 22 seconds for the control group.

Table 1.

Study Physician Demographics.*

Characteristic, N (%) DAX (N=79) Nabla (N=79) Control (N=80)
Sex
 Male 31 (39) 22 (28) 34 (43)
 Female 45 (57) 55 (70) 44 (55)
 Prefer not to answer 3 (4) 2 (3) 2 (3)
Age range (years)
 25–34 8 (10) 12 (15) 17 (21)
 35–44 37 (47) 35 (44) 39 (49)
 45–54 24 (30) 21 (27) 14 (18)
 55–64 4 (5) 6 (8) 7 (9)
 65+ 3 (4) 0 1 (1)
 Prefer not to answer 3 (4) 5 (6) 2 (3)
Race
 Asian 34 (43) 33 (42) 31 (39)
 Black 2 (3) 0 3 (4)
 White 31 (39) 25 (32) 28 (35)
 Multiple/Other 6 (8) 7 (9) 10 (13)
 Prefer not to answer 6 (8) 14 (18) 8 (10)
Hispanic
 Yes 7 (9) 5 (6) 6 (8)
 No 68 (86) 66 (84) 70 (88)
 Prefer not to answer 4 (5) 8 (10) 4 (5)
Specialty
 Primary care 34 (43) 37 (47) 30 (38)
 Medical specialty 28 (35) 33 (42) 38 (48)
 Surgical specialty 17 (22) 9 (11) 12 (15)
Randomization metrics
 Baseline time-in-note (minutes) 4.38 (3.43, 6.05) 4.78 (2.80, 6.78) 5.37 (3.20, 7.52)
 Baseline single-item burnout 3.49 (0.77) 3.49 (0.81) 3.50 (0.78)
 Clinic days per week 3.35 (1.35) 3.45 (1.36) 3.36 (1.38)
* Demographic data compiled via mandatory prestudy surveys. SD denotes standard deviation; Q, quartile.

† Race was a multi-select question on the prestudy survey.

‡ Time-in-note reported as median (Q1, Q3); single-item burnout and clinic days per week reported as mean (SD).

DAX was used in 8271 of 24,696 (33.5%) patient visits, whereas Nabla was used in 6981 of 23,653 (29.5%) visits. Control group encounters totaled 24,020. Approximately 15% of intervention-group physicians never used their assigned scribe. Table S3 presents feedback from 28 physicians who indicated low or no use in their poststudy survey responses or who responded to the survey despite never completing a scribe encounter.

PRIMARY OUTCOME

Per the intention-to-treat principle, time-in-note for each provider reflects the average across all notes written per month, regardless of scribe use. In our mixed-model analysis comparing baseline log time-in-note with December data, time-in-note declined by an estimated 18 seconds (from 4 minutes 22 seconds to 4 minutes 4 seconds) in the control group, 23 seconds (from 4 minutes 29 seconds to 4 minutes 6 seconds) in the DAX group, and 41 seconds (from 4 minutes 30 seconds to 3 minutes 49 seconds) in the Nabla group. The reduction in the Nabla group was significantly larger than in the control group (−9.5%; 95% confidence interval [CI], −17.2% to −1.8%; P=0.02), while the reduction in the DAX group did not significantly differ from the control group (−1.7%; 95% CI, −9.4% to +5.9%; P=0.66) (Table 2). Although not statistically significant in a month-to-month comparison, reductions in November (Nabla=−5.0% [95% CI, −12.7% to +2.7%]; DAX=0.0% [95% CI, −7.6% to +7.7%]) were smaller than in December.
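The reported times can be roughly reconciled with the log-scale percentages as follows (approximate, because the displayed times are rounded and the published estimates come from the fitted model):

```python
import math

def log_change(before_sec: int, after_sec: int) -> float:
    return math.log(after_sec / before_sec)

control = log_change(4*60 + 22, 4*60 + 4)    # ~ -0.071
dax     = log_change(4*60 + 29, 4*60 + 6)    # ~ -0.089
nabla   = log_change(4*60 + 30, 3*60 + 49)   # ~ -0.165

# Difference-in-differences on the log scale, expressed as a percentage:
print(f"DAX vs. control:   {math.expm1(dax - control):+.1%}")    # ~ -1.8%
print(f"Nabla vs. control: {math.expm1(nabla - control):+.1%}")  # ~ -8.9%
# Close to the published -1.7% and -9.5%; the residual gaps reflect
# rounding of the displayed times and the mixed-model adjustment.
```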

Table 2.

Documentation Efficiency Metrics and Psychometric Outcomes from Survey Assessments.*

Measure DAX vs. Control Nabla vs. Control
Efficiency Metrics (%) Difference 95% CI Difference 95% CI
Time in notes per note −0.02 −0.09 to 0.06 −0.10 −0.17 to −0.02
Time on unscheduled days 0.14 −0.05 to 0.33 0.06 −0.13 to 0.26
Time outside scheduled hours 0.09 −0.14 to 0.32 0.08 −0.15 to 0.31
Mini-Z 2.0
 Supportive work environment 1.51 0.55 to 2.47 1.55 0.59 to 2.51
 Work pace and EMR stress 1.41 0.38 to 2.43 1.20 0.18 to 2.23
 Composite 2.83 1.28 to 4.37 2.69 1.14 to 4.23
PFI-WE −0.32 −0.55 to −0.08 −0.23 −0.46 to 0.01
PTL −39.89 −71.88 to −7.90 −31.69 −63.79 to 0.42
Binary Mini-Z 2.0 Odds Ratio 95% CI Odds Ratio 95% CI
Single-item burnout 0.78 0.40 to 1.54 0.67 0.34 to 1.34
Supportive work environment 1.87 0.83 to 4.23 1.84 0.82 to 4.14
Work pace and EMR stress 1.90 0.17 to 21.54 1.87 0.17 to 21.19
Composite — joyful workplace 1.64 0.38 to 7.14 1.50 0.35 to 6.45
* Values for efficiency metric differences represent relative proportional changes, e.g., −0.10 indicates a 10% decrease. Mini-Z 2.0, PFI-WE, and PTL differences represent the incremental change between prestudy (baseline) and poststudy survey response scores. Binary outcomes are reported as odds ratios, which represent subanalyses of the Mini-Z 2.0. Each of the listed elements of the Mini-Z 2.0 includes a threshold value that defines a dichotomous outcome; for example, responses above 3 to the single-item burnout question are interpreted as “not burnt out,” whereas responses of 1, 2, and 3 are interpreted as “burnt out.” Full descriptions of each validated survey element can be found in Table S1. CI denotes confidence interval; EMR, electronic medical record; OR, odds ratio; PFI-WE, Professional Fulfillment Index — Work Exhaustion; and PTL, physician task load.

† Primary outcome P values: DAX vs. control, P=0.66; Nabla vs. control, P=0.02.

SECONDARY OUTCOMES

The poststudy survey was completed by 61 of 80 (76%) control-group physicians and 65 of 79 (82%) physicians in each intervention group. Composite Mini-Z scores increased, whereas PTL and PFI-WE scores decreased for users of either scribe versus the control (Table 2).

Inaccuracies were noted “occasionally” (DAX=2.7 [95% CI, 2.4 to 3.0]; Nabla=2.8 [95% CI, 2.6 to 3.0]) and bias “rarely” (DAX=1.6 [95% CI, 1.4 to 1.8]; Nabla=1.7 [95% CI, 1.5 to 1.9]) (Table 3). Respondents who noted “inaccuracies” and “bias” most frequently described omissions (N=12); structural concerns (e.g., undesired formatting, oversimplified language, or too much detail) (N=11); and pronoun resolution errors (N=8) (Table 4).

Table 3.

Accuracy, Bias, and Usability.*

Question DAX (N=65) Mean (SD) Nabla (N=65) Mean (SD)
Inaccuracies and bias
 In the last 2 months, how often did an ambient scribe-generated note contain at least one clinically significant inaccuracy (e.g., hallucination, omission, or addition)? 2.7 (1.1) 2.8 (1.0)
 In the last 2 months, how often did an ambient scribe-generated note contain bias (e.g., unfair or skewed perspectives)? 1.6 (0.9) 1.7 (0.9)
Usability
 I used the tool as often as I could. 3.8 (1.4) 3.6 (1.4)
 The tool was easy to learn. 4.2 (1.1) 4.1 (0.9)
 The tool was easy to use. 4.2 (1.1) 3.9 (1.0)
 The tool decreased the time I typically spend working on notes in the clinic. 3.7 (1.3) 3.5 (1.2)
 The tool decreased the time I typically spend in the EHR outside of work hours. 3.6 (1.4) 3.4 (1.2)
 The tool decreased the time it typically takes me to close patient encounters. 3.8 (1.3) 3.4 (1.3)
 The tool is suited for the documentation needs of my specialty. 3.6 (1.3) 3.2 (1.3)
 I could see myself using the tool in the future. 4.1 (1.1) 3.8 (1.3)
 The tool generated notes at least as good as my own. 3.4 (1.4) 2.9 (1.3)
 The tool allowed me to engage more with my patients. 4.2 (1.0) 3.8 (1.1)
 Patients were generally okay with the use of the tool. 4.4 (0.9) 4.4 (0.9)
 Estimate the percentage of patients who declined use of this tool.§ 6.4% (12.1%) 7.2% (15.1%)
* SD denotes standard deviation.

† Response options: never (1); rarely (2); occasionally (3); frequently (4); and almost always (5).

‡ Response options: strongly disagree (1); somewhat disagree (2); neutral (3); somewhat agree (4); and strongly agree (5).

§ Response option: slider scale 0 to 100, in increments of 1.

Table 4.

Reports and Categorization of Inaccuracy and Bias.*

Category Example DAX (N=26) Nabla (N=25)
Omission “Omitted some pertinent information, or at least information I would have liked to have for ‘smaller’ problems. Had to either type myself or ask it to add in information and regenerate the note.” 4 8
Structure/content “As a specialist, there were frequently health care maintenance problems listed in the assessment and plan that were not relevant. Often too much detail as well. But overall, it was helpful!” 7 4
Pronoun resolution “Gender was inaccurately identified.” 8 0
Affirmation/negation detection “Sometimes mishearing which symptoms are pertinent positives vs. person negatives.” 0 4
Speech recognition “Misunderstood ‘not drinking alcohol’ as ‘now drinking alcohol’. Patient caught the error but was not angry, just wanted it corrected.” 1 2
Salience “The AI often prioritized secondary issues and concerns over the primary concern or diagnosis. As such, it would elevate and elaborate more on the secondary concern, which was often unrelated, than the primary diagnosis/concern, especially if the primary issue was stable.” 2 1
Clinical attribution “AI included information about the parents’ medical history as the patient’s, did not distinguish.” 0 2
Hallucination “I should’ve been keeping track better, but at least a couple of times it mentioned something that wasn’t discussed.” 1 1
Speaker attribution “Specific issues often arose from the system not recognizing when I was speaking and when the patient or family was. Question or statements I may have made were attributed to the patient as facts for them.” 0 2
* Four comments not presented in the table are categorized as “Other.”

One adverse patient safety event was reported. Five physician co-authors (P.J.L., J.N.M., C.S., E.C., and A.C.) independently deemed the event, described as “extensive patient counseling was not included in the assessment/plan or the patient instructions,” to be grade 1 (mild).52

Of all usability questions, physicians rated both scribes least favorably in response to, “The tool generated notes at least as good as my own,” where the average rating was “neutral” (DAX=3.4 [95% CI, 3.1 to 3.7]; Nabla=2.9 [95% CI, 2.6 to 3.2]). Users rated both scribes highly on improving physicians’ ability to engage with patients (DAX=4.2 [95% CI, 4.0 to 4.4]; Nabla=3.8 [95% CI, 3.5 to 4.1]) and patients’ receptiveness to their use (each scribe=4.4 [95% CI, 4.2 to 4.6]) (Table 3). Figure S2 illustrates the answer distributions for each Likert-scale question.

There were no differences in time spent in the EHR during unscheduled days or time spent in the EHR outside scheduled work hours (Table 2).

Discussion

In this randomized controlled trial of LLM-powered ambient AI scribes, we observed a modest reduction in the time spent on documentation among Nabla users compared with the control. Secondary end points indicate potential improvement in burnout, cognitive task load, and work exhaustion for DAX and Nabla users. Physicians found the scribes easy to use and felt that the technology allowed them to better engage with their patients, results that may reflect mechanisms for reduced burnout beyond documentation time-savings. Concerning one of the most anticipated and rapidly adopted technological innovations in U.S. health care since the Health Information Technology for Economic and Clinical Health (HITECH) Act incentivized EHRs, these empirical findings are consistent with prior observational studies showing optimistic results32–34,37 and reveal similar performance and reception across the two platforms. More broadly, by embedding a randomized trial within routine practice, our study provides high-quality real-world evidence and can serve as a model for scientifically sound and ethically responsible AI integration.

A central challenge to AI-scribe adoption lies in justifying investment in this costly technology; efficiency gains, such as decreased time-in-note, represent one plausible rationale. In a recent report from the Peterson Health Technology Institute, health-system leaders identified increasing the “number of patient encounters per period” and the “accuracy or completeness of coding for billing purposes” as two financial metrics for evaluating return on investment.31 The former, however, contradicts a potential benefit documented in this study and others — improved clinician well-being — and risks exacerbating physician disillusionment. Prior research suggests that burnout costs U.S. health systems $7600 per physician per year (2015 dollars),7 translating to $36.5 million annually if applied to the 4800 physicians at UCLA. Any improvement in physicians’ experience of work and reduction in stress from work could translate to savings from decreases in burnout-associated turnover and reduced clinical hours, and future research should determine the precise budgetary impact of this nascent technology.

On a broader scale, generative AI (genAI) is increasingly seen as a potential means to ease physicians’ documentation burden, a known contributor to EHR-mediated burnout.17,53,54 Two prominent early use cases of genAI, AI scribes and generative pretrained transformer (GPT)–produced draft responses to patient messages (Epic Systems), have shown signs of reducing PTL and burnout, despite limited or no efficiency gains.32–34,51 Building on these observations, our study provides preliminary support for AI scribes’ potential benefit to physician well-being, whether accompanied by efficiency gains (Nabla group) or not (DAX group). Holistically, the Mini-Z 2.0, PTL, and PFI-WE assess both drivers of burnout (e.g., workload, work conditions) and outcome states (e.g., burnout/exhaustion), and we noted concordant potential improvement across both the DAX and Nabla groups. Nevertheless, these secondary outcomes warrant cautious and nuanced interpretation and need to be confirmed in larger trials.

When our physicians reported on inaccuracies, they highlighted multiple manifestations, ranging from omissions to pronoun errors, underscoring the ambiguity and nuance of LLM clinical output and suggesting that ongoing physician oversight will be necessary to ensure documentation fidelity and to prevent unintended downstream consequences, such as clinically significant omissions. Yet there remains a wide gap between our ability to deploy genAI at scale and our ability to validate genAI at scale. As noted by Bedi et al.,55 this is partly attributable to a lack of standardized tasks and dimensions of evaluation; for example, with AI scribes, should we focus on accuracy and factuality or on omissions and comprehensiveness? Historically, quality assessment of EHR documentation has relied on manual human review with frameworks such as the Physician Documentation Quality Instrument — 9 (PDQI-9).56 Given that LLMs are nondeterministic and subject to versioning, which can influence output over time, relying on frequent user feedback for quality assurance is unrealistic and may itself exacerbate task load.57

One potential solution would be for LLMs to augment or replace human evaluators (i.e., LLMs as quality control agents).58 As noted by Croxford et al., this too comes with pitfalls, such as the rapid evolution of LLMs outpacing our ability to validate the LLM evaluators, LLMs’ inherent reliance on and sensitivity to prompts, and the challenge in replicating a physician’s nuanced clinical judgment, which is necessary to determine if generated content is meaningful in the context of a patient’s clinical course.59 To be sure, the medical profession must embrace AI education and promote widespread AI literacy — essential steps toward safely and effectively integrating these tools into clinical practice.

STRENGTHS AND LIMITATIONS

Our study highlights several strengths and methodological insights. First, a randomized clinical trial represents the ideal method to control for selection biases, which is particularly important for optional workflows, such as whether or not to use an AI scribe. Second, unlike static interventions in pharmaceutical trials, these tools rapidly evolve, even over the course of a months-long study. Thus, a short, contemporaneous study period was advantageous in minimizing bias from product changes while also avoiding the pitfalls of offering each vendor sequentially, where one product could mature more than the other.

This study has limitations. First, it was conducted among 238 physicians at a single academic institution on English-only encounters from November 4, 2024, to January 3, 2025, and thus our findings may not be broadly applicable to different practice settings or other times of the year. Second, our participants were mostly female, which could reduce generalizability given that females represent 38.1% of all physicians nationally.60 However, the inclusion of multiple specialties strengthens overall generalizability. Third, our trial was tightly scheduled due to limitations imposed by contract terms, which stipulated the trial’s length and number of users and contributed to the postcommencement trial registration. This may have limited the degree of impact, given the time it takes to gain dexterity with a new tool. The brevity may also have disincentivized physicians from investing time in learning and customizing the tool, which could lead to an underestimation of positive effects. Of note, while formal interaction testing was negative, we observed a month-to-month reduction trend in time-in-note, which may suggest a learning curve and potentially larger reductions over time. Fourth, physicians could edit the AI-generated text both in the scribe platforms and in Epic’s note-writing interface, and we learned midway through the trial that Epic’s Signal metrics do not account for time spent in the vendor platforms. Hence, reported time savings for time-in-note may be overestimated. This underrecognized limitation affects all AI scribe studies reporting Epic’s Signal metrics. Fifth, Nabla was not directly integrated within Epic’s mobile application (Haiku), whereas DAX was. Thus, Nabla users had to launch an encounter from a desktop to prompt a push notification on their phone, potentially affecting those physicians’ usability impressions. Sixth, there is potential for nonresponse bias with the poststudy survey, which could skew survey results positively or negatively.

Conclusion

We observed modest improvement in our primary outcome, time-in-note, for Nabla users. Though secondary psychometric end points suggest both DAX and Nabla may attenuate burnout and enhance the physician work experience, our preliminary findings underscore the need for future long-term studies to validate these trends across multiple institutions, establish a robust cost–benefit analysis, measure downstream effects on quality of care and patient safety, and precisely identify clinicians who will benefit most from this technology.

Supplementary Material

data sharing
disclosures
appendix

Acknowledgments

Supported by the University of California, Los Angeles (UCLA), Department of Medicine, the UCLA Faculty Practice Group, and the UCLA Value and Analytics Solutions Research Consortium. Dr. Mafi was supported by a National Institutes of Health (NIH)/National Institute on Aging (NIA) Beeson emerging leaders in aging research career development award (no. K76AG064392-01A1). In addition, Dr. Mafi received grants from the NIA, Arnold Ventures, and the Commonwealth Fund, and provided unpaid consulting to the Agency for Healthcare Research and Quality. Dr. Sarkisian was supported by an NIH/NIA midcareer award in patient-oriented aging research (no. 1K24AG047899) and the NIH/NCATS UCLA Clinical and Translational Science Institute (UL1TR001881 PI Dubinett).

We thank the UCLA Health Information Technology team for their technical support during the study and the participating physicians for their time and commitment. We appreciate the work of Chad Wes Villaflores, UCLA Healthcare Value Analytics Solutions Senior Data Scientist, for leading the ClinicalTrials.gov submission, Artem Romanov for his contributions to the IRB submission, and Katelyn Nguyen for her administrative support.

Footnotes

Disclosures

Author disclosures and other supplementary materials are available at ai.nejm.org.

References

1. Berg S. Physician burnout rate drops below 50% for first time in 4 years. American Medical Association. July 2, 2024. (https://www.ama-assn.org/practice-management/physician-health/physician-burnout-rate-drops-below-50-first-time-4-years).
2. Linzer M, Smith CD, Hingle S, et al. Evaluation of work satisfaction, stress, and burnout among US internal medicine physicians and trainees. JAMA Netw Open 2020;3:e2018758. DOI: 10.1001/jamanetworkopen.2020.18758.
3. Shanafelt TD, West CP, Dyrbye LN, et al. Changes in burnout and satisfaction with work-life integration in physicians during the first 2 years of the COVID-19 pandemic. Mayo Clin Proc 2022;97:2248–2258. DOI: 10.1016/j.mayocp.2022.09.002.
4. Abbasi J. Pushed to their limits, 1 in 5 physicians intends to leave practice. JAMA 2022;327:1435–1437. DOI: 10.1001/jama.2022.5074.
5. Nguyen M-LT, Honcharov V, Ballard D, Satterwhite S, McDermott AM, Sarkar U. Primary care physicians’ experiences with and adaptations to time constraints. JAMA Netw Open 2024;7:e248827. DOI: 10.1001/jamanetworkopen.2024.8827.
6. West CP, Dyrbye LN, Shanafelt TD. Physician burnout: contributors, consequences and solutions. J Intern Med 2018;283:516–529. DOI: 10.1111/joim.12752.
7. Han S, Shanafelt TD, Sinsky CA, et al. Estimating the attributable cost of physician burnout in the United States. Ann Intern Med 2019;170:784–790. DOI: 10.7326/M18-1422.
8. Guille C, Sen S. Burnout, depression, and diminished well-being among physicians. N Engl J Med 2024;391:1519–1527. DOI: 10.1056/NEJMra2302878.
9. Tawfik DS, Profit J, Morgenthaler TI, et al. Physician burnout, well-being, and work unit safety grades in relationship to reported medical errors. Mayo Clin Proc 2018;93:1571–1580. DOI: 10.1016/j.mayocp.2018.05.014.
10. Shanafelt TD, Dyrbye LN, Sinsky C, et al. Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction. Mayo Clin Proc 2016;91:836–848. DOI: 10.1016/j.mayocp.2016.05.007.
11. Moy AJ, Schwartz JM, Chen RJ, et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inform Assoc 2021;28:998–1008. DOI: 10.1093/jamia/ocaa325.
12. Peccoralo LA, Kaplan CA, Pietrzak RH, Charney DS, Ripp JA. The impact of time spent on the electronic health record after work and of clerical work on burnout among clinical faculty. J Am Med Inform Assoc 2021;28:938–947. DOI: 10.1093/jamia/ocaa349.
13. Lou SS, Lew D, Harford DR, et al. Temporal associations between EHR-derived workload, burnout, and errors: a prospective cohort study. J Gen Intern Med 2022;37:2165–2172. DOI: 10.1007/s11606-022-07620-3.
14. Rotenstein LS, Hendrix N, Phillips RL, Adler-Milstein J. Team and electronic health record features and burnout among family physicians. JAMA Netw Open 2024;7:e2442687. DOI: 10.1001/jamanetworkopen.2024.42687.
15. Sinsky C, Colligan L, Li L, et al. Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties. Ann Intern Med 2016;165:753–760. DOI: 10.7326/M16-0961.
16. Gaffney A, Woolhandler S, Cai C, et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern Med 2022;182:564–566. DOI: 10.1001/jamainternmed.2022.0372.
17. Yan Q, Jiang Z, Harbin Z, Tolbert PH, Davies MG. Exploring the relationship between electronic health records and provider burnout: a systematic review. J Am Med Inform Assoc 2021;28:1009–1021. DOI: 10.1093/jamia/ocab009.
18. Bates DW, Landman AB. Use of medical scribes to reduce documentation burden. JAMA Intern Med 2018;178:1472. DOI: 10.1001/jamainternmed.2018.3945.
19. Mishra P, Kiang JC, Grant RW. Association of medical scribes in primary care with physician workflow and patient experience. JAMA Intern Med 2018;178:1467–1472. DOI: 10.1001/jamainternmed.2018.3956.
20. Heckman J, Mukamal KJ, Christensen A, Reynolds EE. Medical scribes, provider and patient experience, and patient throughput: a trial in an academic general internal medicine practice. J Gen Intern Med 2020;35:770–774. DOI: 10.1007/s11606-019-05352-5.
21. Rotenstein L, Melnick ER, Iannaccone C, et al. Virtual scribes and physician time spent on electronic health records. JAMA Netw Open 2024;7:e2413140. DOI: 10.1001/jamanetworkopen.2024.13140.
22. Nguyen OT, Turner K, Charles D, et al. Implementing digital scribes to reduce electronic health record documentation burden among cancer care clinicians: a mixed-methods pilot study. JCO Clin Cancer Inform 2023;7:2200166. DOI: 10.1200/CCI.22.
23. Owens LM, Wilda JJ, Hahn PY, Koehler T, Fletcher JJ. The association between use of ambient voice technology documentation during primary care patient encounters, documentation burden, and provider burnout. Fam Pract 2024;41:86–91. DOI: 10.1093/fampra/cmad092.
24. Haberle T, Cleveland C, Snow GL, et al. The impact of nuance DAX ambient listening AI documentation: a cohort study. J Am Med Inform Assoc 2024;31:975–979. DOI: 10.1093/jamia/ocae022.
25. Gartner Research. Use-case prism: generative AI for healthcare providers. July 28, 2023. (https://www.gartner.com/en/documents/4582499).
26. Sarkar U, Bates DW. Using artificial intelligence to improve primary care for patients and clinicians. JAMA Intern Med 2024;184:343–344. DOI: 10.1001/jamainternmed.2023.7965.
27. Wachter RM, Brynjolfsson E. Will generative artificial intelligence deliver on its promise in health care? JAMA 2024;331:65–69. DOI: 10.1001/jama.2023.25054.
28. Mess SA, Mackey AJ, Yarowsky DE. Artificial intelligence scribe and large language model technology in healthcare documentation: advantages, limitations, and recommendations. Plast Reconstr Surg Glob Open 2025;13:e6450. DOI: 10.1097/GOX.0000000000006450.
29. Barr PJ, Gramling R, Vosoughi S. Preparing for the widespread adoption of clinic visit recording. NEJM AI 2024;1(11). DOI: 10.1056/AIp2400392.
30. Boyter M, Taylor B. Ambient speech 2025: examining vendor differentiators in a rapidly evolving market. KLAS Research, February 4, 2025. (https://klasresearch.com/report/ambient-speech-2025-examining-vendor-differentiators-in-a-rapidly-evolving-market/3475).
31. Peterson Health Technology Institute. Adoption of artificial intelligence in healthcare delivery systems: early applications and impacts. March 25, 2024. (https://phti.org/ai-adoption-early-applications-impacts/).
32. Duggan MJ, Gervase J, Schoenbaum A, et al. Clinician experiences with ambient scribe technology to assist with documentation burden and efficiency. JAMA Netw Open 2025;8:e2460637. DOI: 10.1001/jamanetworkopen.2024.60637.
33. Shah SJ, Devon-Sand A, Ma SP, et al. Ambient artificial intelligence scribes: physician burnout and perspectives on usability and documentation burden. J Am Med Inform Assoc 2025;32:375–380. DOI: 10.1093/jamia/ocae295.
34. Ma SP, Liang AS, Shah SJ, et al. Ambient artificial intelligence scribes: utilization and impact on documentation time. J Am Med Inform Assoc 2025;32:381–385. DOI: 10.1093/jamia/ocae304.
35. Liu T-L, Hetherington TC, Dharod A, et al. Does AI-powered clinical documentation enhance clinician efficiency? A longitudinal study. NEJM AI 2024;1(12). DOI: 10.1056/AIoa2400659.
36. Liu T-L, Hetherington TC, Stephens C, et al. AI-powered clinical documentation and clinicians’ electronic health record experience: a nonrandomized clinical trial. JAMA Netw Open 2024;7:e2432460. DOI: 10.1001/jamanetworkopen.2024.32460.
37. Tierney AA, Gayre G, Hoberman B, et al. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catal 2024;5(3). DOI: 10.1056/cat.23.0404.
38. Shah NH, Entwistle D, Pfeffer MA. Creation and adoption of large language models in medicine. JAMA 2023;330:866–869. DOI: 10.1001/jama.2023.14217.
39. Szolovits P. Large language models seem miraculous, but science abhors miracles. NEJM AI 2024;1(6). DOI: 10.1056/aip2300103.
40. Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJY, Kann BH. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Netw Open 2022;5:e2233946. DOI: 10.1001/jamanetworkopen.2022.33946.
41. Dolan ED, Mohr D, Lempa M, et al. Using a single item to measure burnout in primary care staff: a psychometric evaluation. J Gen Intern Med 2015;30:582–587. DOI: 10.1007/s11606-014-3112-6.
42. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020;26:1351–1363. DOI: 10.1038/s41591-020-1037-7.
43. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020;26:1364–1374. DOI: 10.1038/s41591-020-1034-x.
44. Linzer M, McLoughlin C, Poplau S, Goelz E, Brown R, Sinsky C. The Mini Z worklife and burnout reduction instrument: psychometrics and clinical implications. J Gen Intern Med 2022;37:2876–2878. DOI: 10.1007/s11606-021-07278-3.
45. Linzer M, Jin JO, Shah P, et al. Trends in clinician burnout with associated mitigating and aggravating factors during the COVID-19 pandemic. JAMA Health Forum 2022;3:e224163. DOI: 10.1001/jamahealthforum.2022.4163.
46. Rohland BM, Kruse GR, Rohrer JE. Validation of a single-item measure of burnout against the Maslach Burnout Inventory among physicians. Stress Health 2004;20:75–79. DOI: 10.1002/smi.1002.
47. Olson K, Sinsky C, Rinne ST, et al. Cross-sectional survey of workplace stressors associated with physician burnout measured by the Mini-Z and the Maslach Burnout Inventory. Stress Health 2019;35:157–175. DOI: 10.1002/smi.2849.
48. Melnick ER, Harry E, Sinsky CA, et al. Perceived electronic health record usability as a predictor of task load and burnout among US physicians: mediation analysis. J Med Internet Res 2020;22:e23382. DOI: 10.2196/23382.
49. Harry E, Sinsky C, Dyrbye LN, et al. Physician task load and the risk of burnout among US physicians in a national survey. Jt Comm J Qual Patient Saf 2021;47:76–85. DOI: 10.1016/j.jcjq.2020.09.011.
50. Trockel M, Bohman B, Lesure E, et al. A brief instrument to assess both burnout and professional fulfillment in physicians: reliability and validity, including correlation with self-reported medical errors, in a sample of resident and practicing physicians. Acad Psychiatry 2018;42:11–24. DOI: 10.1007/s40596-017-0849-3.
51. Garcia P, Ma SP, Shah S, et al. Artificial intelligence-generated draft replies to patient inbox messages. JAMA Netw Open 2024;7:e243201. DOI: 10.1001/jamanetworkopen.2024.3201.
52. U.S. Department of Health and Human Services. Common terminology criteria for adverse events (CTCAE) v5.0. November 27, 2017. (https://dctd.cancer.gov/research/ctep-trials/for-sites/adverse-events/ctcae-v5-5x7.pdf).
53. Pavuluri S, Sangal R, Sather J, Taylor RA. Balancing act: the complex role of artificial intelligence in addressing burnout and healthcare workforce dynamics. BMJ Health Care Inform 2024;31:e101120. DOI: 10.1136/bmjhci-2024-101120.
54. Yadav GS, Longhurst CA. Will AI make the electronic health record more efficient for clinicians? NEJM AI 2025;2(3). DOI: 10.1056/AIe2500020.
55. Bedi S, Liu Y, Orr-Ewing L, et al. Testing and evaluation of health care applications of large language models. JAMA 2025;333:319. DOI: 10.1001/jama.2024.21700.
56. Stetson PD, Bakken S, Wrenn JO, Siegler EL. Assessing electronic note quality using the physician documentation quality instrument (PDQI-9). Appl Clin Inform 2012;3:164–174. DOI: 10.4338/ACI-2011-11-RA-0070.
57. Ohde JW, Rost LM, Overgaard JD. The burden of reviewing LLM-generated content. NEJM AI 2025;2(2). DOI: 10.1056/AIp2400979.
58. Croxford E, Gao Y, Pellegrino N, et al. Development and validation of the provider documentation summarization quality instrument for large language models. J Am Med Inform Assoc 2025;32:1050–1060. DOI: 10.1093/jamia/ocaf068.
59. Croxford E, Gao Y, Pellegrino N, et al. Current and future state of evaluation of large language models for medical summarization tasks. NPJ Health Systems 2025;2:6. DOI: 10.1038/s44401-024-00011-2.
60. Association of American Medical Colleges. U.S. physician workforce data dashboard. 2024. (https://www.aamc.org/data-reports/report/us-physician-workforce-data-dashboard).
