Abstract
BACKGROUND
Electronic health record (EHR) documentation is a major contributor to health care practitioner burnout, characterized by work-related exhaustion and interpersonal disengagement. Generative artificial intelligence (AI) scribes that passively capture clinical conversations and draft visit notes may alleviate this burden, but evidence remains limited.
METHODS
A 24-week, stepped-wedge, individually randomized pragmatic trial was conducted across ambulatory clinics in two states. Sixty-six health care practitioners were randomly assigned to one of three sequences that crossed over to ambient AI at 6-week intervals. The coprimary outcomes were professional fulfillment and work exhaustion/interpersonal disengagement from the Stanford Professional Fulfillment Index. Secondary measures included time spent on notes, work outside work (WoW), documentation quality assessed with the Provider Documentation Summarization Quality Instrument 9 (PDSQI-9), and billing diagnostic codes reviewed by professional staff coders. Linear mixed models were used for intention-to-treat (ITT) analyses.
RESULTS
A total of 71,487 notes were authored, of which 27,092 (38%) were generated using ambient AI. Ambient AI use was associated with a significant reduction in work exhaustion/interpersonal disengagement (−0.44 points; 95% confidence interval [CI], −0.62 to −0.25; P<0.001) and a nonsignificant increase in professional fulfillment (+0.14 points; 95% CI, 0.004 to 0.28; P=0.04) on a five-point Likert scale. Among secondary measures, time spent on notes decreased (−0.36 hours per day; 95% CI, −0.55 to −0.17). The reduction in WoW (−0.50 hours per day; 95% CI, −0.90 to −0.09) was sensitive to the exclusion of extreme values and was no longer significant after removing the top 3% of daily observations. Compliance of diagnostic billing codes improved with ambient AI use (P<0.001). Documentation quality, assessed with the PDSQI-9, demonstrated mean scores ranging from 3.97 to 4.99 across domains on a five-point scale. No drift in software performance was detected.
CONCLUSIONS
In a real-world randomized implementation, ambient AI reduced health care practitioners’ work exhaustion/interpersonal disengagement but did not significantly increase professional fulfillment. Documentation time decreased without compromising diagnosis, billing compliance, or note quality. (Funded by the University of Wisconsin Hospital and Clinics and the National Institutes of Health Clinical and Translational Science Award; ClinicalTrials.gov number, NCT06517082.)
Introduction
Electronic health record (EHR) documentation is a major contributor to health care practitioner burnout, as evidenced by practitioners spending 1–2 hours documenting after-hours for every hour spent face-to-face with patients.1,2 Ambient artificial intelligence (AI) scribes that use generative AI to passively capture clinical conversations and draft visit notes in real time have emerged as a promising solution to mitigate burnout.3 Early quality improvement pilots with pre–post studies have shown modest reductions in note-writing time, total daily EHR time, and after-hours work, together with improvements in perceived cognitive load and professional fulfillment.4–12 Surveys from health care practitioners also report high usability and favorable effects on the patient encounter, and large-scale operational rollouts have demonstrated rapid uptake across diverse specialties.13
Despite encouraging pilot results, the evidence base remains limited in terms of methodological rigor and scope. Most studies have implemented single-group or nonrandomized cohorts and lasted between 4 and 12 weeks, with few reporting sample-size calculations. Outcome ascertainment was often confined to short-term surveys, and analyses rarely accounted for calendar time trends or intrapractitioner correlation. Consequently, the magnitude and durability of ambient AI’s impact on health care practitioners’ well-being, and the contextual factors that modify that impact, remain uncertain. Effects on objective EHR metrics were sometimes inconsistent; one controlled cohort reported note-time savings but no change in after-hours work.8,9 Furthermore, none examined a pragmatic, workflow-embedded deployment with sequential implementation. Only one other randomized clinical trial to date has evaluated ambient AI in a pragmatic, system-wide deployment, but the primary outcome was not practitioner well-being.14
To address these gaps, we conducted an EHR-embedded, stepped-wedge, individually randomized pragmatic clinical trial of a single vendor’s ambient AI scribe across a health system spanning two U.S. states and eight specialties. This design introduced randomization and longer follow-up in a real-world deployment. We hypothesized that ambient AI would improve health care practitioners’ well-being, yielding higher fulfillment scores and lower work exhaustion/interpersonal disengagement (burnout) scores.
Methods
STUDY SETTING AND POPULATION
This pragmatic trial was embedded in the EHR of an academic health system with ambulatory clinics in Wisconsin and Illinois between August 21, 2024, and March 27, 2025. Health care practitioners 18 years of age or older who had at least 20 ambulatory patient encounters per week, used an Apple mobile device, and could access Epic Haiku mobile (Epic Corporation) were eligible. Additional eligibility required a willingness to adopt the ambient AI tool. Health care practitioners planning a leave of more than 6 weeks, or who had a human scribe and were unwilling to unenroll from that service, were excluded.
A closed cohort of 66 health care practitioners (attending physicians and advanced practice practitioners) was recruited and individually randomly assigned in a 1:1:1 ratio to three 6-week sequences (waves) of a 24-week stepped-wedge schedule (Fig. 1). Stratified permuted-block randomization balanced specialty type, and each wave transitioned from practice-as-usual to the ambient AI intervention at 6-week intervals (Supplementary Appendix, Section 1). The trial was registered on ClinicalTrials.gov (NCT06517082) on August 9, 2024. All primary and secondary end points were prespecified in the protocol and completed as planned; no deviations from the prespecified end points occurred. The full statistical analysis plan (version 2.0, June 23, 2025), analysis code, and participant survey data are available in the public Git repository (https://git.doit.wisc.edu/smph-public/LearningHealthSystem/ambientlistening) to support reproducibility.
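To make the allocation procedure concrete, the sketch below illustrates stratified permuted-block assignment of practitioners to the three waves in R, the language used for the trial analyses. The practitioner identifiers, stratum labels, and block size are illustrative assumptions, not the trial’s actual randomization list; the protocol-specified procedure is described in the Supplementary Appendix.

```r
# Illustrative sketch only: stratified permuted-block assignment to the three
# stepped-wedge waves. IDs, strata, and block size are hypothetical.
set.seed(2024)

assign_waves <- function(ids, strata, n_waves = 3) {
  wave <- integer(length(ids))
  for (s in unique(strata)) {
    idx <- sample(which(strata == s))                  # shuffle within stratum
    n_blocks <- ceiling(length(idx) / n_waves)
    perms <- unlist(lapply(seq_len(n_blocks), function(b) sample(seq_len(n_waves))))
    wave[idx] <- perms[seq_along(idx)]                 # assign one permuted block at a time
  }
  data.frame(id = ids, stratum = strata, wave = wave)
}

alloc <- assign_waves(ids = sprintf("HCP%02d", 1:66),
                      strata = sample(c("Family medicine", "Internal medicine",
                                        "Pediatrics", "Other"), 66, replace = TRUE))
table(alloc$stratum, alloc$wave)                       # roughly 1:1:1 within each stratum
```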
Figure 1. Enrollment and Random Assignment of Health Care Practitioners onto Ambient AI.

Figure 1 shows the flow of health care practitioners through screening, enrollment, random assignment, and analysis of their notes. Eligible outpatient practitioners included physicians and advanced practice providers across participating primary care and specialty clinics. Practitioners were randomly assigned in a 1:1:1 ratio to one of three waves that crossed over from the usual documentation workflow to ambient AI. The figure depicts the number of practitioners assessed for eligibility, excluded prior to being randomly assigned, assigned to each wave, and included in the primary analysis. All participants were analyzed according to their assigned wave under the intention-to-treat principle.
EHR-EMBEDDED AMBIENT AI SOFTWARE
The ambient AI software (Abridge AI, Inc., 2024), a Health Insurance Portability and Accountability Act–compliant platform, was concurrently contracted for operational deployment in clinical practice. At the time of the first license activation, the deployed version was the generally available release as of October 10, 2024, which included automatic speech recognition and natural language generation technology to produce transcripts and drafts of documentation from clinical conversations. The software included several components, some involving bespoke large language models (LLMs), and was partially developed on a proprietary dataset of deidentified clinical conversations with gold-standard audio transcripts and human annotations. The integration of Abridge into the EHR was achieved through Epic’s private application programming interfaces (APIs) and Fast Healthcare Interoperability Resources (FHIR) R4 APIs. Epic’s private APIs facilitated the transfer of audio files captured during ambient AI sessions and the return of the AI-generated notes. The software was documentation-only and provided no diagnostic or therapeutic decision support. Patients were informed of ambient AI use at check-in and verbally by health care practitioners; recording proceeded only with patient consent, and patients could decline without consequence.
RECRUITMENT AND DATA MONITORING
An email from the clinical operations team and department chairs directed interested health care practitioners to a Research Electronic Data Capture (REDCap) screening survey. Technical eligibility was verified, and the institutional review board approved the trial with a waiver of patient consent and with electronic consent from health care practitioners. Training comprised a 1-hour group webinar plus an online user guide; health care practitioners could access dedicated IT support throughout the study. Weekly utilization dashboards and documentation compliance were reviewed by clinical operations teams. A difference-in-differences analysis monitored for drift by flagging sudden changes in documentation efficiency.15 No blinding was performed, but all analytic code was prespecified in the statistical analysis plan and posted publicly to reduce the risk of selective reporting.
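A minimal sketch of how such a difference-in-differences drift check could be set up in R is shown below, using the October 29, 2024, software update (see Results) as the event of interest. The data frame `weekly`, its column names, and the chosen outcome are assumptions for illustration; the monitoring approach itself is described in reference 15.

```r
# Minimal sketch, assuming a weekly practitioner-level data frame `weekly` with
# columns hcp_id, week_start (Date), on_ambient (1 if the practitioner had
# crossed over to ambient AI that week, else 0), and note_hours (time on notes).
library(lme4)

weekly$post_update <- as.integer(weekly$week_start >= as.Date("2024-10-29"))

# The on_ambient:post_update interaction is the difference-in-differences term;
# a sudden shift after the update would signal drift in documentation efficiency.
fit_did <- lmer(note_hours ~ on_ambient * post_update + (1 | hcp_id), data = weekly)
summary(fit_did)$coefficients["on_ambient:post_update", ]
```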
DIAGNOSIS BILLING COMPLIANCE AND NOTE QUALITY
Health care practitioners entered their own diagnostic and billing codes, which were then reviewed by certified health system coders as part of standard operations. Coders assessed the alignment of primary and secondary International Classification of Diseases, Tenth Revision (ICD-10) codes with documentation, using a standardized rubric, assigning partial credit for insufficient specificity and a score of zero for errors that affected billing validity (Supplementary Appendix, Section 2).
We assessed documentation quality in a random sample of AI-generated notes using the Provider Documentation Summarization Quality Instrument 9 (PDSQI-9), a validated instrument adapted here for evaluating summaries of clinical encounter transcripts.16 Transcripts served as the reference documents, an analog to prior use of source notes for multidocument summarization. The PDSQI-9 evaluates accuracy (including falsification/fabrication), thoroughness (major omissions), abstraction/synthesis, usefulness for clinical decision-making, stigmatizing language, and linguistic qualities (organization, comprehensibility, succinctness). Each item is scored on a five-point Likert scale with explicit, rule-based anchors for every level of performance. The citation domain was excluded, as transcripts represented a single-source document. Scoring was performed on the final AI-generated note using an LLM-as-a-judge implementation that has been validated against physician ratings across multiple note-generation tasks.17 Notes were analyzed in their unedited form before practitioner review. The full rubric, scoring rules, and automation scripts are available in Epic’s open-source GitHub repository at https://github.com/epic-open-source/evaluation-instruments/tree/main/src/evaluation_instruments/instruments/pdsqi_9.
PRIMARY AND SECONDARY OUTCOMES
The two coprimary outcomes that comprised well-being were (i) professional fulfillment (six-item subscale, scores from 6 to 30; higher=better) and (ii) work exhaustion/interpersonal disengagement (10-item composite, scores from 10 to 50; higher=worse) of the Stanford Professional Fulfillment Index (PFI).18,19 Consistent with the original validation of the PFI by Trockel et al., the subscale scores were calculated as the mean of their respective items (6 items for professional fulfillment and 10 items for work exhaustion/interpersonal disengagement), with each item scored on a five-point Likert scale and resulting subscale scores ranging from 1 to 5.18 The constructs of work exhaustion and interpersonal disengagement (e.g., depersonalization) correspond to the commonly used term burnout. The PFI was collected at baseline, and at 6, 12, 18, and 24 weeks.
Additional practitioner-reported secondary measures were task load, impact of work on personal relationships, meaningfulness of clinical work, perceived efficiency, perceived control over schedule, and trust in AI. Secondary measures of documentation efficiency were derived from EHR audit-log data and included time spent on notes normalized to an 8-hour workday, work outside work (WoW) also normalized to an 8-hour workday, time to patient follow-up after a visit, and the proportion of encounters closed on the same day or before the next patient visit.20
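As a small illustration of the normalization, the sketch below expresses daily audit-log time on notes and WoW per 8 hours of scheduled clinical time. The data frame `audit` and its column names are hypothetical; the metric definitions follow Sinsky et al.20

```r
# Minimal sketch, assuming a daily data frame `audit` with columns hcp_id, date,
# note_minutes, wow_minutes, and scheduled_hours of clinical time.
audit$note_hours_per8 <- (audit$note_minutes / 60) / audit$scheduled_hours * 8
audit$wow_hours_per8  <- (audit$wow_minutes  / 60) / audit$scheduled_hours * 8
```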
SAMPLE SIZE CALCULATION
Historical PFI data collected in 2023 from 1091 University of Wisconsin Health practitioners informed the sample size calculation. Test–retest reliability for PFI subscales was 0.71 to 0.80; a conservative within-practitioner intracluster correlation (ICC) of 0.65 was chosen. Published work suggested clinically meaningful change corresponds to a Cohen’s d of 0.44 to 0.55.18 Powering for the lower bound (d=0.44) with an alpha of 0.025 per coprimary outcome, an 85% power, an ICC of 0.65, and a discrete time–decay correlation yielded 22 health care practitioners per sequence (66 in total) for the stepped-wedge design. In the clinical trial, the health care practitioner-level overall ICC was found to be 0.64 (95% confidence interval [CI], 0.54 to 0.74) for work exhaustion/interpersonal disengagement and 0.75 (95% CI, 0.67 to 0.82) for professional fulfillment.
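A simulation-based sketch of this power calculation is shown below. It assumes a standardized outcome (total variance of 1), a simple random-intercept correlation structure rather than the discrete time-decay structure named above, and a crossover schedule in which wave w is exposed from its w-th follow-up survey onward; it illustrates the logic rather than reproducing the protocol calculation exactly.

```r
# Simulation sketch: power for the treatment effect in a 3-sequence stepped wedge
# with 22 practitioners per sequence, surveys at baseline and 4 follow-ups,
# within-practitioner ICC of 0.65, and effect size d = 0.44 (alpha = 0.025).
library(lme4)
library(lmerTest)

simulate_once <- function(n_per_seq = 22, icc = 0.65, d = 0.44) {
  ids  <- seq_len(3 * n_per_seq)
  wave <- rep(1:3, each = n_per_seq)
  dat  <- expand.grid(id = ids, time = 0:4)          # time 0 = baseline survey
  dat$treat <- as.integer(dat$time >= wave[dat$id])  # assumed exposure schedule
  b_i  <- rnorm(length(ids), sd = sqrt(icc))         # practitioner random intercepts
  dat$y <- d * dat$treat + b_i[dat$id] + rnorm(nrow(dat), sd = sqrt(1 - icc))
  fit  <- lmer(y ~ treat + factor(time) + (1 | id), data = dat)
  coef(summary(fit))["treat", "Pr(>|t|)"]
}

set.seed(1)
p_vals <- replicate(500, simulate_once())
mean(p_vals < 0.025)                                 # empirical power estimate
```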
ANALYSIS PLAN
Analyses followed the intention-to-treat (ITT) principle and used linear mixed-effects models with random intercepts for participants, with the item-score means as the primary outcomes. Calendar time trends were adjusted for with period fixed effects (reference: period 1); for the documentation-metric secondary outcomes, calendar time trends were adjusted for using continuous weeks since the start of the study. The type-I error rate was split between the coprimary outcomes, with each outcome tested at a two-sided type-I error rate of 0.025 (total type-I error rate of 0.05). Analyses were also adjusted for specialty, the stratification variable used in randomization.21 To control the false discovery rate across secondary measures, the Benjamini–Hochberg method was applied with a total two-sided type-I error rate of 0.1. Primary mixed-effects models included provider random intercepts, period fixed effects, and adjustment for specialty. Survey missingness was low; complete-case analyses were performed, and 95% CIs were calculated using robust standard errors with a bias-reduced linearization adjustment (CR2) from the “clubSandwich” package.22–24
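The sketch below illustrates how the primary model and inference could be specified in R with the packages named above. The data frame `pfi` and its column names are assumptions for illustration; the actual prespecified code is in the linked repository.

```r
# Minimal sketch, assuming a long-format data frame `pfi` with one row per
# practitioner-survey: hcp_id, period (survey period 1-5), specialty,
# treat (1 if ambient AI access, else 0), and score (mean of subscale items).
library(lme4)
library(clubSandwich)

fit <- lmer(score ~ treat + factor(period) + specialty + (1 | hcp_id), data = pfi)

# Cluster-robust inference with the bias-reduced linearization (CR2) adjustment;
# clustering defaults to the lmer grouping factor (practitioner).
coef_test(fit, vcov = "CR2")
conf_int(fit, vcov = "CR2", level = 0.95)

# Benjamini-Hochberg adjustment across secondary measures (hypothetical vector
# of raw p-values), controlling the false discovery rate at 0.1 overall.
p_secondary <- c(task_load = 0.001, relationships = 0.01, efficiency = 0.0005)
p.adjust(p_secondary, method = "BH")
```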
For clinical use, thresholds to dichotomize professional fulfillment and work exhaustion/interpersonal disengagement scores were provided in the validation work by Trockel et al.,18 who applied receiver-operating-characteristic analyses against external anchors of quality of life and previously published burnout measures to identify thresholds with optimal sensitivity and specificity (at least 2.33 for work exhaustion/interpersonal disengagement and at least 4.0 for professional fulfillment). These thresholds demonstrated meaningful discrimination and were further associated with clinically relevant outcomes, including medical error rates and depression-symptom severity. Guided by this prior work, and consistent with institutional practice at our site, we applied the same thresholds. In addition, the number needed to treat (NNT) was calculated by dichotomizing the within-practitioner change in work exhaustion/interpersonal disengagement score, where 1 indicates improvement of at least the magnitude of the estimated treatment effect from the linear mixed-effects model. We then used a logistic generalized linear mixed model adjusted for time-period fixed effects and specialty to estimate the adjusted odds ratio, which was used to estimate the absolute risk reduction and the NNT; CIs were calculated using the Wald method.25,26 All analyses were performed in R (version 4.3), and Supplementary Appendix, Section 3, provides the Consolidated Standards of Reporting Trials (CONSORT) checklist27 for reporting. Clinical trial data and analytic code are available at https://git.doit.wisc.edu/smph-public/LearningHealthSystem/ambientlistening.
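A condensed sketch of the NNT calculation is shown below. The indicator variable `improved`, the data frame, and the way the control-period risk is obtained are illustrative assumptions; the exact implementation is available in the repository linked above.

```r
# Minimal sketch, assuming `pfi` additionally contains improved (1 if the
# within-practitioner change in work exhaustion/interpersonal disengagement
# met the estimated treatment effect, else 0).
library(lme4)

fit_nnt <- glmer(improved ~ treat + factor(period) + specialty + (1 | hcp_id),
                 data = pfi, family = binomial)

beta <- unname(fixef(fit_nnt)["treat"])
or   <- exp(beta)                                 # adjusted odds ratio
p0   <- mean(pfi$improved[pfi$treat == 0])        # risk of improvement, usual care
p1   <- plogis(qlogis(p0) + beta)                 # risk of improvement, ambient AI
arr  <- p1 - p0                                   # absolute difference in risk
nnt  <- 1 / arr
confint(fit_nnt, method = "Wald")                 # Wald CIs on the log-odds scale
c(odds_ratio = or, ARR = arr, NNT = nnt)
```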
Results
HEALTH CARE PRACTITIONER AND AMBIENT NOTE CHARACTERISTICS
The 66 enrolled health care practitioners were evenly distributed by specialty across the three randomization waves and were predominantly female (78.8%), with a median age of 42.5 years (interquartile range, 36.3–49.0). Most identified as non-Hispanic white (89.4%) and were physicians (72.7%). The specialty with the greatest representation was family medicine (45.5%). Health care practitioners had a median of 14 years of practice experience (interquartile range, 9–20) and cared for a median of 50 patients per week (interquartile range, 36–60). Pretrial note taking relied heavily on manual typing (83.3%) and templated notes (84.8%). The mean baseline professional fulfillment score was 3.42 (standard deviation [SD], 0.75), and the mean work exhaustion/interpersonal disengagement score was 2.49 (SD 0.96) on a five-point Likert scale. Most health care practitioners reported early adoption of new technology and felt comfortable using new tools with minimal training (Table 1).
Table 1.
Health Care Practitioner-level Baseline Characteristics.*
| Characteristic | Wave 1 (n=22) | Wave 2 (n=21) | Wave 3 (n=23) | Overall (n=66) |
|---|---|---|---|---|
| Female sex — n (%) | 19 (86.4%) | 15 (71.4%) | 18 (78.3%) | 52 (78.8%) |
| Age (years) — median [IQR] | 40.5 [38.3–46.8] | 48.0 [39.0–55.0] | 42.0 [35.5–47.0] | 42.5 [36.3–49.0] |
| Race or ethnicity† — n (%) | ||||
| Indigenous American or Alaska Native | 0 | 0 | 0 | 0 |
| Non-Southeast Asian | 1 (4.5) | 0 | 0 | 1 (1.5) |
| African American or Black | 0 | 0 | 0 | 0 |
| Hispanic or Latino | 0 | 1 (4.8) | 0 | 1 (1.5) |
| Middle Eastern or North African | 0 | 0 | 0 | 0 |
| Native Hawaiian or Pacific Islander | 0 | 0 | 0 | 0 |
| Southeast Asian | 1 (4.5) | 0 | 0 | 1 (1.5) |
| White | 18 (81.8) | 19 (90.5) | 22 (95.7) | 59 (89.4) |
| Multiracial or multiethnic | 2 (9.1) | 1 (4.8) | 0 | 3 (4.5) |
| Missing | 0 | 0 | 1 (4.3) | 1 (1.5) |
| Use more than one language in professional work‡ — n (%) | 3 (13.6) | 2 (9.5) | 1 (4.3) | 6 (9.1) |
| Missing — n (%) | 1 (4.5) | 0 | 0 | 1 (1.5) |
| Years in practice — median [IQR] | 15.0 [10.5–16.0] | 13.0 [9.5–26.0] | 14.0 [9.0–21.0] | 14.0 [9.0–20.0] |
| Missing — n (%) | 1 (4.5) | 0 | 0 | 1 (1.5) |
| Years in department — median [IQR] | 7.0 [3.0–8.5] | 8.0 [5.0–12.0] | 7.0 [3.0–13.0] | 7.0 [3.0–12.0] |
| Missing — n (%) | 1 (4.5) | 0 | 0 | 1 (1.5) |
| Years in specialty, median [IQR] | 10.0 [8.0–16.0] | 10.0 [7.5–20.0] | 11.0 [6.0–17.5] | 10.0 [7.0–18.0] |
| Missing — n (%) | 1 (4.5) | 0 | 0 | 1 (1.5) |
| Health care practitioner type — n (%) | ||||
| NP | 2 (9.1) | 1 (4.8) | 2 (8.7) | 5 (7.6) |
| PA | 5 (22.7) | 6 (28.6) | 2 (8.7) | 13 (19.7) |
| Physician | 15 (68.2) | 14 (66.7) | 19 (82.6) | 48 (72.7) |
| Random assignment specialty stratum — n (%) | ||||
| Family medicine | 10 (45.5) | 9 (42.9) | 11 (47.8) | 30 (45.5) |
| Internal medicine | 6 (27.3) | 6 (28.6) | 6 (26.1) | 18 (27.3) |
| Pediatrics/adolescent medicine | 2 (9.1) | 2 (9.5) | 2 (8.7) | 6 (9.1) |
| Other specialty§ | 4 (18.2) | 4 (19.0) | 4 (17.4) | 12 (18.2) |
| Patients seen in the clinic per week — median [IQR] | 55 [36.0–62.0] | 41 [36.0–57.0] | 50 [35.5–59.0] | 50 [36–60] |
| Missing — n (%) | 1 (4.5) | 0 | 0 | 1 (1.5) |
| Pretrial note taking method¶ — n (%) | ||||
| Manual typing | 17 (77.3) | 19 (90.5) | 19 (82.6) | 55 (83.3) |
| Templated | 18 (81.8) | 19 (90.5) | 19 (82.6) | 56 (84.8) |
| Dictation | 2 (9.1) | 0 | 1 (4.3) | 3 (4.5) |
| Scribes | 3 (13.6) | 5 (23.8) | 4 (17.4) | 12 (18.2) |
| Fluency Direct | 13 (58.1) | 13 (61.9) | 13 (56.5) | 39 (59.1) |
| Other (student) | 0 | 1 (4.8) | 0 | 1 (1.5) |
| PFI professional fulfillment (baseline) — mean (SD) | 3.66 (0.72) | 3.56 (0.74) | 3.08 (0.69) | 3.42 (0.75) |
| PFI work exhaustion/interpersonal disengagement (baseline) — mean (SD) | 2.36 (0.80) | 2.33 (0.65) | 2.74 (0.56) | 2.49 (0.96) |
| How often do you foresee yourself using the ambient listening tool in your daily work routine? — n (%) | ||||
| 100% patients | 11 (50.0) | 17 (81.0) | 13 (56.5) | 41 (62.1) |
| >50% patients | 11 (50.0) | 3 (14.3) | 10 (43.5) | 24 (36.4) |
| 25–50% patients | 0 | 1 (4.8) | 0 | 1 (1.5) |
| <25% patients | 0 | 0 | 0 | 0 |
| Compared with your peers, how early do you adopt new technology and use new applications? — n (%) | ||||
| Much earlier | 5 (22.7) | 1 (4.8) | 3 (13.0) | 9 (13.6) |
| Somewhat earlier | 11 (50.0) | 13 (61.9) | 8 (34.8) | 32 (48.5) |
| At an average pace | 5 (22.7) | 5 (23.8) | 8 (34.8) | 18 (27.3) |
| Somewhat later | 0 | 1 (4.8) | 4 (17.4) | 5 (7.6) |
| Much later | 1 (4.5) | 1 (4.8) | 0 | 2 (3.0) |
| When considering new technology, how willing are you to take risks and accept quirks with unproven solutions? — n (%) | ||||
| Extremely confident | 2 (9.1) | 1 (4.8) | 0 | 3 (4.5) |
| Moderately confident | 11 (50.0) | 9 (42.9) | 10 (43.5) | 30 (45.5) |
| Somewhat confident | 5 (22.7) | 8 (38.1) | 9 (39.1) | 22 (33.3) |
| Slightly confident | 3 (13.6) | 2 (9.5) | 3 (13.0) | 8 (12.1) |
| Not confident at all | 1 (4.5) | 1 (4.8) | 1 (4.3) | 3 (4.5) |
| How often do you influence others in your network to try new technologies? — n (%) | ||||
| Very frequently | 1 (4.5) | 3 (14.3) | 2 (8.7) | 6 (9.1) |
| Frequently | 12 (54.5) | 8 (38.1) | 7 (30.4) | 27 (40.9) |
| Occasionally | 6 (27.3) | 8 (38.1) | 10 (43.5) | 24 (36.4) |
| Seldom | 2 (9.1) | 2 (9.5) | 4 (17.4) | 8 (12.1) |
| Never | 1 (4.5) | 0 | 0 | 1 (1.5) |
| How comfortable are you with using new technology with minimal training? — n (%) | ||||
| Extremely comfortable | 3 (13.6) | 3 (14.3) | 2 (8.7) | 8 (12.1) |
| Comfortable | 10 (45.5) | 5 (23.8) | 13 (56.5) | 28 (42.4) |
| Neutral | 8 (36.4) | 11 (52.4) | 5 (21.7) | 24 (36.4) |
| Uncomfortable | 1 (4.5) | 2 (9.5) | 3 (13.1) | 6 (9.1) |
| Extremely uncomfortable | 0 | 0 | 0 | 0 |
* IQR denotes interquartile range; NP, nurse practitioner; and PA, physician assistant.
† Race was self-reported by the practitioner.
‡ Languages health care practitioners indicated that they used professionally, other than English, included Spanish, Portuguese, Hmong, French, Chinese, various Southeast Asian dialects (unspecified), various Middle Eastern dialects (unspecified, not including Pashto and Arabic), and various African languages (unspecified).
§ Other specialties included addiction medicine; dermatology; endocrinology, diabetes, and metabolism; neurology; and rheumatology and arthritis.
¶ Health care practitioners could select multiple options, so percentages do not add up to 100%.
During the study period of usual care and ambient AI intervention, health care practitioners authored a total of 71,487 clinical notes, of which 44,395 were authored only by humans and 27,092 were generated using ambient AI. Among the human-authored notes, 31.9% were written during the intervention period, when ambient AI was available. The note-weighted mean of health care practitioner ambient AI utilization was 71.0% (SD 31.6%). Pre- and posttrial interviews were completed with a subgroup of practitioners to gather qualitative feedback on their experience (Supplementary Appendix, Sections 4 and 5).
PATIENT CHARACTERISTICS
Of all the eligible patients approached during the trial period, only 22 declined consent for audio recording, yielding a 99.92% consent rate. The median patient age was 49 years (interquartile range, 27–67), and 40.2% of encounters involved female patients. Most patients were identified as non-Hispanic white (88.4%), English was the preferred language in 98.1% of encounters, and most were conducted in person (92.1%). Additional patient variables are available in Table 2. No recording-related complaints were documented by patients during the ambient AI encounters.
Table 2.
Patient Encounter-level Baseline Characteristics.*
| Characteristic | Wave 1 Group (n=5740) | Wave 2 Group (n=5370) | Wave 3 Group (n=5483) | Overall (n=16,593) |
|---|---|---|---|---|
| Female sex — n (%) | 2247 (39.1) | 2224 (41.4) | 2197 (40.1) | 6668 (40.2) |
| Age (years) at encounter — median [IQR] | 50.0 [30.0–67.0] | 49.0 [28.0–67.0] | 48.0 [25.0–67.0] | 49.0 [27.0–67.0] |
| Race — n (%) | ||||
| Asian | 204 (3.6) | 210 (3.9) | 205 (3.7) | 619 (3.7) |
| African American or Black | 327 (5.7) | 367 (6.8) | 263 (4.8) | 957 (5.8) |
| Indigenous American or Alaska Native | 42 (0.7) | 36 (0.7) | 32 (0.6) | 110 (0.7) |
| Native Hawaiian or Pacific Islander | 2 (0.03) | 8 (0.15) | 4 (0.07) | 14 (0.08) |
| White | 5087 (88.6) | 4675 (87.1) | 4914 (89.6) | 14,676 (88.4) |
| Unknown | 8 (0.1) | 10 (0.19) | 6 (0.1) | 24 (0.1) |
| Declined to answer or blank | 70 (1.2) | 64 (1.2) | 59 (1.1) | 193 (1.1) |
| Ethnicity — n (%) | ||||
| Hispanic or Latino | 339 (5.9) | 329 (6.1) | 335 (6.1) | 1003 (6.0) |
| Not Hispanic or Latino | 5360 (93.4) | 5014 (93.4) | 5117 (93.3) | 15,491 (93.4) |
| Unknown | 5 (0.09) | 3 (0.06) | 0 | 8 (0.05) |
| Declined to answer or blank | 36 (0.6) | 24 (0.4) | 31 (0.6) | 91 (0.5) |
| Preferred language — n (%) | ||||
| American Sign Language | 5 (0.09) | 2 (0.04) | 4 (0.07) | 11 (0.07) |
| English | 5636 (98.2) | 5251 (97.8) | 5393 (98.4) | 16,280 (98.1) |
| Hmong | 3 (0.08) | 7 (0.1) | 5 (0.09) | 15 (0.09) |
| Spanish | 70 (1.2) | 79 (1.5) | 52 (0.9) | 201 (1.2) |
| Other | 25 (0.4) | 30 (0.6) | 29 (0.5) | 84 (0.5) |
| Blank | 1 (0.03) | 1 (0.02) | 0 | 2 (0.01) |
| Insurance type — n (%) | ||||
| Blue Shield | 755 (13.2) | 628 (11.7) | 722 (13.2) | 2105 (12.7) |
| Commercial or Commercial FFS | 941 (16.4) | 968 (18.0) | 953 (17.4) | 2862 (17.2) |
| Medicaid or Medicaid MCO | 483 (8.4) | 549 (10.2) | 498 (9.1) | 1530 (9.2) |
| Medicare | 1012 (17.6) | 1012 (18.8) | 1165 (21.2) | 3189 (19.2) |
| Medicare Advantage | 647 (11.3) | 613 (11.4) | 491 (9.0) | 1751 (10.6) |
| Quartz/Unity | 1711 (29.8) | 1435 (26.8) | 1499 (27.3) | 4645 (28.0) |
| Workers Comp/Other | 18 (0.3) | 8 (0.1) | 10 (0.2) | 36 (0.2) |
| None/self-pay | 173 (3.0) | 157 (2.9) | 145 (2.6) | 475 (2.9) |
| BMI — median [IQR] | 27.3 [22.7–33.0] | 27.6 [22.8–33.3] | 26.0 [21.2–31.3] | 27.0 [22.2–32.5] |
| Missing — n (%) | 161 (2.8) | 348 (6.5) | 270 (4.9) | 779 (4.7) |
| First systolic blood pressure (mmHg) — median [IQR] | 121 [110–133] | 123 [112–135] | 121 [111–134] | 122 [111–134] |
| Missing — n (%) | 894 (15.6) | 944 (17.6) | 2091 (38.1) | 3929 (23.7) |
| First diastolic blood pressure (mmHg) — median [IQR] | 76.0 [70.0–82.0] | 77.0 [71.0–83.0] | 76.0 [70.0–82.0] | 76.0 [70.0–82.0] |
| Missing — n (%) | 894 (15.6) | 944 (17.6) | 2091 (38.1) | 3929 (23.7) |
| Number of medications prescribed — median [IQR] | 1.0 [0–2.0] | 1.0 [0–2.0] | 1.0 [0–2.0] | 1.0 [0–2.0] |
| Elixhauser score — mean (SD) | 0.12 (2.51) | 0.47 (2.29) | 0.27 (2.04) | 0.28 (2.30) |
| Missing — n (%) | 5 (0.1) | 5 (0.1) | 5 (0.1) | 15 (0.1) |
| Diagnoses in Elixhauser categories† — n (%) | ||||
| Congestive heart failure | 47 (0.8) | 49 (0.9) | 43 (0.8) | 139 (0.8) |
| Cardiac arrhythmias | 144 (2.5) | 128 (2.4) | 112 (2.0) | 384 (2.3) |
| Hypertension | ||||
| Uncomplicated | 846 (14.7) | 633 (11.8) | 567 (10.3) | 2046 (12.3) |
| Complicated | 21 (0.4) | 16 (0.3) | 14 (0.3) | 51 (0.3) |
| Other neurologic disorders | 37 (0.6) | 139 (2.6) | 26 (0.5) | 202 (1.2) |
| Chronic pulmonary disease | 155 (2.7) | 162 (3.0) | 137 (2.5) | 454 (2.7) |
| Diabetes | ||||
| Uncomplicated | 284 (4.9) | 225 (4.2) | 168 (3.1) | 677 (4.1) |
| Complicated | 272 (4.7) | 155 (2.9) | 114 (2.1) | 541 (3.3) |
| Hypothyroidism | 257 (4.5) | 148 (2.8) | 124 (2.3) | 529 (3.2) |
| Renal failure | 111 (1.9) | 99 (1.8) | 62 (1.1) | 272 (1.6) |
| Liver disease | 53 (0.9) | 33 (0.6) | 34 (0.6) | 120 (0.7) |
| Solid tumor (without metastasis) | 50 (0.9) | 39 (0.7) | 26 (0.5) | 115 (0.7) |
| Rheumatoid arthritis/collagen vascular | 113 (2.0) | 292 (5.4) | 84 (1.5) | 489 (2.9) |
| Obesity | 292 (5.1) | 123 (2.3) | 93 (1.7) | 508 (3.1) |
| Depression | 296 (5.2) | 216 (4.0) | 227 (4.1) | 739 (4.5) |
| Visit mode — n (%) | ||||
| In-person | 5236 (91.2) | 4990 (92.9) | 5057 (92.2) | 15,283 (92.1) |
| Phone | 114 (2.0) | 49 (0.9) | 79 (1.4) | 242 (1.5) |
| Video | 390 (6.8) | 331 (6.2) | 347 (6.3) | 1068 (6.4) |
| Neighborhood-level demographics of encounters | ||||
| ADI — median [IQR] | ||||
| National | 37.0 [30.0–48.0] | 39.0 [30.0–50.0] | 35.0 [26.0–45.0] | 37.0 [28.0–48.0] |
| State | 2.00 [1.00–3.00] | 2.00 [1.00–4.00] | 2.00 [1.00–3.00] | 2.00 [1.00–3.00] |
| Missing — n (%) | 1216 (21.2) | 863 (16.1) | 846 (15.4) | 2925 (17.6) |
| Median household income — median [IQR] | US$118,000 [US$98,700–US$134,000] | US$110,000 [US$97,000–US$132,000] | US$120,000 [US$103,000–US$137,000] | US$118,000 [US$98,700–US$136,000] |
| Missing — n (%) | 1213 (21.1) | 851 (15.8) | 837 (15.3) | 2901 (17.5) |
| Proportion in poverty — median [IQR] | 6.02 [2.89–10.0] | 6.06 [3.38–10.0] | 5.53 [2.89–9.03] | 6.00 [3.15–9.44] |
| Missing — n (%) | 1187 (20.7) | 830 (15.5) | 814 (14.8) | 2831 (17.1) |
| Proportion on food stamps — median [IQR] | 5.83 [3.75–9.67] | 5.60 [3.74–9.89] | 4.85 [3.41–9.07] | 5.60 [3.70–9.47] |
| Missing — n (%) | 1187 (20.7) | 830 (15.5) | 814 (14.8) | 2831 (17.1) |
| Proportion homeowners — median [IQR] | 74.8 [58.0–84.1] | 74.8 [57.6–85.9] | 75.0 [59.0–84.8] | 75.0 [58.0–85.2] |
| Missing — n (%) | 1187 (20.7) | 830 (15.5) | 814 (14.8) | 2831 (17.1) |
| Proportion with education — median [IQR] | ||||
| Less than high school | 19.6 [12.9–26.9] | 3.35 [1.80–5.91] | 2.83 [1.59–5.02] | 3.06 [1.72–5.29] |
| High school | 19.6 [12.9–26.9] | 20.4 [12.9–28.8] | 18.9 [12.2–26.7] | 19.7 [12.7–27.3] |
| Some college | 27.6 [20.9–31.3] | 27.8 [21.2–31.8] | 26.6 [20.7–31.0] | 27.6 [20.9–31.6] |
| College | 48.6 [37.9–62.5] | 47.0 [33.6–61.4] | 50.0 [38.7–63.1] | 47.9 [36.9–62.5] |
| Missing — n (%) | 1187 (20.7) | 830 (15.5) | 814 (14.8) | 2831 (17.1) |
* Demographic information for 21 encounter notes could not be accounted for because of data merge issues; the true total number of encounter notes in the preimplementation baseline period was 16,614. BMI denotes body mass index (the weight in kilograms divided by the square of the height in meters); and IQR, interquartile range.
† Only categories with at least 100 patients are listed. Because patients could have multiple diagnosis codes, the denominator for each percentage is the total number of encounters in the group, and percentages do not add up to 100%.
PRIMARY OUTCOME
Survey completion by health care practitioners was 99.70% (n=329). Seven practitioners exhibited low fidelity with ambient AI: three were enrolled but never initiated use, and four used it inconsistently, with at least one period of 0% utilization. Reasons for noninitiation or zero utilization included device incompatibility, scheduling conflicts, a change of mind about giving up a human scribe, and user preference.
The practitioner-reported scores across all survey instruments and EHR audit log metrics are shown in Supplementary Appendix, Sections 6 and 7. Ambient AI use was associated with a nonsignificant increase of 0.14 points in professional fulfillment (robust 95% CI, 0.004 to 0.28; P=0.04) and a significant decrease of 0.44 points in work exhaustion/interpersonal disengagement (robust 95% CI, −0.62 to −0.25; P<0.001) on a five-point Likert scale (Fig. 2). At baseline, 56.1% of practitioners had scores above the threshold for high work exhaustion/interpersonal disengagement (at least 2.33), while 25.8% had scores above the threshold for high professional fulfillment (at least 4.0). By the end of the study, the proportion above the high work exhaustion/interpersonal disengagement threshold had decreased to 35.4%, while the proportion above the high professional fulfillment threshold had increased to 41.5%. The full distribution of scores between usual care and access to ambient AI is shown in Supplementary Appendix, Section 8. The treatment effect was sustained across the study waves, and the variability of primary outcome effect sizes across time is shown in Figure 3. The NNT to achieve this magnitude of work exhaustion/interpersonal disengagement reduction in one health care practitioner was 1.68 (95% CI, 1.55 to 1.82), corresponding to an absolute risk reduction of 0.60.
Figure 2. Trial-Averaged Effect of Ambient AI on Primary Outcome of Professional Well-Being (Subcomponents with Professional Fulfillment and Work Exhaustion/Interpersonal Disengagement [Burnout]).

Professional fulfillment was assessed via six questions scored on a five-point Likert scale, ranging from not at all true (1) to completely true (5), with an increase on the Likert scale representing a positive response; and burnout represents work exhaustion and interpersonal disengagement on a five-point Likert scale with 10 questions, with responses ranging from not at all (1) to extremely (5), with a decrease on the Likert scale representing a positive response. CI denotes confidence interval.
Figure 3. Forest Plot for Time-on-Treatment Effect on Primary Outcome of Professional Well-Being (Subcomponents of Professional Fulfillment and Burnout).

Professional fulfillment was assessed via six questions scored on a five-point Likert scale, ranging from not at all true (1) to completely true (5), with an increase on the Likert scale representing a positive response; and burnout represents work exhaustion and interpersonal disengagement on a five-point Likert scale with 10 questions, scored from not at all (1) to extremely (5), with a decrease on the Likert scale representing a positive response. CI denotes confidence interval.
In a sensitivity analysis, the primary and secondary analyses were repeated in a per-period complier population, defined as participants whose ambient AI utilization rate for a wave period was above 0% after they were given access. For example, if a participant had a utilization rate above 0% in the first wave period after being given access but then ceased using the tool for the remaining waves, only their data through the compliant wave period were used. Ambient AI use was associated with a nonsignificant increase of 0.15 points in professional fulfillment (robust 95% CI, 0.01 to 0.29; P=0.032) and a decrease of 0.46 points in work exhaustion/interpersonal disengagement (robust 95% CI, −0.65 to −0.26; P<0.001) on a five-point Likert scale, representing minimal departure from the primary ITT analysis results.
DOCUMENTATION EFFICIENCY AND SECONDARY MEASURES
Week-averaged WoW ranged between 0.01 hours and 35.08 hours, and time spent on notes ranged between 0.27 hours and 11.17 hours. Compared with the control period, health care practitioners with ambient AI spent less time in the EHR per 8 hours of patient time, with reductions in week-averaged time spent on notes (−0.36 hours; robust 95% CI, −0.55 to −0.17; adjusted P<0.001). In a sensitivity analysis, exclusion of the top 0.5% of observations (revised range: 0 to 8.17 hours) still showed a reduction in time spent on notes (−0.34 hours; 95% CI, −0.50 to −0.17). Week-averaged WoW was reduced (−0.50 hours; robust 95% CI, −0.90 to −0.09; adjusted P=0.03), but excluding the top 3% of observations (revised range: 0.01 to 9.87 hours) attenuated the effect size, resulting in a nonsignificant reduction (−0.19 hours; robust 95% CI, −0.40 to 0.03). The proportion of encounters closed on the same day showed a nonsignificant increase with ambient AI use (absolute increase of 2.61%; robust 95% CI, −0.29 to 5.51; adjusted P=0.10), whereas the proportion of encounters closed before the next encounter (−0.21%; robust 95% CI, −1.82 to 1.41; adjusted P=0.80) and the proportion of encounters followed up within 2 weeks (0.51%; robust 95% CI, −0.40 to 1.43; adjusted P=0.32) did not differ (Supplementary Appendix, Section 9).
Among secondary measures, ambient AI was associated with reduced task load (−6.72; robust 95% CI, −10.43 to −3.00; adjusted P=0.001) and a decrease in the negative impact of work on personal relationships (−0.79; robust 95% CI, −1.31 to −0.26; adjusted P=0.01). It also improved perceived efficiency in clinical practice (−0.66; robust 95% CI, −0.94 to −0.39; adjusted P<0.001). Ambient AI use also increased the meaningfulness of work (1.13; robust 95% CI, 0.71 to 1.55; adjusted P<0.001) and trust in AI (0.32; robust 95% CI, 0.05 to 0.59; adjusted P=0.03). No changes were observed for perceived control over work schedule (0.10; 95% CI, −0.11 to 0.30; adjusted P=0.382).
In a sensitivity analysis, results for secondary outcomes were minimally altered: ambient AI correlated with a reduction in task load (−7.70; robust 95% CI, −11.39 to −4.01), a reduction in the negative impact of work on personal relationships (−0.78; robust 95% CI, −1.31 to −0.25), and an improvement in perceived efficiency of clinical practice (−0.75; robust 95% CI, −1.02 to −0.47). Using ambient AI also increased the meaningfulness of work (1.10; robust 95% CI, 0.68 to 1.51) and increased practitioners’ trust in AI (0.33; robust 95% CI, 0.05 to 0.61). No changes were observed for perceived control over work schedule (0.12; 95% CI, −0.10 to 0.33).
DRIFT MONITORING, ICD-10 DIAGNOSTIC CODING COMPLIANCE, AND NOTE QUALITY
Across 15 sequential 1-month rolling windows, no drift events were observed. A planned software update on October 29, 2024, was associated with a nonsignificant change in ambient AI utilization (+0.17 percentage points; 95% CI, −8.33 to 8.67; P=0.97), as well as nonsignificant differences between trial groups regarding WoW (+0.10 hours; 95% CI, −1.14 to 1.33; P=0.88) and time spent on notes (+0.17 hours; 95% CI, −0.29 to 0.61; P=0.50).
To assess ICD-10 coding compliance, we evaluated a stratified random sample of 6110 clinical notes authored by all 66 randomly assigned health care practitioners during the study period. The sample was balanced by health care practitioner and note type, with 50.4% generated using ambient AI. Notes produced with ambient AI demonstrated greater compliance with final ICD-10 billing diagnoses, as adjudicated by professional coders (Supplementary Appendix, Section 9), compared with notes authored without AI assistance, with a mean ICD-10 compliance score of 6.87 (95% CI, 6.76 to 6.98) for ambient AI generated notes versus 5.94 (95% CI, 5.81 to 6.07) for human-authored notes (P<0.001) on a scale of 0 to 10.
In the PDSQI-9 evaluation, we analyzed 7966 randomly sampled notes representative of all health care practitioners who used the ambient AI system. On average, notes contained 6559 input tokens (SD 2436) and 1953 output tokens (SD 334). For accuracy, which captures the fidelity of extraction and detection of any falsification or fabrication, the mean score was 4.44 (SD 0.93) on a five-point Likert scale. For thoroughness, assessing major omissions, the mean score was 4.57 (SD 0.68). The domain of abstraction (i.e., synthesis), reflecting the ability to integrate and summarize across data elements, scored 3.97 (SD 0.58). For usefulness, defined as the generation of information relevant and helpful for specialty-specific clinical decision-making, the mean score was 4.83 (SD 0.51). Finally, for linguistic quality, the software achieved high ratings: organization, 4.90 (SD 0.16); comprehensibility, 4.99 (SD 0.13); and succinctness, 4.63 (SD 0.55). Stigmatizing language appeared in less than 1% of the notes (n=12).
Discussion
This study demonstrates that ambient AI scribing reduced health care practitioner work exhaustion/interpersonal disengagement, with an NNT of 1.68 health care practitioners to achieve a 0.44-point reduction on a five-point work exhaustion/interpersonal disengagement scale, an effect that was also clinically meaningful. This is among the first randomized controlled trials of ambient AI scribing, building on prior pilot studies.4,9,12 Across 24 weeks, health care practitioners authored over 70,000 clinical notes, including 27,092 with ambient AI, underscoring high uptake and real-world feasibility.
A concurrent randomized trial by Lukac et al., available as a preprint at the time of this report, evaluated two ambient AI scribes (Microsoft DAX and Nabla) at an academic health system in California. Their primary outcome was time spent on notes, with a significant reduction observed for Nabla but not DAX (41 vs. 18 seconds).14 Burnout, measured with the same PFI instrument, showed nonsignificant reductions. Notably, utilization in their study was approximately 30%, less than half of that observed in our cohort. We attribute our higher fidelity in part to pilot work with implementation scientists and human-factors experts using rapid Plan-Do-Study-Act cycles. Our trial demonstrated improvements in multiple domains of health care practitioners’ well-being. In addition to a reduction in work exhaustion/interpersonal disengagement, there were reductions in task load and after-hours work, and gains in perceived efficiency and meaningfulness of clinical work.
Objective EHR metrics echoed the health care practitioner-reported benefits, with a decrease in documentation time of 0.36 hours (21.9 minutes) per day. Although WoW also decreased by 0.50 hours (30 minutes) per day, sensitivity analyses suggested that the effect was likely concentrated among a small subset of practitioners with an extreme after-hours burden. In contrast, time spent on notes declined consistently across sensitivity analyses. Coding for billing diagnoses improved, and documentation quality, as measured with the PDSQI-9, was consistently high across domains of accuracy, completeness, and linguistic quality, with particularly strong performance on measures of comprehensibility and organization. As part of our real-time monitoring, we implemented a software-agnostic drift detection system that continuously tracked documentation-efficiency metrics in rolling windows. The system flagged no major workflow disruptions during the 24-week trial period. Qualitative interviews revealed that health care practitioners experienced reduced cognitive burden, enhanced patient engagement, and alleviated evening charting, most likely contributing further to the reduction in work exhaustion/interpersonal disengagement. These insights mirror user-reported benefits from pilot studies at other health systems,5 as well as large-scale deployments.13 Nonadoption drivers from the interviews included device constraints, visit type (e.g., procedure-only), and patient preference.
Our design offered several pragmatic and methodological advantages. Implementation was pragmatic: training consisted of a 1-hour webinar and a user guide, licenses were distributed sequentially across specialties, and integration used Epic’s private APIs. Real-time drift monitoring ensured continuity without workflow disruption, reflecting operational conditions relevant to other health systems. Methodologically, first, the stepped-wedge randomization, paired with mixed-effects ITT modeling, mitigated the secular trends and intrapractitioner correlation inherent in nonrandomized evaluations. Second, the extended duration permitted assessment of durable effects beyond the typical 4- to 12-week windows of prior nonrandomized pilots.3–8,10–12 Third, our pragmatic rollout evenly licensed ambient AI access for all health care practitioners within standard operational workflows, ensuring equitable access across specialties. The sequential deployment mirrored real-world conditions with staggered training, licensing logistics, and IT coordination, offering a scalable implementation model for health systems.
This study has several limitations. The practitioner cohort was predominantly younger, white, and female, with nearly half from family medicine. These features limit generalizability, particularly to more diverse, community-based, and late-adopting populations. Because participation required voluntary enrollment, the study population may reflect early adopters, who are more comfortable with novel technologies. Sensitivity analyses stratified by utilization levels supported the primary findings, but self-selection bias remains possible. Given the open-label design, expectancy or Hawthorne effects are possible, as administration of the PFI may increase practitioners’ awareness of particular work domains and thereby influence their self-reported experiences. The trial did not collect patient-reported outcomes related to comfort or disclosure; future trials should assess patient experience and relationship quality.
In conclusion, this trial demonstrates that ambient AI scribes may reduce health care practitioner work exhaustion, interpersonal disengagement, and documentation burden without compromising billing compliance, supporting further evaluation in diverse settings. The improvement in work exhaustion and interpersonal disengagement but nonsignificance in professional fulfillment may reflect relief from negative affect (burnout) through documentation efficiencies without corresponding gains in positive affect (fulfillment), which may be influenced by broader organizational factors, such as staffing levels, professional autonomy, and team climate.
Supplementary Material
Disclosures
Author disclosures and other supplementary materials are available at ai.nejm.org.
This work was supported by funding from the University of Wisconsin Hospital and Clinics and the National Institutes of Health Clinical and Translational Science Award (NIH/NCATS UL1TR002737). No funding was provided by the AI software company, and all software licenses were procured under a vendor software-as-a-service (SaaS) agreement between UW Health and Abridge AI, 2024.
We thank Tom Wise (Business Relation Management), Christine Cunningham and Michele Nickels (Health Information Management), Troy Lepein (Risk and Compliance/Business Integrity), Karen Nachman, Luke Rislove (Enterprise Analytics), Rachelle Buol and Tori L. McKinley (Clinical Documentation Integrity), and Nicole Riechers (IS Project Management Office). We also thank the Abridge team for their technical support and Michael Oberst, Ph.D., for his review of the manuscript.
This study’s data are available at https://git.doit.wisc.edu/smph-public/LearningHealthSystem/ambientlistening.
References
- 1. Adler-Milstein J, Zhao W, Willard-Grace R, Knox M, Grumbach K. Electronic health records and burnout: time spent on the electronic health record after hours and message volume associated with exhaustion but not with cynicism among primary care clinicians. J Am Med Inform Assoc 2020;27:531–538. DOI: 10.1093/jamia/ocz220.
- 2. McPeek-Hinz E, Boazak M, Sexton JB, et al. Clinician burnout associated with sex, clinician type, work culture, and use of electronic health records. JAMA Netw Open 2021;4:e215686. DOI: 10.1001/jamanetworkopen.2021.5686.
- 3. Shah SJ, Crowell T, Jeong Y, et al. Physician perspectives on ambient AI scribes. JAMA Netw Open 2025;8:e251904. DOI: 10.1001/jamanetworkopen.2025.1904.
- 4. Albrecht M, Shanks D, Shah T, et al. Enhancing clinical documentation with ambient artificial intelligence: a quality improvement survey assessing clinician perspectives on work burden, burnout, and job satisfaction. JAMIA Open 2024;8:ooaf013. DOI: 10.1093/jamiaopen/ooaf013.
- 5. Duggan MJ, Gervase J, Schoenbaum A, et al. Clinician experiences with ambient scribe technology to assist with documentation burden and efficiency. JAMA Netw Open 2025;8:e2460637. DOI: 10.1001/jamanetworkopen.2024.60637.
- 6. Haberle T, Cleveland C, Snow GL, et al. The impact of nuance DAX ambient listening AI documentation: a cohort study. J Am Med Inform Assoc 2024;31:975–979. DOI: 10.1093/jamia/ocae022.
- 7. Liu T-L, Hetherington TC, Stephens C, et al. AI-powered clinical documentation and clinicians’ electronic health record experience: a nonrandomized clinical trial. JAMA Netw Open 2024;7:e2432460. DOI: 10.1001/jamanetworkopen.2024.32460.
- 8. Ma SP, Liang AS, Shah SJ, et al. Ambient artificial intelligence scribes: utilization and impact on documentation time. J Am Med Inform Assoc 2025;32:381–385. DOI: 10.1093/jamia/ocae304.
- 9. Shah SJ, Devon-Sand A, Ma SP, et al. Ambient artificial intelligence scribes: physician burnout and perspectives on usability and documentation burden. J Am Med Inform Assoc 2025;32:375–380. DOI: 10.1093/jamia/ocae295.
- 10. Stults CD, Deng S, Martinez MC, et al. Evaluation of an ambient artificial intelligence documentation platform for clinicians. JAMA Netw Open 2025;8:e258614. DOI: 10.1001/jamanetworkopen.2025.8614.
- 11. Rotenstein L, Melnick ER, Iannaccone C, et al. Virtual scribes and physician time spent on electronic health records. JAMA Netw Open 2024;7:e2413140. DOI: 10.1001/jamanetworkopen.2024.13140.
- 12. Tierney AA, Gayre G, Hoberman B, et al. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catal 2024;5(3). DOI: 10.1056/CAT.23.0404.
- 13. Tierney AA, Gayre G, Hoberman B, et al. Ambient artificial intelligence scribes: learnings after 1 year and over 2.5 million uses. NEJM Catal 2025;6(5). DOI: 10.1056/CAT.25.0040.
- 14. Lukac PJ, Turner W, Vangala S, et al. A randomized clinical trial of two ambient artificial intelligence scribes: measuring documentation efficiency and physician burnout. July 11, 2025 (https://www.medrxiv.org/content/10.1101/2025.07.10.25331333). Preprint.
- 15. Afshar M, Resnik F, Baumann MR, et al. A novel playbook for pragmatic trial operations to monitor and evaluate ambient artificial intelligence in clinical practice. NEJM AI 2025;2(9). DOI: 10.1056/AIdbp2401267.
- 16. Croxford E, Gao Y, Pellegrino N, et al. Development and validation of the provider summarization quality instrument for large language models (PDSQI-9). J Am Med Inform Assoc 2025;32:1050–1060. DOI: 10.1093/jamia/ocaf068.
- 17. Croxford E, Gao Y, Pellegrino N, et al. Evaluating clinical AI summaries with large language models as judges. NPJ Digit Med 2025;8(1):640. DOI: 10.1038/s41746-025-02005-2.
- 18. Trockel M, Bohman B, Lesure E, et al. A brief instrument to assess both burnout and professional fulfillment in physicians: reliability and validity, including correlation with self-reported medical errors, in a sample of resident and practicing physicians. Acad Psychiatry 2018;42:11–24. DOI: 10.1007/s40596-017-0849-3.
- 19. Bohman BD, Makowski MS, Wang H, Menon NK, Shanafelt TD, Trockel MT. Empirical assessment of well-being: the Stanford model of occupational wellbeing. Acad Med 2025;100:960–967. DOI: 10.1097/ACM.0000000000006025.
- 20. Sinsky CA, Rule A, Cohen G, et al. Metrics for assessing physician activity using electronic health record log data. J Am Med Inform Assoc 2020;27:639–643. DOI: 10.1093/jamia/ocz223.
- 21. Localio AR, Berlin JA, Ten Have TR, Kimmel SE. Adjustments for center in multicenter studies: an overview. Ann Intern Med 2001;135:112–123. DOI: 10.7326/0003-4819-135-2-200107170-00012.
- 22. Bell RM, McCaffrey DF. Bias reduction in standard errors for linear regression with multi-stage samples. Surv Methodol 2002;28:169–181 (https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2002002/article/9058-eng.pdf?st=X957j0aS).
- 23. Pustejovsky JE, Tipton E. Small sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. J Bus Econ Stat 2018;36:672–683. DOI: 10.1080/07350015.2016.1247004.
- 24. Pustejovsky JE. clubSandwich: cluster-robust (sandwich) variance estimators with small-sample corrections. R package version 0.6.1.9999. 2025 (http://jepusto.github.io/clubSandwich/).
- 25. Bender R. Number needed to treat (NNT). Encyclo Biostat 2005;6:3752–3761. DOI: 10.1002/0470011815.b2a04032.
- 26. Bender R. Calculating confidence intervals for the number needed to treat. Control Clin Trials 2001;22:102–110. DOI: 10.1016/s0197-2456(00)00134-3.
- 27. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020;26:1351–1363. DOI: 10.1038/s41591-020-1037-7.