
This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2026 Feb 22:2026.02.20.26346503. [Version 1] doi: 10.64898/2026.02.20.26346503

Randomized Trial Protocol: Epic Generative AI Chart Summarization Tool to Reduce Ambulatory Provider Cognitive Task Load

Aaron T Chin 1,2,*, Nina Zhu 1,2,*, Thomas Kingsley 1, Pallavi Mynampati 1, Yan Phipps 1, Artem Romanov 3, Sitaram Vangala 3, Maxwell Weng 4, Lauren E Wisk 3,6, Hawkin Woo 1,3, John N Mafi 3,5, Paul J Lukac 1,2
PMCID: PMC12934868  PMID: 41757203

Abstract

Background:

EHR documentation and chart review contribute to clinician workload and burnout. To alleviate pre-charting burden, Epic has released a new generative AI chart summarizer tool, which has become widely adopted; however, its impact has not been examined in randomized trials.

Objective:

To evaluate whether access to an Epic generative AI chart summarization tool reduces cognitive task load among ambulatory providers compared with usual care.

Methods:

Two-arm, parallel-group randomized controlled trial among ambulatory clinicians across multiple specialties. Clinicians will be randomized 1:1 to tool access versus usual care for 90 days. The primary outcome is change in a 4-item physician task load (PTL) measure adapted for the pre-charting task. Exploratory outcomes include EHR-derived time metrics (Caboodle and Signal), professional fulfillment/burnout (PFI), usability (SUS), clinician satisfaction, an aggregated patient experience item from CG-CAHPS, and reported safety-related metrics.

Ethics and Dissemination:

Analyses will use clinician-level survey responses and aggregated EHR metrics; no patient-level protected health information will be included in the analytic dataset. Results will be disseminated via preprint and peer-reviewed publication.

Keywords: generative AI, EHR, chart review, clinician burden, task load, randomized trial

Background and Rationale

Introduction

Clinical documentation in the electronic health record (EHR) is essential for sharing information among care teams, accurately tracking changes in a patient’s health, and supporting appropriate medical billing. However, significant documentation burden contributes heavily to physician burnout and is associated with increased medical errors, worsening workforce shortages, and excess healthcare costs.(1–3) Excessive time spent interacting with the EHR, often extending into after-hours “pajama time”, is a major driver of this burnout.(4) In ambulatory care settings, the common practice of “pre-charting” compels providers to review multiple aspects of a patient’s chart, such as recent notes and lab results, to contextualize the patient’s history with the reason for their visit.(5) This time-consuming process can impact the quality, effectiveness, and efficacy of patient care.

Leveraging OpenAI’s GPT, Epic Systems, Inc (Verona, WI), an EHR company, has developed multiple summarization-based tools to ease administrative burden related to documentation and chart search and to boost end-user productivity. Across its suite of summarization tools, Epic generates over 16 million summaries monthly.(6) One such tool is a chart summary application that generates summaries of historical patient notes to assist ambulatory clinicians during pre-charting. One small pre/post quasi-experimental study of this functionality showed potential time savings with generally positive reception from physicians.(7) Despite this promise, deploying generative AI in healthcare introduces unique challenges. There is an inherent risk of hallucinations, inaccuracies, and omissions in AI-generated text, which could inadvertently increase physician documentation time and lead to clinical errors, among other harms.(8–12) Additional challenges include technical limitations such as latency and scalability, as well as suboptimal human–AI interface design that may impede efficient integration into clinical workflows. As a result, the net value of these tools cannot be assumed and must be evaluated in real-world settings, particularly given the non-trivial financial, computational, and environmental costs associated with their deployment. These challenges underscore the importance of rigorous evaluations of generative AI (genAI) prior to and throughout its integration into clinical care.

Although Epic has released a rapidly expanding suite of genAI features, there remains a lack of rigorous evaluation, including randomized controlled trials, to assess their usability, utility, or safety in real clinical workflows.(7, 13, 14) As a result, despite widespread interest, the broader healthcare community has little empirical evidence on how these tools perform in practice, impeding informed purchasing decisions. To inform the design of this study, we drew on the methodology of a prior randomized controlled trial evaluating ambient scribes, which served as a valuable precedent.(9) By conducting a randomized controlled trial, this study will assist UCLA Health in making a data-driven business decision and will help fill that knowledge gap by generating high-quality data on user experience and impact.

Study Aims and Hypotheses

The primary aim of this study is to evaluate whether a native Epic genAI chart summarization tool reduces physician task load (PTL) compared to a control group. Exploratory endpoints include assessing the tool’s effect on clinician metrics such as self-reported burnout, physician satisfaction, and productivity, as well as time efficiency in pre-charting tasks as measured via a Caboodle (native Epic dimensional database) metric. Additionally, we will explore the impact of this technology on the patient experience, specifically patients’ perceptions of how informed the clinician was about their medical history. Finally, we will evaluate whether AI literacy modifies adoption and the effect of the tool.

Methods

Trial design

Two-arm, parallel-group randomized controlled trial with 1:1 allocation to intervention versus usual care control over a 90-day period.

Setting

The study will be conducted in ambulatory outpatient clinics at the University of California, Los Angeles (UCLA) Health, an integrated academic health system using a single enterprise electronic health record (Epic Systems, Verona, WI). Multiple ambulatory specialties will be included to reflect routine clinical workflows.

Participants

Eligible participants are ambulatory physicians and advanced practice practitioners (APP) across multiple specialties who hold at least one half-day of clinic per week. Participants will be recruited via email and/or survey invitation. Physicians in training are excluded.

Inclusion criteria

  • Ambulatory physicians and APPs with at least one half-day clinic session per week

  • Completion of the baseline survey

Exclusion criteria

  • Physicians in training (residents and fellows) and psychologists

Randomization and allocation

A total of 284 ambulatory providers will be randomized 1:1 to the intervention or usual care control group. Randomization will be stratified by whether the participant has an active AI scribe license, and covariate-constrained randomization will be performed within strata to improve balance on baseline PTL (NASA-TLX–adapted score) and a modified baseline chart review time (Caboodle-derived). Due to the nature of the intervention, participants cannot be blinded to group assignment.
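
The covariate-constrained step can be sketched as a rejection-style search: sample many candidate 1:1 splits within a stratum and keep the one that best balances the two baseline covariates. This is a minimal illustration under assumed inputs (covariates passed as plain dicts, a simple range-scaled imbalance score), not the trial’s actual randomization code:

```python
import random

def constrained_allocation(ids, ptl, chart_time, n_candidates=10_000, seed=26):
    """Pick a 1:1 allocation minimizing imbalance in two baseline covariates.

    ids: provider IDs within one stratum (even length assumed).
    ptl, chart_time: dicts mapping ID -> baseline covariate value.
    """
    rng = random.Random(seed)
    half = len(ids) // 2

    def imbalance(arm_a):
        arm_b = [i for i in ids if i not in arm_a]
        score = 0.0
        for cov in (ptl, chart_time):
            mean_a = sum(cov[i] for i in arm_a) / len(arm_a)
            mean_b = sum(cov[i] for i in arm_b) / len(arm_b)
            spread = max(cov.values()) - min(cov.values()) or 1.0
            score += abs(mean_a - mean_b) / spread  # scale-free imbalance
        return score

    best = None
    for _ in range(n_candidates):
        arm_a = set(rng.sample(ids, half))
        s = imbalance(arm_a)
        if best is None or s < best[0]:
            best = (s, arm_a)
    arm_a = best[1]
    return {i: ("intervention" if i in arm_a else "control") for i in ids}
```

Sampling candidate splits and keeping the best-balanced one is only one of several covariate-constrained schemes; the fixed seed makes the allocation reproducible for audit.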

Intervention

The trial intervention will span from 2/23/2026 to 5/24/2026, encompassing a 90-day study period. Participants in the intervention group will continue their usual clinical practice with access to Epic’s outpatient chart summarization tool. Use of the tool is optional and intended solely to provide a summary for providers; it does not provide clinical decision support. It automatically generates a general or focused summary of the most recent clinical notes for any patient on a provider’s schedule. The number of notes summarized is limited by the character constraints of the EHR: 24,000 English characters or 30 notes. The tool includes a “focus on” feature allowing a short user prompt to center the summary on a specific topic. Users may also manually select notes for summarization. For scheduled patients, summaries are batch-generated approximately 36 hours before the appointment; for same-day appointments, users may need to trigger summary generation. Summaries are created exclusively from clinical notes and do not draw from any other parts of the patient chart, such as labs and imaging, which will still require manual review by the physician. The included note types will be progress notes, consults, procedures, H&Ps, discharge summaries, and ED provider notes. All participants will receive training before using the tool covering tool functionality and the general limitations and risks of genAI output.

Comparator (usual care)

Clinicians in the control arm will continue usual pre-visit chart review workflows without access to the chart summarization tool during the 90-day period.

Outcomes

Primary Outcome

The four-item physician task load (PTL) measure, adapted from the NASA Task Load Index (TLX), is a validated instrument for assessing EHR-related tasks using four sub-scales (mental demand, temporal demand, physical demand, and effort).(15) Here it is adapted to capture the task of pre-charting, defined as the practice of reviewing patient information in the EHR prior to a visit. Each sub-scale is rated from 0 (low) to 100 (high), and the four sub-scales are aggregated to a 0–400 point scale. No patient-level information will be collected for this outcome measure.
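
Assuming the four sub-scales are simply summed (an assumption consistent with the stated 0–400 range), the scoring reduces to:

```python
def ptl_score(mental, temporal, physical, effort):
    """Aggregate the four PTL sub-scales (each rated 0-100) to a 0-400 total.

    Sub-scale names follow the protocol text; plain summation is an
    assumption consistent with the stated 0-400 range.
    """
    subscales = (mental, temporal, physical, effort)
    for s in subscales:
        if not 0 <= s <= 100:
            raise ValueError("each sub-scale must be rated 0-100")
    return sum(subscales)
```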

Prespecified exploratory outcomes

Outcome Measure: Measure Description
Modified Total Chart Time Per Encounter: Using Caboodle, Epic’s enterprise data warehouse, we will use a customized metric to measure clinician time spent reviewing the patient chart. Based on internal validation of where clinicians could access the chart summarization tool, the metric includes Caboodle Tier 1 activities corresponding to Clinical Review (all activities), Documentation limited to pre-charting activity only, and Other limited to Navigator-related activity, as well as the activity dedicated to the chart summarization tool. All note-writing, order entry, and encounter-signing activities are excluded. Time is captured for office and telemedicine visits and includes pre-encounter and in-encounter activity, with no time captured after checkout. Change will be assessed relative to a 6-month retrospective baseline, with analyses focused on months 2 and 3.
Professional Fulfillment Index: The Professional Fulfillment Index (PFI) is a validated 16-item instrument that uses a 5-point Likert scale (0–4) to measure professional fulfillment, work exhaustion, and interpersonal disengagement.(16) Burnout is reported based on combined results of the work exhaustion and interpersonal disengagement subscales.
Self-Reported Pre-Charting Effectiveness: Providers will answer questions regarding self-reported effectiveness and efficiency in pre-charting. The questions use a 5-point Likert scale, with higher scores indicating greater self-reported efficiency and effectiveness.
Provider Satisfaction Scores: Self-reported satisfaction survey that includes the physician-reported effect of the chart summarization tool on pre-charting efficiency and effectiveness, as well as other potential unintended consequences. No patient-level information will be collected for this outcome.
System Usability Scale: The System Usability Scale (SUS) is a ten-item questionnaire that uses a 5-point Likert scale to measure different aspects of system usability. The total scale ranges from 0–100; a higher SUS score indicates higher usability.
Reported Safety: Clinician-reported safety of AI-generated chart summaries, including perceived frequency of clinically significant errors and occurrence of major safety events during the intervention period. No patient-level information will be collected for this outcome.
Tool-Specific EHR Feedback: Aggregated real-time feedback submitted by clinicians through the EHR-native summarization interface during the intervention period, including ratings of accuracy, completeness, and hallucinations. No patient-level information will be collected for this outcome.
Consumer Assessment of Healthcare Providers and Systems Clinician & Group Survey (CG-CAHPS) Metric: The CG-CAHPS is a standardized patient feedback survey measuring experiences with providers.(17) We will examine changes in CG-CAHPS mean scores relative to the 6 months prior to the intervention for the question “In the last 6 months, how often did this doctor seem to know the important information about your medical history”.
Change in EHR Signal (Activity) Data, Time Outside Scheduled Hours: We will examine change from a retrospective baseline 6 months prior to enrollment in Signal metrics, including time outside scheduled hours per scheduled day. These data will characterize how a provider’s time in the EHR is utilized.
Clinician RVUs per Week: Average relative value units (RVUs) per week during intervention months 2 and 3, adjusted for baseline (6-month pre-intervention period).
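
Several of the instruments above reduce to simple scoring formulas. As one example, the SUS has a well-known conventional scheme (odd items contribute response - 1, even items 5 - response, summed and scaled by 2.5); assuming the trial scores it conventionally, a minimal scorer looks like:

```python
def sus_score(responses):
    """Conventional SUS scoring: 10 items, each rated 1-5.

    Odd-numbered items contribute (response - 1), even-numbered items
    (5 - response); the sum is scaled by 2.5 to a 0-100 range.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("expected 10 responses, each rated 1-5")
    total = sum(r - 1 if i % 2 == 0 else 5 - r
                for i, r in enumerate(responses))  # i = 0 is item 1 (odd)
    return total * 2.5
```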

Effect Modification and Subgroup Analyses

The following subgroup analyses are prespecified and exploratory in nature.

  • Baseline chart review time (Caboodle-derived; continuous).

  • AI literacy (MAILS short version; total score 0–100; continuous).(18, 19)

  • Access to ambient scribe (binary).

  • Specialty category (primary care, medical specialty, surgical specialty).

  • Panel complexity (clinician-level RAF score; continuous).

  • Clinician age group.

  • Clinician sex.

  • Exploratory time-varying effects by month during the 90-day intervention period.

Data Collection and Sources

The study dataset will include both pre-implementation and post-implementation data. Pre-implementation data will be collected via a Qualtrics survey administered prior to tool deployment to confirm physician eligibility, document opt-in, and capture baseline characteristics. Post-implementation data will be collected after three months of tool use and will include physician follow-up survey responses, EHR-derived usage metrics from Epic Caboodle and Signal, and aggregated patient encounter data. We will also extract aggregated, tool-specific feedback submitted within the chart summarization interface during the intervention period and calculate category-specific report rates relative to total summary generation volume. No patient-level protected health information will be collected; all EHR data will be de-identified, aggregated, and analyzed exclusively at the physician level.

Statistical Analysis

Linear mixed models will be used to estimate intervention effects, with provider random effects accounting for repeated measurements (baseline and follow-up for survey-derived outcomes; baseline, month 1, and months 2+3 for modified total chart time). Models will include fixed effects for study arm, response period, and the interaction of these terms, and will adjust for the following provider characteristics: sex, age, specialty (primary care, medical subspecialty, and surgical subspecialty), and number of half-days in clinic, as well as baseline task load score, modified total chart time per encounter, and ambient scribe usage. Intervention effects will be evaluated using linear contrasts between study arms in the follow-up period for survey outcomes, and in months 2+3 for modified total chart time. Exploratory analyses evaluating effect heterogeneity will be performed by testing for interactions with the proposed modifiers and estimating effects in subgroups using linear contrasts. We will also perform an exploratory analysis contrasting the effect on modified total chart time in month 1 with that in months 2+3 using linear contrasts.
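
The arm-by-period interaction contrast at the heart of this model can be illustrated with a plain difference-in-differences calculation; this is a minimal sketch that omits random effects, covariate adjustment, and inference, and the record format is an assumption for illustration:

```python
def did_estimate(records):
    """Arm x period linear contrast: (follow-up - baseline) in the
    intervention arm minus (follow-up - baseline) in the control arm.

    records: iterable of (arm, period, outcome) tuples, with arm in
    {"intervention", "control"} and period in {"baseline", "followup"}.
    """
    sums = {}
    for arm, period, y in records:
        total, n = sums.get((arm, period), (0.0, 0))
        sums[(arm, period)] = (total + y, n + 1)
    mean = lambda key: sums[key][0] / sums[key][1]
    return ((mean(("intervention", "followup")) - mean(("intervention", "baseline")))
            - (mean(("control", "followup")) - mean(("control", "baseline"))))
```

Subtracting the control arm’s pre-post change removes secular trends (e.g., seasonal workload shifts) that affect both arms, which is exactly what the arm-by-period interaction term captures in the mixed model.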

The proposed intention-to-treat analysis of survey outcomes is robust to dropout under the missing at random (MAR) assumption. We will perform sensitivity analyses comparing the characteristics of responders and non-responders to assess the plausibility of this assumption, and will perform robustness checks using other approaches for addressing missing data (e.g., multiple imputation). Evaluation of the primary outcome will use a 0.05 significance level; all other analyses will report effect estimates and 95% confidence intervals without hypothesis testing. Analyses will be performed using R v. 4.5.1 (https://www.r-project.org/).

Sample Size and Power

The sample size of 284 participants provides 80% power to detect intervention effects as small as 0.33 standard deviations (small-to-medium effect size). This assumes a two-sample t-test on the pre-post change (a conservative approximation of the planned linear mixed model analysis), and a two-sided 0.05 significance level.
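
The stated power assumption can be checked with a normal-approximation power calculation (a standard approximation to two-sample t-test power; a sketch, not the trial’s actual computation):

```python
from statistics import NormalDist

def power_two_sample(d, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test for a
    standardized effect size d (e.g., on pre-post change scores),
    using the normal approximation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)     # two-sided critical value
    ncp = d * (n_per_arm / 2.0) ** 0.5     # noncentrality under H1
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)
```

With 284 providers (142 per arm) and d = 0.33, this returns approximately 0.79, consistent with the stated 80% power.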

Data monitoring and missing data

Given the minimal-risk, clinician-facing nature of the study and the absence of patient-level outcomes, no data monitoring committee is planned. EHR-derived metrics from Caboodle and Signal are expected to be complete for enrolled clinicians, so missing data for these outcomes are expected to be negligible. For survey-based measures, we expect high follow-up survey completion, but we will evaluate the extent and sources of missingness by comparing all baseline characteristics for clinicians with complete versus incomplete data at follow-up, allowing us to assess the plausibility of a MAR assumption. The subsequent missing-data strategy for follow-up survey data will be informed by this evaluation.

Patient involvement

Patients were not involved in the design, conduct, reporting, or dissemination plans of this research. The study focuses on clinician-facing workflow tools, and outcomes are measured at the clinician level using survey instruments and aggregated EHR-derived metrics.

Ethics approval and consent

IRB-26-0066, approved. All participants will provide informed consent prior to enrollment.

Privacy and data security

Data will be stored on secure institutional systems with access restricted to study personnel. The analytic dataset will be limited to clinician-level survey responses and aggregated clinician-level EHR metrics.

Discussion

Our study aims to evaluate the impact of an EHR-developed generative AI chart summarization tool on cognitive task load in the ambulatory setting through a two-arm RCT over a 3-month period. Results from this study will inform UCLA Health business decision-making and will also explore whether a generative AI note summarization tool improves physician pre-charting time and reduces burnout.

There are several limitations to this study. The three-month intervention period is relatively short to adequately capture long-term effects of a chart summarization tool. Physician workload also varies by season and by clinic schedule (ranging from full-day to half-day clinics), which may affect providers’ ability to familiarize themselves with the tool. Another limitation is that participation is voluntary, limiting generalizability to providers who may be less inclined to engage with new technologies. Furthermore, this study will be conducted within a single academic health system using a specific EHR platform, which may limit the applicability of the results to other healthcare settings or systems.

This study may be subject to several potential biases. First, participants will not be blinded to their assignment. Second, the reliance on self-reported surveys for key secondary outcomes, such as physician burnout, introduces subjectivity that may not fully reflect the impact of a chart summarization tool. Additionally, for users who open the tool but interact with it minimally, usage may be under-captured, as usage is tracked based on clicks on items within the tool.

To our knowledge, this is the first RCT evaluating the impact of an EHR-developed generative AI tool. These findings will provide insights into its effect on efficiency, productivity and physician well-being in ambulatory care settings.

Article summary – Strengths and limitations of this study.

  • This study is a 3-month pragmatic randomized controlled trial evaluating a native EHR-embedded generative AI tool that summarizes prior clinical notes for ambulatory encounters.

  • The primary outcome uses a validated cognitive task load instrument adapted specifically for pre-charting activities.

  • Exploratory outcomes include objective EHR-derived time metrics, validated psychometric measures of burnout and professional fulfillment, and clinician-reported survey measures assessing perceived usefulness of the tool.

  • The trial is single-center, which may limit generalizability, and the intervention is optional-use and unblinded, which may attenuate observed effects and introduce performance bias.

Funding

This study received no external funding. Dr Mafi was supported by an NIH/NIA Paul B. Beeson Emerging Leaders Career Development Award in Aging (K76AG064392-01A1).

Footnotes

Trial Registration

ClinicalTrials.gov registration submitted Feb 20, 2026, and pending review at the time of submission.

Ethics Statement

This study was approved by the UCLA Institutional Review Board (IRB-26-0066). No patient-level data will be collected.

Competing interests

The authors declare no competing interests.

Data availability statement

The datasets generated and analyzed during this study will not be publicly available due to institutional restrictions and the use of internal EHR-derived metrics. De-identified, aggregated data may be made available from the corresponding author upon reasonable request and subject to institutional approvals.

References

  • 1.Ehrenfeld JM, Wanderer JP. Technology as friend or foe? Do electronic health records increase burnout? Current Opinion in Anesthesiology. 2018;31(3):357–60. [DOI] [PubMed] [Google Scholar]
  • 2.Arndt BG, Beasley JW, Watkinson MD, Temte JL, Tuan W-J, Sinsky CA, Gilchrist VJ. Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. The Annals of Family Medicine. 2017;15(5):419–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Gesner E, Gazarian P, Dykes P. The burden and burnout in documenting patient care: an integrative literature review. MEDINFO 2019: Health and wellbeing e-networks for all. 2019:1194–8. [Google Scholar]
  • 4.Peccoralo LA, Kaplan CA, Pietrzak RH, Charney DS, Ripp JA. The impact of time spent on the electronic health record after work and of clerical work on burnout among clinical faculty. Journal of the American Medical Informatics Association. 2021;28(5):938–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Bowman CA, Holzer H. EMR Precharting efficiency in internal medicine: a scoping review. Journal of Medical Education and Curricular Development. 2021;8:23821205211032414. [Google Scholar]
  • 6.Trang B. Epic launches AI Charting, potentially scrambling the ambient scribe market: STAT Health Care News; 2026. Available from: https://www.statnews.com/2026/02/04/epic-ai-charting-ambient-scribe-abridge-microsoft/.
  • 7.Silberlust J, Solanki P, Stevens ER, Genes N, Lim E, Sun K, et al. Artificial intelligence-generated encounter summaries: early insights from ambulatory clinicians at a large academic health system. JAMIA open. 2025;8(5):ooaf096. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Gerke S, Simon DA, Roman BR. Liability risks of ambient clinical workflows with artificial intelligence for clinicians, hospitals, and manufacturers. JCO Oncology Practice. 2025:OP-24-01060. [Google Scholar]
  • 9.Lukac PJ, Turner W, Vangala S, Chin AT, Khalili J, Shih Y-CT, et al. Ambient AI scribes in clinical practice: a randomized trial. NEJM AI. 2025;2(12):AIoa2501000. [Google Scholar]
  • 10.Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine. 2023;388(13):1233–9. [DOI] [PubMed] [Google Scholar]
  • 11.Shekar S, Pataranutaporn P, Sarabu C, Cecchi GA, Maes P. People Overtrust AI-Generated Medical Advice despite Low Accuracy. NEJM AI. 2025;2(6):AIoa2300015. [Google Scholar]
  • 12.Huang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems. 2025;43(2):1–55. [Google Scholar]
  • 13.Garcia P, Ma SP, Shah S, Smith M, Jeong Y, Devon-Sand A, et al. Artificial intelligence–generated draft replies to patient inbox messages. JAMA Network Open. 2024;7(3):e243201–e. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Zaretsky J, Kim JM, Baskharoun S, Zhao Y, Austrian J, Aphinyanaphongs Y, et al. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA network open. 2024;7(3):e240357–e. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Melnick ER, Harry E, Sinsky CA, Dyrbye LN, Wang H, Trockel MT, et al. Perceived electronic health record usability as a predictor of task load and burnout among US physicians: mediation analysis. Journal of medical Internet research. 2020;22(12):e23382. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Trockel M, Bohman B, Lesure E, Hamidi MS, Welle D, Roberts L, Shanafelt T. A brief instrument to assess both burnout and professional fulfillment in physicians: reliability and validity, including correlation with self-reported medical errors, in a sample of resident and practicing physicians. Academic Psychiatry. 2018;42(1):11–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Quigley DD, Elliott MN, Qureshi N, Predmore Z, Hays RD. Associations of the Consumer Assessment of Healthcare Providers and Systems (CAHPS) Clinician and Group Survey Scores with Interventions and Site, Provider, and Patient Factors: A Systematic Review of the Evidence. Journal of Patient Experience. 2024;11:23743735241283204. [Google Scholar]
  • 18.Carolus A, Koch MJ, Straka S, Latoschik ME, Wienrich C. MAILS-Meta AI literacy scale: Development and testing of an AI literacy questionnaire based on well-founded competency models and psychological change-and meta-competencies. Computers in Human Behavior: Artificial Humans. 2023;1(2):100014. [Google Scholar]
  • 19.Koch MJ, Carolus A, Wienrich C, Latoschik ME. Meta AI literacy scale: Further validation and development of a short version. Heliyon. 2024;10(21). [Google Scholar]



Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
