Abstract
This quality improvement study evaluates clinician perspectives on the usability and utility of a generative artificial intelligence (AI)–based large language model tool used to draft result comments for laboratory, imaging, and pathology results.
Introduction
The 21st Century Cures Act mandate for immediate release of test results to patients enhanced transparency but also led to patient anxiety.1,2 Patients prefer results explained directly by clinicians.2 Generative artificial intelligence (AI) provides an opportunity to enhance patient understanding of test results3 while reducing clinician burden and burnout.4,5 Stanford Health Care developed a novel generative AI–based large language model (LLM) tool to draft result comments for laboratory, imaging, and pathology results. This study evaluated clinician perspectives from a pilot implementation guided by the RE-AIM/PRISM framework,5,6 assessing usability, utility, and suggestions for tool improvement.
Methods
This prospective quality improvement study was conducted from January 13 through March 31, 2025, and followed the SQUIRE reporting guideline. The institutional review board (IRB) at Stanford University determined that this study met the criteria for quality improvement and was exempt from the need for IRB-mandated consent. Stanford Health Care developed an electronic health record–integrated tool for drafting result comments, similar in concept to a previously studied AI tool for drafting replies to patient inbox messages.5 Claude 3.5 Sonnet (Anthropic) was selected as the LLM for the pilot based on its superior response time, fidelity to prompt instructions, and ability to generate outputs resembling result comments written by clinicians. Primary care clinicians across both faculty and community practice networks were invited to participate in 2 waves staggered by 4 weeks, with postsurveys administered after 4 or 8 weeks of tool use. Surveys, guided by RE-AIM/PRISM, assessed usability and utility (5-point Likert scale) and perceived time impact per result comment (eMethods in Supplement 1). Descriptive statistics were used to summarize Likert responses; “strongly agree” and “agree” responses were combined to report survey results. Free-text survey responses were systematically analyzed by 2 researchers (S.J.S., K.M.) using deductive thematic coding followed by inductive theme identification during consensus reconciliation. Comments were segmented into phrases; assigned a positive, negative, or neutral sentiment; and summarized. Phrases were allowed to have multiple codes. Microsoft Excel was used for analysis.
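For readers who wish to reproduce the descriptive analysis, the sketch below illustrates one way to combine “strongly agree” and “agree” Likert responses into a single proportion and to tally coded sentiments by theme. The study used Microsoft Excel; this Python version is illustrative only, and the file names, column names, and Likert item labels are hypothetical.

```python
# Minimal sketch (not the authors' code) of the descriptive analysis described above.
# Assumes a hypothetical CSV export of postsurvey responses and a second CSV of
# coded free-text phrases; all names below are placeholders.
import pandas as pd

# One row per respondent, one column per Likert item.
responses = pd.read_csv("postsurvey_responses.csv")

LIKERT_ITEMS = ["easy_to_use", "useful_for_lab", "useful_for_imaging",
                "improved_efficiency", "higher_quality_explanations"]

def pct_agree(series: pd.Series) -> float:
    """Report 'Strongly agree' and 'Agree' as one combined percentage."""
    agree = series.isin(["Strongly agree", "Agree"]).sum()
    return 100 * agree / series.notna().sum()

summary = {item: round(pct_agree(responses[item]), 1) for item in LIKERT_ITEMS}
print(summary)  # e.g., {'easy_to_use': 84.9, ...}

# One row per coded phrase, with 'theme' and 'sentiment' assigned during
# consensus reconciliation; counts mirror the structure of the Table.
coded = pd.read_csv("coded_phrases.csv")
counts = coded.groupby(["theme", "sentiment"]).size().unstack(fill_value=0)
print(counts)
```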
Results
Of 244 clinicians who used the tool at least once, 93 (38.1%) completed postsurveys (62 [66.7%] female; 31 [33.3%] male; 46 [49.5%] with ≥15 years in practice after training). Clinicians reported favorable usability (79 [84.9%] found the tool easy to use) and utility, particularly for laboratory (67 [72.0%]) and imaging (59 [63.4%]) results, as well as improved efficiency (66 [71.0%]) and higher-quality explanations (67 [72.0%]) (Figure). Over half reported using the tool frequently (54 [58.1%]) and considered it ready for broad implementation (50 [53.8%]). Most (77 [82.8%]) anticipated long-term use, and 39 (41.9%) felt motivated to send result comments to patients more frequently. Median perceived time savings per result was 1.1 minutes (range, 5.0 minutes saved to 3.3 minutes of additional time spent).
Figure. Postsurvey Likert Scale Results From 93 Respondents.

Themes from free-text responses included positive sentiments regarding tool utility and patient engagement and negative sentiments related to content accuracy and completeness (Table). Clinicians offered specific suggestions for tool optimization (eg, including more patient context from recent visit notes) and improved workflow integration (eg, updating drafts sequentially for results released over time).
Table. Qualitative Encoding of Free-Text Comments From Surveys.
| Theme | Representative quotations | Negative comments, No. | Neutral comments, No. | Positive comments, No. | Total comments, No. |
|---|---|---|---|---|---|
| Overall tool utility | | 2 | 3 | 20 | 25 |
| Content accuracy | | 13 | 1 | 2 | 16 |
| Content completeness | | 6 | 1 | 4 | 11 |
| Impact on workflow | | 3 | 1 | 6 | 10 |
| Draft result comment length | | 6 | 0 | 1 | 7 |
| Future use and readiness for scale | | 2 | 0 | 5 | 7 |
| Impact on patient engagement | | 1 | 0 | 6 | 7 |
| Impact on time and efficiency | | 1 | 1 | 4 | 6 |
| Voice and tone | | 1 | 1 | 2 | 4 |
| Utility specific to imaging results | | 0 | 1 | 2 | 3 |
| Utility specific to lab results | | 3 | 0 | 0 | 3 |
| Utility specific to pathology results | Neutral: “I’m not sure whether it works well for pathology—not enough experience!” | 0 | 2 | 0 | 2 |
| Total | NA | 38 | 11 | 53 | 102 |
Abbreviations: AI, artificial intelligence; NA, not applicable.
Discussion
This study demonstrated the utility of a generative AI tool for drafting test result explanations, highlighting ease of use, improved efficiency, and higher-quality explanations. Barriers to adoption included content accuracy and completeness. Limitations include selection bias, limited generalizability beyond primary care, and underrepresentation of certain test types (eg, pathology). While these early results suggest that AI-generated draft result comments could help reduce clinician burden and enhance patient experience, further optimization grounded in clinician feedback is needed to improve accuracy and completeness of draft explanations. Additional improvements should focus on optimizing prompts, updating the LLM, incorporating patient-specific context, and streamlining workflow integration. Future evaluations should quantify impacts on clinician inbox burden (time spent, message volume) and consider patient perspectives.
eMethods. Survey
Data Sharing Statement
References
1. Steitz BD, Turer RW, Lin CT, et al. Perspectives of patients about immediate access to test results through an online patient portal. JAMA Netw Open. 2023;6(3):e233572. doi:10.1001/jamanetworkopen.2023.3572
2. Lustria MLA, Aliche O, Killian MO, He Z. Enhancing patient engagement and understanding: is providing direct access to laboratory results through patient portals adequate? JAMIA Open. 2025;8(2):ooaf009. doi:10.1093/jamiaopen/ooaf009
3. He Z, Bhasuran B, Jin Q, et al. Quality of answers of generative large language models versus peer users for interpreting laboratory test results for lay patients: evaluation study. J Med Internet Res. 2024;26:e56655. doi:10.2196/56655
4. Holmgren AJ, Downing NL, Tang M, Sharp C, Longhurst C, Huckman RS. Assessing the impact of the COVID-19 pandemic on clinician ambulatory electronic health record use. J Am Med Inform Assoc. 2022;29(3):453-460. doi:10.1093/jamia/ocab268
5. Garcia P, Ma SP, Shah S, et al. Artificial intelligence-generated draft replies to patient inbox messages. JAMA Netw Open. 2024;7(3):e243201. doi:10.1001/jamanetworkopen.2024.3201
6. Chan SL, Lee JW, Ong MEH, et al. Implementation of prediction models in the emergency department from an implementation science perspective-determinants, outcomes, and real-world impact: a scoping review. Ann Emerg Med. 2023;82(1):22-36. doi:10.1016/j.annemergmed.2023.02.001