This study evaluates a large language model–powered multiagent system, aligned with physician preferences, for improving the accuracy, relevance, and efficiency of order set optimization in clinical decision support.
Key Points
Question
What is the utility of a large language model–powered multiagent system in generating suggestions to optimize order sets compared with expert evaluation?
Findings
In this cohort study including 735 suggestions for 71 order sets, 96 suggestions for 9 order sets were evaluated by 3 physicians, and 639 suggestions for 62 order sets were evaluated by 1 physician. The median number of useful suggestions per order set was 2 in both evaluations. Among the 96 suggestions, 54% were rated as highly accurate (score ≥4), while 19% were rated as highly useful, 16% as feasible, and 12% as having a direct impact.
Meaning
Results of this study suggest that multiagent systems offer a scalable and effective approach to enhancing order set optimization.
Abstract
Importance
Optimizing order sets is vital to enhance clinical decision support and improve patient care. Manual review is resource intensive and cannot identify potential improvements in order sets in a timely manner.
Objective
To develop and evaluate the utility of a large language model (LLM)–powered multiagent system in optimizing order sets.
Design, Setting, and Participants
A multiagent system comprising agents for content critique, dynamic search, knowledge retrieval, medication verification, and suggestion summarization was developed and evaluated between January 1, 2024, and December 31, 2024, at Vanderbilt University Medical Center (VUMC). In experiment 1, the system generated 735 suggestions for 71 order sets; suggestions for 9 order sets were assessed by 3 physicians, and suggestions for the remaining 62 order sets were assessed by 1 physician. In experiment 2, an LLM-as-a-judge approach was implemented to align generated suggestion usefulness scores with expert preferences, and a customized filter was developed to further refine the system's performance.
Main Outcomes and Measures
The ratings of accuracy, usefulness, feasibility, and impact; interrater agreement; and alignment with historical ordering data.
Results
In evaluation 1 of experiment 1, the median values for the number of suggestions scoring 4 or higher at the order set level were 5 (IQR, 5-6) for the metrics of accuracy, 2 (IQR, 1-4) for usefulness, 1 (IQR, 0-3) for feasibility, and 1 (IQR, 0-2) for impact. Of 96 suggestions, 44 (46%; 95% CI, 36%-56%) aligned with historical ordering patterns. In evaluation 2 of experiment 1, 639 suggestions were generated for 62 order sets; 52 order sets had at least 1 useful suggestion, with a median of 2 (IQR, 1-3) useful suggestions. Overall, 122 suggestions (19%; 95% CI, 16%-22%) were rated as useful. After expert alignment, Cohen κ improved from 0.06 to 0.41. Filtering using the aligned scores reduced total suggestions by 29% while retaining 92% of useful suggestions.
Conclusions and Relevance
In this cohort study of an LLM-powered multiagent system for optimizing order sets, leveraging LLMs and multiagent systems provided a scalable approach. Alignment with a small set of expert ratings significantly enhanced the LLM evaluation. Future research could refine reasoning capabilities and integrate useful suggestions into electronic health records, while engaging end-users as artificial intelligence–supported reviewers.
Introduction
Widespread implementation of electronic health records (EHRs) has led to the expansion of clinical decision support (CDS) systems.1,2 An important part of CDS is order sets within computerized clinician order entry systems, which are organized groups of orders or procedures compiled in 1 place, usually tailored to a specific condition, clinical process, or situation, such as a postoperative order set for knee arthroplasty.3,4 Well-designed order sets could improve efficiency in ordering, reduce errors, and improve adherence to clinical guidelines.5,6,7,8
Despite the benefits, order sets, similar to other CDS tools, also need systematic management and monitoring tools to detect malfunctions effectively.9 Currently, CDS experts often rely on third-party literature services for updates to clinical guidelines, introducing delays and missed updates. In addition, clinical evidence and practice shift over time.10 New drugs are introduced or new indications are found for existing drugs11; disease names and classifications are updated or revised12,13; and clinical procedures are added, changed, or reorganized.14,15,16,17 As an example, Vanderbilt University Medical Center (VUMC) currently maintains 1496 order sets for various clinical scenarios. Manual review of order sets to ensure they match up-to-date evidence is a resource-intensive process.
Large language models (LLMs) trained on large amounts of text have shown strong text-processing capabilities.18 In CDS, LLMs have been successfully used to critique alert criteria and summarize user comments on alerts.19,20 Multiagent systems consist of intelligent agents that collaborate to solve complex problems, and the capabilities of LLM agents are enhanced through iterative feedback and teamwork.21,22 They have demonstrated good performance in health monitoring and privacy-preserving clinical data sharing.21,23,24 This study aimed to develop and evaluate an LLM-powered multiagent system aligned with expert preferences to improve the accuracy, relevance, and efficiency of order set optimization in CDS.
Methods
This cohort study was reported following the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines.25 This study was performed at VUMC and approved as exempt from informed patient consent by the Vanderbilt Institutional Review Board because it involved secondary analysis of deidentified data and posed minimal risk to participants. The development of the multiagent system and the expert evaluations were performed in 2024. For the validation analysis, historical order data were extracted for the period between January 1 and December 31, 2024. We used Generative Pre-trained Transformer 4o (GPT-4o; OpenAI),26 where “o” denotes “omni,” deployed in a Microsoft Azure–hosted environment (Azure; Microsoft) approved for protected health information. Two experiments were conducted: (1) developing and evaluating a multiagent system to generate order set optimization suggestions and (2) developing and evaluating a customized filter to align these suggestions with expert preferences.
Development and Evaluation of a Multiagent System
Multiagent System Development
We developed an LLM-powered multiagent system comprising 5 agents: a content critic agent, a dynamic search agent, a knowledge retrieval agent, a medication verification agent, and a suggestion summarizer agent (Figure 1). This architecture uses a retrieval-augmented generation process, in which agents retrieve current external information to ground suggestions in clinical evidence. This design overcomes the knowledge cutoff of the base LLM, enabling adaptation to future medical changes without retraining. Our approach did not involve fine-tuning the base GPT-4o model; instead, we used prompt engineering, giving each agent a specific role and a detailed set of instructions. The agent interactions were designed as a structured sequential conversation facilitated by a manager agent, which selected a designated speaker at each step; that speaker’s output was then broadcast to all other agents to inform the next stage of the process. We implemented the multiagent conversation framework using an open-source tool (AutoGen; Microsoft).27
Figure 1. Overview of the Multiagent System Architecture and Evaluation Workflow.
The diagram illustrates the development and evaluation of the multiagent system. During development, 45 order sets were used. The system architecture consists of 5 agents—content critic, dynamic search, medication verification, knowledge retrieval, and suggestion summarizer—which interact to generate suggestions. The evaluation was conducted in 2 phases. Evaluation 1 involved 9 order sets, generating 96 suggestions rated by 3 physicians on accuracy, feasibility, usefulness, and impact. Evaluation 2 expanded to 62 order sets and 639 suggestions, with 1 physician identifying the useful ones. API indicates application program interface; NLM, National Library of Medicine.
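As a concrete illustration, the following is a minimal sketch of this manager-orchestrated group conversation using the AutoGen 0.2-style Python API; the agent names mirror the architecture, but the system messages, model configuration, and round limit are illustrative stand-ins for the actual prompts in eAppendix 1 in Supplement 1.

```python
import autogen

# Credentials are omitted; in the study, GPT-4o was accessed through an
# Azure-hosted deployment, which would add endpoint fields here.
llm_config = {"config_list": [{"model": "gpt-4o"}]}

def make_agent(name: str, system_message: str) -> autogen.AssistantAgent:
    """Create one role-specialized agent from a name and a system prompt."""
    return autogen.AssistantAgent(name=name, system_message=system_message,
                                  llm_config=llm_config)

agents = [
    make_agent("content_critic", "Critique each order for accuracy and relevance."),
    make_agent("dynamic_search", "Find and save current guidelines and articles."),
    make_agent("knowledge_retrieval", "Ground critiques in retrieved evidence."),
    make_agent("medication_verification", "Verify drug classes and availability."),
    make_agent("suggestion_summarizer", "Synthesize a final list of suggestions."),
]

# A manager selects the speaker at each step; every message is broadcast
# to the whole group, mirroring the structured sequential conversation.
group_chat = autogen.GroupChat(agents=agents, messages=[], max_round=12)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

user = autogen.UserProxyAgent("cds_expert", human_input_mode="NEVER",
                              code_execution_config=False)
user.initiate_chat(manager, message="Review this order set: <order set text>")
```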
To optimize both the prompts and the agent interactions, we developed a refinement process using 45 VUMC order sets that were manually altered. For each set, we made controlled modifications—removing 1 correct medication and adding 1 unrelated medication—to create a reference standard with 2 expected suggestions. We then processed these sets through the multiagent system, reviewed the output, and iteratively adjusted the prompts and conversational flow to refine the system’s performance. The final prompts for each agent are available in eAppendix 1 in Supplement 1.
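A minimal sketch of this controlled-modification step follows; the order set representation, medication pool, and the helper name `perturb_order_set` are illustrative, not the study's actual tooling.

```python
import random

def perturb_order_set(order_set: dict, unrelated_meds: list[str],
                      seed: int = 0) -> dict:
    """Remove 1 correct medication and add 1 unrelated one, recording the
    2 expected corrective suggestions as the reference standard."""
    rng = random.Random(seed)
    meds = list(order_set["medications"])
    removed = rng.choice(meds)          # the 1 correct medication to drop
    meds.remove(removed)
    added = rng.choice(unrelated_meds)  # the 1 unrelated medication to inject
    meds.append(added)
    return {
        "title": order_set["title"],
        "medications": meds,
        "reference_standard": [f"Add {removed}", f"Remove {added}"],
    }

modified = perturb_order_set(
    {"title": "TKA postoperative orders",
     "medications": ["enoxaparin", "acetaminophen", "oxycodone"]},
    unrelated_meds=["latanoprost", "isotretinoin"],
)
print(modified["reference_standard"])  # eg, ['Add oxycodone', 'Remove latanoprost']
```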
The content critic agent optimized order sets for clinical accuracy and relevance. It first reviewed the title to identify the disease and clinical scenario (eg, ambulatory, inpatient care) addressed by the order set. The agent then evaluated each included order to understand its structure and intended actions. For medication orders, it ensured that all relevant medications were explicitly listed by name, adding specific drugs if a general category (eg, pain medications) was used, and suggesting additional medications when necessary. Additionally, the agent identified outdated or inappropriate items. Through this detailed review process, the content critic agent enhanced the order set’s clinical accuracy using its own LLM. This initial analysis was subsequently combined with and validated against timely evidence provided by the knowledge retrieval agent.
The dynamic search agent played a key role in ensuring order sets were informed by the latest clinical knowledge. Using tools such as PubMed search and JavaScript, the agent retrieved full-text articles and clinical guidelines relevant to the order sets. These resources were saved in designated folders, enabling seamless follow-up retrieval and analysis by the knowledge retrieval agent.
The knowledge retrieval agent retrieved content from various sources, including Journal Watch from the New England Journal of Medicine, which summarizes newly published guidelines and articles from over 250 journals28; we extracted all summaries from January 2022 to July 2024. Other sources included Pocket Medicine (7th edition), a reliable reference for accurate diagnosis and treatment planning in internal medicine compiled by physicians at Massachusetts General Hospital.29 We also included StatPearls, a clinical support tool containing articles on diseases, drugs, and procedures, extracted from the National Center for Biotechnology Information Bookshelf.30 Another source was the Vanderbilt Internal Medicine Residency Handbook (VIMBook), a yearly updated, peer-reviewed guide offering systems-based practice guidance at VUMC.31 The last source was PubMed articles and clinical guidelines previously saved by the dynamic search agent. This agent used the Chroma vector database to retrieve the 5 most relevant documents based on semantic similarity, enabling it to generate targeted, evidence-based suggestions grounded in comprehensive, current clinical knowledge.
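The retrieval step can be sketched with the Chroma Python client as follows; the collection name, example documents, and metadata fields are illustrative, and in practice the full knowledge base of Table 1 would be indexed with a persistent client.

```python
import chromadb

client = chromadb.Client()  # in-memory; a persistent client would be used in practice
collection = client.create_collection("order_set_knowledge")

# Index the knowledge base (Journal Watch summaries, StatPearls articles, etc.).
collection.add(
    ids=["jw-001", "sp-001"],
    documents=[
        "Guideline update: apixaban as an option for postoperative VTE prophylaxis.",
        "StatPearls: venous thromboembolism prophylaxis after knee arthroplasty.",
    ],
    metadatas=[{"source": "NEJM Journal Watch"}, {"source": "StatPearls"}],
)

# Retrieve the most semantically similar documents for an order-set query.
results = collection.query(
    query_texts=["total knee arthroplasty postoperative anticoagulation orders"],
    n_results=2,  # the study retrieved the 5 most relevant documents
)
print(results["documents"][0])
```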
The medication verification agent used the RxNav application program interface (API) to extract medication class information and the Bing Search API to confirm current market availability. Finally, the suggestion summarizer agent synthesized the inputs from all agents to produce a final list of recommendations with confidence and importance scores.
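The medication verification calls can be sketched against the public RxNav REST API as follows; the endpoints are published by the National Library of Medicine, while the helper names, the choice of ATC as the class source, and the error handling are assumptions for illustration (the Bing Search availability check is omitted).

```python
import requests

BASE = "https://rxnav.nlm.nih.gov/REST"

def get_rxcui(drug_name: str) -> str | None:
    """Resolve a drug name to its RxNorm concept identifier (RxCUI)."""
    r = requests.get(f"{BASE}/rxcui.json", params={"name": drug_name}, timeout=10)
    r.raise_for_status()
    ids = r.json().get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None

def get_drug_classes(drug_name: str) -> list[str]:
    """Look up medication classes for a drug name via RxClass (ATC here)."""
    r = requests.get(f"{BASE}/rxclass/class/byDrugName.json",
                     params={"drugName": drug_name, "relaSource": "ATC"},
                     timeout=10)
    r.raise_for_status()
    info = r.json().get("rxclassDrugInfoList", {}).get("rxclassDrugInfo", [])
    return sorted({i["rxclassMinConceptItem"]["className"] for i in info})

print(get_rxcui("apixaban"))
print(get_drug_classes("apixaban"))
```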
Multiagent System Evaluation
In evaluation 1, 3 EHR software (Epic; Epic Systems)–certified physician builders (S.S.H., A.P.W., S.H.) rated suggestions from 9 order sets (selected based on 2023-2024 usage frequency) on a 1 to 5 Likert scale for accuracy, usefulness, feasibility, and impact, plus yes/no questions on redundancy and hallucination. The definition for each metric is provided in eTable 1 in Supplement 1. Informed consent (oral) was obtained from physician experts. No participants were lost to follow-up. We also analyzed historical data to understand how suggestions aligned with clinical practice. We extracted order data from the 2024 calendar year for encounters where each order set was used and calculated the 25th to 75th percentile usage counts for all associated orders. We then used this to assess alignment: a suggestion to add an item was considered aligned with frequent use if its count was above the 75th percentile, while a removal suggestion was considered aligned if its count was below the 25th percentile. Additionally, we performed a sensitivity analysis to determine the number of aligned suggestions across different percentile combinations (eg, 10th-90th, 35th-65th). In evaluation 2, 1 physician (S.S.H.) reviewed all suggestions for another set of 62 order sets to identify useful ones.
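The historical alignment check can be sketched as follows; the data frame layout, column names, and counts are illustrative stand-ins for the extracted 2024 order data.

```python
import pandas as pd

# One row per order: its 2024 usage count in encounters where the set was used.
usage = pd.DataFrame({
    "order_name": ["apixaban", "enoxaparin", "fax chart"],
    "count": [480, 1200, 3],
})
p25, p75 = usage["count"].quantile([0.25, 0.75])

def is_aligned(suggestion_type: str, order_name: str) -> bool:
    """Flag add suggestions above the 75th percentile and remove
    suggestions below the 25th percentile as aligned with practice."""
    counts = usage.loc[usage["order_name"] == order_name, "count"]
    count = int(counts.iloc[0]) if not counts.empty else 0
    if suggestion_type == "add":     # frequently ordered alongside the set
        return count > p75
    if suggestion_type == "remove":  # rarely ordered when the set is used
        return count < p25
    return False

print(is_aligned("add", "enoxaparin"), is_aligned("remove", "fax chart"))
```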
Development and Evaluation of a Customized Filter for Aligning Generated Suggestions With Expert Preferences
This filter functions as a postprocessing layer to evaluate and rank suggestions after they have been generated. The filter was constructed and evaluated in the following steps.
The first step was reference standard creation: a physician reviewed all 639 suggestions from evaluation 2 and assigned a binary rating (1 for useful, 0 for not useful). The second step was LLM-as-a-judge alignment and scoring. We used an LLM-as-a-judge approach (prompt in eAppendix 2 in Supplement 1) (Figure 2).32 A separate GPT-4o model was prompted to act as a judge and score the usefulness of each suggestion. To align this judge with expert preferences, we used few-shot prompting, in which annotated examples are provided in the prompt to guide the model’s responses, drawing from the 96 physician-annotated cases (evaluation 1). This postalignment judge then provided a new usefulness score for each suggestion. The third step was filter construction: we trained a logistic regression model using the postalignment usefulness score as the predictor and the physician’s binary rating (from step 1) as the outcome. The fourth step was filter application and evaluation: the trained model was applied as a filtering mechanism, and by adjusting its probability threshold, we could filter out suggestions with a low probability of being deemed useful by an expert. A comprehensive description of the model, threshold selection rationale, and uncertainty quantification is provided in eAppendix 4 in Supplement 1.
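Steps 3 and 4 can be sketched with scikit-learn as follows; the training arrays are illustrative stand-ins for the 639 judge scores and physician ratings, and the 0.11 threshold anticipates the value reported in the Results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Postalignment judge scores (predictor) and physician ratings (outcome).
scores = np.array([[10], [35], [60], [80], [15], [90], [40], [70]])  # illustrative
useful = np.array([0, 0, 1, 1, 0, 1, 0, 1])                          # illustrative

clf = LogisticRegression().fit(scores, useful)

def keep(judge_scores: np.ndarray, threshold: float = 0.11) -> np.ndarray:
    """Return a boolean mask of suggestions whose predicted probability
    of being rated useful by an expert clears the filter threshold."""
    return clf.predict_proba(judge_scores)[:, 1] >= threshold

print(keep(np.array([[20], [75]])))  # eg, [False  True]
```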
Figure 2. Workflow for Aligning AI-Generated Suggestions With Expert Physician Preferences.
This flowchart details the LLM-as-a-judge process for creating a customized filter. An LLM first assesses the usefulness of 639 suggestions from 62 order sets. These initial scores are then refined through an expert preference alignment process, which uses physician ratings and comments from a smaller dataset (96 suggestions). The agreement between the LLM's pre- and postalignment scores and the physician's ratings is measured using Cohen κ. This process results in a customized filter designed to retain a higher proportion of suggestions deemed useful by experts. AI indicates artificial intelligence; LLM, large language model.
Statistical Analysis
The Mann-Whitney U test was used to compare the pre- and postalignment usefulness scores against the physician’s evaluations.33 To measure the level of agreement between the LLM’s scores and the physician’s ratings, we calculated Cohen κ, a robust metric that accounts for agreement occurring by chance.34 Furthermore, logistic regression analysis was conducted to model the association between the postalignment usefulness scores and the physician’s binary usefulness rating from experiment 2 (outcome variable). Statistical analyses were conducted in Python version 3.11 (Python Software Foundation) using the packages scipy.stats, statsmodels, and scikit-learn, with a significance threshold of P < .001 (2-sided).
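A minimal sketch of these comparisons using the named packages follows; the arrays are illustrative stand-ins for the judge scores and physician ratings.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Judge scores split by the physician's binary rating (illustrative values).
scores_useful = np.array([72, 85, 64, 90, 77])
scores_not_useful = np.array([40, 55, 48, 33, 61])

# Do judge scores differ between useful and not-useful suggestions?
u_stat, p_value = mannwhitneyu(scores_useful, scores_not_useful,
                               alternative="two-sided")

# Chance-corrected agreement between binarized judge scores and ratings.
judge_binary = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])
physician_binary = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 1])
kappa = cohen_kappa_score(judge_binary, physician_binary)

print(f"U={u_stat:.1f}, P={p_value:.4f}, kappa={kappa:.2f}")
```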
Results
Development and Evaluation of a Multiagent System
The knowledge base included 51 562 594 words (Table 1). From the 45 manually modified order sets, the system correctly generated suggestions for 41 (91%; 95% CI, 83%-99%) of the removed medications and 38 (84%; 95% CI, 74%-95%) of the added medications, for an overall accuracy of 88% (79 of 90 expected suggestions) on this development dataset, which we used to finalize the system’s prompt engineering before the main evaluation.
Table 1. Data Sources Used in the Knowledge Retrieval Agent.
Source | Articles or pages, No. | Words, No.
---|---|---
NEJM Journal Watch (General Medicine), January 2022–July 2024 | 1275 articles | 340 865
NEJM Journal Watch (Guideline Watch), January 2022–July 2024 | 90 articles | 43 156
StatPearls^a | 9378 articles | 50 919 241
Pocket Medicine | 522 pages | 133 903
Vanderbilt Internal Medicine Residency Handbook (VIMBook) | 869 pages | 125 429

Abbreviation: NEJM, New England Journal of Medicine.
^a Represents the entire corpus downloaded from the National Center for Biotechnology Information Bookshelf.
In evaluation 1 of experiment 1, across the 9 most frequently used order sets, the median number of generated suggestions per order set was 10 (IQR, 8-12). The medians for suggestions scoring 4 or higher were 5 (IQR, 5-6) for accuracy, 2 (IQR, 1-4) for usefulness, 1 (IQR, 0-3) for feasibility, and 1 (IQR, 0-2) for impact. eTable 2 in Supplement 1 details the number of suggestions scoring 4 or more for each metric.
For the 96 suggestions generated for the initial 9 order sets, a high proportion were rated as accurate (54% scoring ≥4), while fewer were rated as highly useful (19%), feasible (16%), or as having a direct impact (12%). Figure 3 illustrates the distribution of these ratings across metrics. No suggestions were identified as hallucinations, although 11% were redundant. There were significant differences among rater scores (P < .001); for example, rater 1 and rater 3 most frequently agreed on a score of 2 (n = 229), whereas rater 1 assigned a score of 4 in 44 cases that rater 3 scored as 2. Confusion matrices for pairwise comparisons are provided in eFigure 1 in Supplement 1. Our analysis found that of 96 suggestions, 44 (46%; 95% CI, 36%-56%) aligned with historical ordering patterns (eg, a suggestion to add apixaban to an order set aligned with data showing that apixaban was often ordered separately when that order set was used). A sensitivity analysis showed the number of aligned suggestions ranging from 18 (19%) at a 10th-90th percentile threshold to 39 (41%) at a 30th-70th percentile threshold (eTable 3 in Supplement 1). The estimated cost of analyzing a single order set ranged from $0.29 to $0.58, depending on the complexity of the order set and the number of agent interactions required.
Figure 3. Physician Ratings of AI-Generated Suggestions Across 4 Key Metrics.
This divergent bar chart displays the distribution of ratings from physicians for 96 artificial intelligence (AI)–generated suggestions on a 5-point scale from “strongly disagree” to “strongly agree.” The suggestions are evaluated across 4 metrics: impact, feasibility, usefulness, and accuracy.
In evaluation 2 of experiment 1, the multiagent system generated 639 suggestions across 62 order sets, with 52 sets receiving at least 1 useful suggestion. The median number of useful suggestions per order set was 2 (IQR, 1-3); 4 order sets had up to 5 useful suggestions, and 8 order sets had 4. Overall, 122 suggestions (19%; 95% CI, 16%-22%) were rated as useful. Table 2 lists examples of useful suggestions. For example, we submitted a ticket to modify the total knee arthroplasty postoperative focused orders order set based on the suggestion “Add apixaban (Eliquis) as an alternative anticoagulant option.” A detailed breakdown of these suggestions by clinical scenario and suggestion type is provided in eTable 4 and eTable 5 in Supplement 1, respectively.
Table 2. Examples of Useful Suggestions.
Order set | Generated suggestion | Usefulness | Generated rationale
---|---|---|---
Total knee arthroplasty postoperative focused orders | Add apixaban (Eliquis) as an alternative anticoagulant option. | 4/5^a | Apixaban is a well-tolerated anticoagulant that can be used as an alternative to other anticoagulants in postoperative care.
Adult hemodialysis orders | Evaluate the necessity of sodium bicarbonate 8.4% injection for blood pressure support, as it is typically used for metabolic acidosis. | 3/5^a | Sodium bicarbonate is not typically used for blood pressure support, and its inclusion should be reviewed for appropriateness.
Adult post–lung transplant routine clinic protocol | Add glycated hemoglobin for patients with diabetes or at risk of developing diabetes due to immunosuppressive therapy. | 4/5^a | Monitoring long-term glucose control is important in managing diabetes.
Neurology admission orders | Consider adding a lipid panel for patients at risk for stroke or cardiovascular disease. | 4/5^a | A lipid panel can help assess cardiovascular risk, which is relevant for neurology patients.
Neurology admission orders | Consider adding carotid ultrasound for patients at risk for stroke or transient ischemic attacks. | 4/5^a | Carotid ultrasound can help assess stroke risk and guide management decisions.
Low back pain care path—chronic | Add acetaminophen to the medication list. | 1^b | Acetaminophen is often used as a first-line treatment for pain management and can be used in conjunction with NSAIDs.

Abbreviation: NSAID, nonsteroidal anti-inflammatory drug.
^a Usefulness rated on a 1-5 Likert scale (from evaluation 1).
^b Usefulness rated on a binary scale, where 1 indicates useful (from evaluation 2).
Development and Evaluation of a Customized Filter for Aligning Generated Suggestions With Expert Preferences
Before alignment, the Mann-Whitney U test showed no significant association between LLM-generated usefulness scores and rater ratings; after alignment, a significant association was found between the updated scores and the ratings. Cohen κ improved from 0.06 (poor agreement) to 0.41 (moderate agreement) after alignment (eFigure 2 in Supplement 1). In the logistic regression model, the association between updated usefulness scores and rater ratings was significant (χ²₁ = 71.36), and the updated usefulness score was positively associated with the likelihood of a useful rating (β = 0.05; 95% CI, 0.03-0.06). Using this model as a filter, a probability threshold of 0.11 retained 453 suggestions (a 29% reduction) while preserving 92% of useful suggestions. Detailed results are shown in eTable 6 in Supplement 1.
Discussion
The results of this cohort study suggest the feasibility of an LLM-powered multiagent system for optimizing order sets. The primary value of this system is its ability to provide a systematic and scalable evidence-grounded foundation for a task that is traditionally a resource-intensive manual review. By automatically generating a targeted list of suggestions, the system shifts the expert’s role from manual discovery to efficient validation, addressing the challenges of scale and currency inherent in the manual process.
This system allowed for customization by integrating external resources based on specific needs and tailoring prompts or system architecture to achieve diverse optimization goals. Such flexibility would enable CDS experts at other institutions to adapt the multiagent system to their local conditions. The current manual process of optimizing order sets is highly resource intensive: although experts might be able to identify incorrect items in a long order set, determining what is missing is far more challenging, particularly given the continual updates in clinical evidence and the complexity of order sets in both content and structure. A multiagent system, however, can systematically review every item in every order set and compare it with external evidence, regardless of the number of order sets.
This study demonstrated the ability of an LLM-powered multiagent system to generate suggestions that are relevant to patient care. For example, in the total knee arthroplasty postoperative focused orders order set, the system suggested adding apixaban as an alternative anticoagulant option. This was rated as highly useful because it is clinically appropriate, aligns with current evidence, and represents a direct actionable improvement to the order set. As a contrasting example, the suggestion to add a baseline laboratory panel to the behavioral health admission orders received a low usefulness rating. Reviewer feedback pointed out the suggestion’s ambiguity; it was unclear whether this particular order matched the correct clinical workflow for users of the order set. Additionally, some suggestions, while less directly useful, could inspire experts and might also be highly relevant to patient care. For instance, in the adult post–lung transplant routine clinic protocol order set, the system suggested considering medications such as metformin or insulin for managing posttransplant diabetes. One physician noted that, although prescribing metformin might not be directly applicable, the suggestion raised a question about whether glycated hemoglobin testing should be included in the order set.
The modest useful suggestion rate (19%) highlights a key challenge: many suggestions, while factually correct, lacked the specific clinical context to be deemed useful. This result is partly due to the system’s design, which prioritized generating a larger pool of suggestions to avoid omitting valuable insights. It may also reflect order set complexity and maturity: frequently used sets have often undergone multiple rounds of manual optimization, leaving limited room for further improvement. Future research could therefore focus on integrating a deeper understanding of clinical workflows and institutional priorities to better align technical accuracy with practical utility. Importantly, the retrieval-augmented generation architecture was likely a factor associated with the absence of factual hallucinations.
Relying solely on an LLM to evaluate suggestion usefulness is insufficient. Optimizing order sets is a complex task requiring a deep understanding of clinical workflows and domain-specific knowledge. Physicians noted difficulties rating suggestions for highly specific scenarios, which contributed to discrepancies between LLM scores and their ratings; this matches the initial inconsistency we observed between LLM-generated usefulness scores and physician ratings. Our findings showed an association between incorporating a small set of physician ratings and the accuracy of LLM-generated usefulness scores: after alignment, Cohen κ improved from poor to moderate agreement across a large dataset. This result is consistent with previous findings showing enhanced GPT-4o performance in a triage task after expert alignment.35
A proposed clinical implementation model, illustrated by our prototypes in eFigures 3 and 4 in Supplement 1, was designed to address practical challenges. The proposed workflow integration involves periodic automated suggestion generation, which a CDS expert triages and assigns to relevant physician specialists for an efficient asynchronous review. A step-by-step description is available in eAppendix 3 in Supplement 1.
Limitations
This study has several limitations. Because it was conducted at a single academic center, the generalizability of the findings may be limited, although the use of a widely deployed EHR infrastructure enhances technical transferability. The high-usage order sets selected may not be representative, and the small number of physician builder evaluators may not reflect all clinician perspectives. Our design did not systematically quantify false negatives, nor did we directly evaluate the traceability of suggestions to their sources. Future research could compare this multiagent architecture with a single-agent system, enhance the reasoning capabilities and knowledge base,36 and implement the framework detailed in eAppendix 5 in Supplement 1 to quantitatively measure suggestion faithfulness, building clinician trust and facilitating EHR integration.
Conclusions
In this cohort study, leveraging LLMs and multiagent systems provided a systematic and scalable approach to optimizing order sets but faced challenges in addressing specific clinical needs. Alignment with a small set of expert ratings was associated with stronger LLM evaluation capabilities. Future research could focus on refining reasoning capabilities, expanding the knowledge base, and facilitating the integration of useful suggestions into EHRs, while actively engaging end-users of order sets as human reviewers supported by artificial intelligence.
Supplementary Materials
eTable 1. Definitions for suggestion rating criteria
eFigure 1. Confusion matrices for pairwise comparisons
eAppendix 1. Prompts for each agent in the multi-agent system
eAppendix 2. Prompts for evaluating the usefulness of suggestions
eTable 2. Number of suggestions scoring 4 or higher for each metric at the order set level
eFigure 2. Comparison of Cohen’s Kappa values across various thresholds before and after alignment
eTable 3. Sensitivity analysis of percentile thresholds for suggestion validation
eTable 4. Distribution and usefulness of generated suggestions by clinical scenario
eTable 5. Distribution and usefulness of generated suggestions by suggestion type
eTable 6. Performance of the filter on the multi-agent system across different probability thresholds
eFigure 3. Physician-facing user interface prototype for reviewing generated suggestions. The interface allows physicians to evaluate suggestions for order sets and provide feedback
eFigure 4. CDS expert-facing user interface prototype for overseeing the review process
eAppendix 3. Proposed real-world implementation and maintenance workflow
eAppendix 4. Detailed methodology and validation of the logistic regression filter
eAppendix 5. Proposed framework for quantitatively evaluating the evidence traceability of generated suggestions
Data Sharing Statement
References
1. Sorace J, Wong HH, DeLeire T, et al. Quantifying the competitiveness of the electronic health record market and its implications for interoperability. Int J Med Inform. 2020;136:104037. doi:10.1016/j.ijmedinf.2019.104037
2. Liu S, Reese TJ, Kawamoto K, Del Fiol G, Weir C. A theory-based meta-regression of factors influencing clinical decision support adoption and implementation. J Am Med Inform Assoc. 2021;28(11):2514-2522. doi:10.1093/jamia/ocab160
3. Kuperman GJ, Gibson RF. Computer physician order entry: benefits, costs, and issues. Ann Intern Med. 2003;139(1):31-39. doi:10.7326/0003-4819-139-1-200307010-00010
4. Wright A, Sittig DF, Carpenter JD, Krall MA, Pang JE, Middleton B. Order sets in computerized physician order entry systems: an analysis of seven sites. AMIA Annu Symp Proc. 2010;2010:892-896.
5. Bates DW, Leape LL, Cullen DJ, et al. Effect of computerized physician order entry and a team intervention on prevention of serious medication errors. JAMA. 1998;280(15):1311-1316. doi:10.1001/jama.280.15.1311
6. McGreevey JD III. Order sets in electronic health records: principles of good practice. Chest. 2013;143(1):228-235. doi:10.1378/chest.12-0949
7. Munasinghe RL, Arsene C, Abraham TK, Zidan M, Siddique M. Improving the utilization of admission order sets in a computerized physician order entry system by integrating modular disease specific order subsets into a general medicine admission order set. J Am Med Inform Assoc. 2011;18(3):322-326. doi:10.1136/amiajnl-2010-000066
8. Ozdas A, Speroff T, Waitman LR, Ozbolt J, Butler J, Miller RA. Integrating “best of care” protocols into clinicians’ workflow via care provider order entry: impact on quality-of-care indicators for acute myocardial infarction. J Am Med Inform Assoc. 2006;13(2):188-196. doi:10.1197/jamia.M1656
9. Wright A, Hickman TTT, McEvoy D, et al. Analysis of clinical decision support system malfunctions: a case series and survey. J Am Med Inform Assoc. 2016;23(6):1068-1076. doi:10.1093/jamia/ocw005
10. Murad MH. Clinical practice guidelines: a primer on development and dissemination. Mayo Clin Proc. 2017;92(3):423-433. doi:10.1016/j.mayocp.2017.01.001
11. Batta A, Kalra BS, Khirasaria R. Trends in FDA drug approvals over last 2 decades: an observational study. J Family Med Prim Care. 2020;9(1):105-114. doi:10.4103/jfmpc.jfmpc_578_19
12. Popovsky MA, Abel MD, Moore SB. Transfusion-related acute lung injury associated with passive transfer of antileukocyte antibodies. Am Rev Respir Dis. 1983;128(1):185-189. doi:10.1164/arrd.1983.128.1.185
13. Falk RJ, Gross WL, Guillevin L, et al; American College of Rheumatology; American Society of Nephrology; European League Against Rheumatism. Granulomatosis with polyangiitis (Wegener’s): an alternative name for Wegener’s granulomatosis. Arthritis Rheum. 2011;63(4):863-864. doi:10.1002/art.30286
14. Klann JG, Phillips LC, Turchin A, Weiler S, Mandl KD, Murphy SN. A numerical similarity approach for using retired Current Procedural Terminology (CPT) codes for electronic phenotyping in the Scalable Collaborative Infrastructure for a Learning Health System (SCILHS). BMC Med Inform Decis Mak. 2015;15(1):104. doi:10.1186/s12911-015-0223-x
15. Feinstein AR. ICD, POR, and DRG: unsolved scientific problems in the nosology of clinical medicine. Arch Intern Med. 1988;148(10):2269-2274. doi:10.1001/archinte.1988.00380100113024
16. Kveim Lie A, Greene JA. From Ariadne’s thread to the labyrinth itself—nosology and the infrastructure of modern medicine. N Engl J Med. 2020;382(13):1273-1277.
17. Doshi-Velez F, Ge Y, Kohane I. Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. Pediatrics. 2014;133(1):e54-e63. doi:10.1542/peds.2013-0819
18. Brown TB, Mann B, Ryder N, et al. Language Models Are Few-Shot Learners. Curran Associates Inc; 2020.
19. Liu S, Wright AP, Patterson BL, et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J Am Med Inform Assoc. 2023;30(7):1237-1245. doi:10.1093/jamia/ocad072
20. Liu S, McCoy AB, Wright AP, et al. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc. 2024;31(6):1388-1396. doi:10.1093/jamia/ocae041
21. Humayun M, Jhanjhi NZ, Almotilag A, Almufareh MF. Agent-based medical health monitoring system. Sensors (Basel). 2022;22(8):2820. doi:10.3390/s22082820
22. Xi Z, Chen W, Guo X, et al. The rise and potential of large language model based agents: a survey. Published online September 14, 2023. Accessed October 9, 2024. https://arxiv.org/abs/2309.07864v3
23. Wimmer H, Yoon VY, Sugumaran V. A multi-agent system to support evidence based medicine and clinical decision making via data sharing and data privacy. Decis Support Syst. 2016;88:51-66. doi:10.1016/j.dss.2016.05.008
24. Wang Z, Mao S, Wu W, Ge T, Wei F, Ji H. Unleashing the emergent cognitive synergy in large language models: a task-solving agent through multi-persona self-collaboration. NAACL. June 2024. Accessed August 15, 2025. https://aclanthology.org/2024.naacl-long.15/
25. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ. 2007;335(7624):806-808. doi:10.1136/bmj.39335.541782.AD
26. Achiam J, Adler S. GPT-4 technical report. OpenAI. Published online March 15, 2023. Accessed October 9, 2024. https://cdn.openai.com/papers/gpt-4.pdf
27. AutoGen. Accessed September 30, 2024. https://microsoft.github.io/autogen/
28. New England Journal of Medicine (NEJM). Journal Watch. 2023. Accessed April 17, 2024. https://www.jwatch.org/
29. Sabatine MS. Pocket Medicine: The Massachusetts General Hospital Handbook of Internal Medicine. 7th ed. Wolters Kluwer; 2019.
30. StatPearls. Accessed April 10, 2024. https://www.statpearls.com/home/index
31. Lessans S, Giannini J, Bogdanovski K, eds. Vanderbilt Internal Medicine Residency Handbook. 6th ed. Vanderbilt University; 2024. Accessed October 9, 2024. https://vim-book.org/
32. Zheng L, Chiang WL, Sheng Y, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Published online June 9, 2023. Accessed April 4, 2024. https://arxiv.org/abs/2306.05685v4
33. de Winter JCF, Dodou D. Five-point Likert items: t test versus Mann-Whitney-Wilcoxon (addendum added October 2012). Practical Assessment, Research, and Evaluation. 2010;15(1):11.
34. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276-282. doi:10.11613/BM.2012.031
35. Kohane I. Systematic characterization of the effectiveness of alignment in large language models for categorical decisions. Published online September 18, 2024. http://arxiv.org/abs/2409.18995. doi:10.1101/2024.09.27.24314486
36. Liu S, McCoy AB, Wright A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J Am Med Inform Assoc. 2025;32(4):605-615. doi:10.1093/jamia/ocaf008