
medRxiv [Preprint]. 2026 Feb 24:2026.02.12.26346005. Version 2. doi: 10.64898/2026.02.12.26346005

Prompting is All You Need: How to Make LLMs More Helpful for Clinical Decision Support

Braydon Dymm, Daniel M Goldenholz
PMCID: PMC12934876  PMID: 41757194

Abstract

Importance: Large language models (LLMs) offer potential clinical decision support, but their accuracy varies. Prompt engineering can enhance LLM behavior in clinical settings, yet best practices have not been formally explored in realistic clinical contexts for neurology.

Objective: To evaluate the impact of structured prompting versus naive prompting on the performance of five LLMs (two closed-source: OpenAI GPT-4o and OpenAI o3; three open-source: Meta Llama-4-Scout-17B-16E-Instruct, Llama-3.3-70B-Instruct-Turbo, and the reasoning model r1-1776) for thrombolytic clinical decision support (CDS) in acute stroke.

Design: Models responded to three novel ischemic stroke vignettes using either a naive question ("Should this patient be offered thrombolytics?") or a five-step structured prompt (CARDS) guiding information extraction, timing analysis, contraindication checking, decision-process explanation, and risk-benefit discussion. Outputs were assessed across seven domains: guideline adherence, unsafe recommendations, risk recognition, guideline grading accuracy, inclusion of conversational explanation, clarity, and overall helpfulness.

Results: Structured prompts significantly enhanced performance across most domains, with effects varying between model families. For the closed-source models (GPT-4o, o3), CARDS-style structured prompts improved guideline adherence from 83.3% to 100%, eliminated unsafe recommendations (16.7% to 0%), and increased specific guideline grading accuracy from 0% to 100%. The open-source reasoning model r1-1776 achieved the same top-tier outcomes with structured prompts (100% adherence, 0% unsafe, 100% grading, 100% conversation), with grading and conversation each improving from 0%. In contrast, the other open-source models (Llama-4-Scout, Llama-3.3-70B) showed more modest gains: risk recognition improved (83.3% to 100%) and guideline grading accuracy increased (0% to 66.7%), while deficits persisted in guideline adherence (66.7%) and unsafe recommendations (33.3%). Overall, structured prompting yielded the largest improvements in guideline grading accuracy and conversational reasoning across multiple models.

Conclusion and Relevance: Structured prompting substantially enhances LLM performance for acute stroke thrombolysis CDS. Notably, some models, including the proprietary GPT-4o and o3 and the open-source reasoning model r1-1776, achieved excellent safety and adherence with structured prompts. Structured prompts are crucial for clinical deployment of any LLM, and vigilant human oversight remains essential.
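
To make the design concrete, the two prompting conditions can be sketched in a few lines of Python. This is a minimal, hypothetical sketch: the abstract names the five CARDS steps but not their exact wording, so the prompt text, the build_naive_prompt and build_cards_prompt helpers, and the sample vignette below are illustrative assumptions rather than the authors' materials. It uses the OpenAI Python client with GPT-4o, one of the studied models.

# Minimal sketch of naive vs. CARDS-style structured prompting for
# thrombolysis CDS. Prompt wording is paraphrased from the abstract's
# description of the five steps and is illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NAIVE_PROMPT = "Should this patient be offered thrombolytics?"

# Hypothetical rendering of the five CARDS steps named in the abstract.
CARDS_STEPS = """\
1. Extract the clinically relevant information from the vignette.
2. Analyze symptom onset and presentation timing against the thrombolysis window.
3. Check every absolute and relative contraindication to thrombolytics.
4. Explain your decision process step by step.
5. Discuss the risks and benefits of your recommendation, citing the
   applicable guideline class and level of evidence."""

def build_naive_prompt(vignette: str) -> str:
    return f"Patient vignette:\n{vignette}\n\n{NAIVE_PROMPT}"

def build_cards_prompt(vignette: str) -> str:
    # Wraps the same vignette in the five-step structure.
    return (
        f"Patient vignette:\n{vignette}\n\n"
        "Work through the following steps before deciding whether this "
        f"patient should be offered thrombolytics:\n{CARDS_STEPS}"
    )

def ask(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Invented vignette for illustration only; not one of the study's three cases.
vignette = ("72-year-old, right-sided weakness and aphasia, last known well "
            "2.5 hours ago, BP 150/90, glucose 110 mg/dL, not anticoagulated.")
print(ask(build_naive_prompt(vignette)))
print(ask(build_cards_prompt(vignette)))

In this sketch the naive arm sends the bare question with the vignette, while the structured arm wraps the same vignette in the five CARDS steps; the study's comparison is between outputs elicited by these two conditions.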

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
