As highlighted in a recent article [1], the outputs of large language models (LLMs) are significantly influenced by the specific prompts used and how they are employed. Therefore, developing effective strategies to optimize the prompting process, and transparently reporting these methods, is crucial for successful clinical research studies involving LLMs [1]. A substantial body of literature exists on ‘prompt engineering,’ much of it presenting innovative prompting techniques or variations. However, many of these methodologies remain complex and opaque to clinicians who do not have a background in computer science. This article introduces basic methods for optimizing prompting and serves as a quick-start manual for clinical researchers, drawing on our recent experience [2]. We begin with a general overview of prompt engineering, abstaining from deep dives into specific techniques, and invite our readers to explore further resources for a comprehensive understanding.
Prompt Engineering
In the context of generative artificial intelligence (AI), a prompt is an input to a generative AI model designed to generate an output [3]. Simply put, it is the user’s input to the AI model, which, for LLMs, is the input text. Prompt engineering for LLMs essentially involves language engineering, where experiments are conducted with the syntax and content of the prompt to achieve the desired result. One might ask, why do we need to engineer language rather than simply state what we want the LLM to do? There are three reasons: prompt brittleness, the LLM’s insufficient domain knowledge, and the intrinsic capability of the LLM.
Even with perfectly precise and clear instructions, a prompt may not perform well, particularly with weaker LLMs. Surprisingly, even minor changes, such as the addition of a single word or space, can lead to substantial discrepancies. This sensitivity to seemingly trivial alterations in prompt construction is known as ‘prompt brittleness’ [4]. For example, although “Calculate the LI-RADS category” and “Determine the LI-RADS category” are semantically identical, the change of a single word can drastically alter the model’s output (Fig. 1). Practically, this means that multiple rounds of trial and error are required to determine the optimal syntax. This is akin to hyperparameter searching in machine learning, and automatic syntax optimization is an emerging field [5].
Fig. 1. Comparison of outputs following a minor change in the prompt. The model used is Microsoft’s Phi-3-mini-4k-instruct, quantized to 8 bits [13]. The temperature is set to 0 for a deterministic output. A: The prompt “Calculate the LI-RADS category…” leads the model to hallucinate a non-existing LI-RADS category, C3. B: Changing the prompt to “Determine the LI-RADS category…” results in the model outputting a valid and correct LI-RADS category of 5. LI-RADS = Liver Imaging Reporting and Data System.
Even with perfect syntax, simply stating what one wants may not yield good results if the LLM lacks the relevant background knowledge in its parametric memory (knowledge stored in the model’s parameters) to perform a given task [6]. Therefore, researchers must provide detailed information within the prompt, which is where expert knowledge becomes valuable. Experts in their respective fields can curate detailed manuals for a given task and fill knowledge gaps by reviewing LLM errors. Generally, making the prompt more specific improves performance but carries a small possibility of diluting attention to other important content. The model’s capability to follow instructions and reason determines how specific the prompt can be. To summarize, three components must be considered to achieve successful outcomes through prompt engineering: syntax, content, and the capability of the LLM.
Model Adjustment Regarding Stochasticity
Most proprietary LLMs accessed through chat interfaces are nondeterministic. The key point here is the chat interface: randomness is intentionally introduced so that responses read as natural, conversational language. Thus, using the same prompt repeatedly generates different outputs, which is referred to as ‘stochasticity.’ Consequently, experiments conducted through a chat interface can be expected to have relatively low reproducibility. To maximize reproducibility, one should use the application programming interface (API) through code, or an interactive webpage simulating the API (e.g., GPT-4 playground, Claude console), to control the level of stochasticity. Temperature is a hyperparameter that controls the randomness of the model’s output; lower values make the output more deterministic and less random. Top-k sampling limits the options for the next token in a model’s response to the k most probable tokens at each step [7]. Setting the temperature to zero or k to one makes the model select the most likely token, rendering its operation essentially deterministic [8]. However, note that this randomness is not completely avoidable [9,10]. One method to handle stochastic models is to perform multiple runs using the API. Locally downloadable models can be set to completely deterministic behavior by using greedy sampling [11].
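As a minimal sketch of these settings (not our study code), the snippet below shows temperature control through the API and greedy decoding with a local model; it assumes the openai Python SDK (v1+) with an API key in the environment and the Hugging Face transformers library, and the prompt text is a placeholder.

```python
# Minimal sketch: controlling stochasticity via the API and via a local model.
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

# Proprietary model through the API: temperature 0 minimizes, but does not eliminate, randomness [9,10].
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Determine the LI-RADS category ..."}],  # placeholder prompt
    temperature=0,
    seed=42,  # request reproducible sampling where the API supports it
)
print(response.choices[0].message.content)

# Locally downloadable model: do_sample=False selects greedy search, which is fully deterministic [11].
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
inputs = tokenizer("Determine the LI-RADS category ...", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```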
Additional Empirical Tips
Here, we present additional tips on prompt optimization acquired from a recent study (Fig. 2A) [2].
Fig. 2. Excerpt of the prompts used in Gu et al. [2] and an example demonstrating the effect of prompt chaining. A: The summarization prompt is executed first, inserting each report in the ```(input)```. The model output from this prompt is then used as the ```(input)``` of the feature extraction prompt. For illustration, the detailed instructions on extracting the lesion location are preserved, whereas details of the other subtasks are abbreviated. B: Effect of prompt chaining. On the left side, the model is asked to summarize the report first, and then extract the features using only the summary. However, the model fails to ignore the original report and incorrectly extracts a feature from a non-index lesion. On the right side, the prompt is divided into two parts. The response from the summarization prompt is fed into the feature extraction prompt as a separate chat session. This approach extracts features from only the index lesion, as the original report is not accessible to the model.
Sequential Prompts Instead of a Single Prompt Containing All Instructions
Our initial prompt was simple: “Extract the following features and calculate the LI-RADS category from the report.” Although accurate for the first few samples, it frequently produced errors as we scaled up: the model did not focus on the index lesion and included irrelevant findings. We therefore first prompted the model to create a summary report and then used the summarized information to produce the final answers, which yielded better results. This approach of splitting instructions into separate prompts worked better when we physically separated them into two chat sessions, effectively filtering out noise, than when we used both prompts in a single chat session (Fig. 2B); with the latter, the model could not “unsee” the previous prompt. The use of sequential prompts is known as ‘prompt chaining’ [3].
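A minimal sketch of prompt chaining is shown below, assuming the openai Python SDK; the prompt wording paraphrases Fig. 2A rather than reproducing the study prompts. The essential point is that the second call opens a fresh session that receives only the summary, so the original report cannot leak into feature extraction.

```python
# Minimal sketch of prompt chaining (assumed openai SDK; prompts are paraphrased, not the study prompts).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single prompt in its own chat session (no shared history)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

report = "..."  # one free-text radiology report

# Step 1: summarization prompt, which sees the original report.
summary = ask(f"Summarize the imaging findings of the index lesion in this report:\n```{report}```")

# Step 2: feature extraction in a separate session; only the summary is provided,
# so the model cannot "see" the original report.
features = ask(f"Extract the LI-RADS features from this summary:\n```{summary}```")
print(features)
```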
Fine-Grained Instructions
The prompt details evolved from providing a glossary to incorporating fine-grained rules for handling specific cases. These adjustments were tested incrementally, added one at a time after each round of evaluation on a separate dataset reserved for prompt optimization, distinct from the test dataset. The approach had to be incremental because of prompt brittleness: new errors frequently emerged after seemingly unrelated prompt changes.
Use of Numbered Lists and Bullet Points
Structured prompts, such as numbered lists or bullet points, offer two advantages: they facilitate modifications during prompt engineering and may enhance performance. The latter is based on the hypothesis that LLMs prefer structured instructions. This supposition is supported by the structured nature of instruction-tuning datasets [12] and by the common use of structured formats in manuals written for humans, which suggests that complex instructions in the pretraining and supervised fine-tuning data were more likely to appear in structured formats.
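For illustration, a structured prompt might look like the sketch below; the numbered rules are hypothetical and are not the prompt used in our study. Each rule can be added, revised, or removed independently between evaluation rounds.

```python
# Hypothetical structured prompt: numbered rules are easy to edit and re-test one at a time.
EXTRACTION_PROMPT = """You are extracting LI-RADS features from a liver MRI report summary.
Follow these rules:
1. Consider only the index lesion.
2. Report the lesion size in millimeters; if several measurements are given, use the largest.
3. For each ancillary feature, answer "positive", "negative", or "not mentioned".
4. Return one line per feature in the form "feature: value".

Summary:
{input}
"""
```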
Overall Setup for Prompt Optimization
This final section presents an example setup for prompt optimization in a clinical research study using an LLM.
Step 1: Is Your Task Suitable for an LLM?
Not all language tasks require an LLM. For example, extracting the Breast Imaging Reporting and Data System (BI-RADS) category from mammography reports may be trivial using regular expressions or an Excel function. However, some tasks are overly complex. A general rule of thumb is that if a task is difficult for a professional in the domain, it is also likely to be difficult for an LLM.
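For instance, a short regular expression is often enough for this kind of extraction; the sketch below uses a hypothetical report sentence, and a real pattern would need to be tuned to local report wording.

```python
import re

# Hypothetical report wording; real reports vary, and the pattern would need local tuning.
report = "IMPRESSION: No suspicious mass or calcification. BI-RADS category 1."

match = re.search(r"BI-?RADS(?:\s+category)?\s*:?\s*([0-6][abc]?)", report, re.IGNORECASE)
birads = match.group(1) if match else None
print(birads)  # -> 1
```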
Step 2: Prepare Data for Prompt Optimization Separately From Test Data
When collecting research data, ensure that they are properly de-identified and consult the institutional review board. Separate the data used for prompt optimization from the test data to ensure a fair evaluation of LLM performance. If data are limited, one may consider creating synthetic data for prompt optimization and reserving real data for testing; the test set should consist solely of real data. Test data should not be examined during prompt optimization or synthetic data creation.
Step 3: Generate the Labels and Set the Metric
Generating high-quality labels is labor-intensive; therefore, careful planning is essential. For example, in our previous study, we initially labeled the ancillary features favoring malignancy collectively as either positive or negative. However, the LLM performed better when evaluating each ancillary feature separately, so we revised our approach and labeled each ancillary feature individually as positive or negative. Perform a pilot study on a handful of cases to determine the optimal labeling method before applying it to the entire dataset. Additionally, establish a metric for optimizing the prompts; for example, one may decide to maximize classification accuracy.
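As a minimal illustration of such a metric (the labels and predictions below are invented, not study data), per-feature labels can be scored with exact-match accuracy:

```python
# Exact-match accuracy over individually labeled fields (hypothetical example data).
labels      = {"category": "LR-5", "nonrim_APHE": "positive", "washout": "positive"}
predictions = {"category": "LR-5", "nonrim_APHE": "positive", "washout": "negative"}

correct = sum(predictions[feature] == value for feature, value in labels.items())
accuracy = correct / len(labels)
print(f"accuracy = {accuracy:.2f}")  # -> accuracy = 0.67
```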
Step 4: Set Up and Run the Evaluation Loop
Set up a feedback loop to calculate the accuracy for each prompt. Conduct experiments with different prompt formats and engineering techniques. It is critical to monitor the model’s errors, as this feedback is essential for refining the prompt content. Automate the evaluation process if possible to manage numerous prompt modifications efficiently, as multiple iterations may be required.
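A minimal sketch of such a loop is given below, assuming the openai Python SDK; the prompt variants, helper function, and optimization set are hypothetical placeholders, not the study materials.

```python
# Minimal sketch of an evaluation loop over competing prompt variants (all names are placeholders).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Prompt-optimization set: (report text, reference label) pairs, kept separate from the test set.
optimization_set = [
    ("... report 1 ...", "LR-5"),
    ("... report 2 ...", "LR-3"),
]

prompt_variants = {
    "v1_single": "Determine the LI-RADS category of the index lesion. Answer with the category only.\n```{report}```",
    "v2_rules": "Follow the numbered rules, then give the LI-RADS category only.\n1. Consider only the index lesion.\n```{report}```",
}

def evaluate(template: str, dataset) -> tuple:
    correct, errors = 0, []
    for report, label in dataset:
        answer = ask(template.format(report=report)).strip()
        if answer == label:
            correct += 1
        else:
            errors.append((report, label, answer))  # error review drives the next prompt revision
    return correct / len(dataset), errors

for name, template in prompt_variants.items():
    accuracy, errors = evaluate(template, optimization_set)
    print(f"{name}: accuracy = {accuracy:.2f}, errors = {len(errors)}")
```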
Step 5: Finalize the Prompt and Run on the Test Dataset
After evaluating the target metric on the prompt-optimization dataset, finalize the prompts and make no further changes. Then, apply the same prompts and methods to the test data. For nondeterministic models (e.g., GPT-4, Claude), run the test multiple times to assess consistency [1]. There is no consensus regarding the optimal number of runs; this issue requires further exploration. It is crucial to log the experimental details thoroughly, as many details need to be reported [1].
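A minimal sketch of repeated test runs with logging is shown below, again assuming the openai Python SDK; the number of runs, prompt text, and file name are illustrative choices rather than recommendations.

```python
# Minimal sketch: run the finalized prompt several times on the test set and log every run.
import csv
from openai import OpenAI

client = OpenAI()
N_RUNS = 5  # no consensus exists on the optimal number of runs; chosen here only for illustration

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

final_prompt = "Determine the LI-RADS category of the index lesion. Answer with the category only.\n```{report}```"
test_set = [("... test report ...", "LR-5")]  # hypothetical (report, label) pairs

with open("test_runs_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run", "label", "prediction"])  # logged in full for later reporting [1]
    for run in range(1, N_RUNS + 1):
        for report, label in test_set:
            prediction = ask(final_prompt.format(report=report)).strip()
            writer.writerow([run, label, prediction])
```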
Footnotes
Conflicts of Interest: The authors have no potential conflicts of interest to disclose.
Author Contributions:
- Conceptualization: Jeong Hyun Lee.
- Project administration: Jeong Hyun Lee.
- Supervision: Jaeseung Shin.
- Writing–original draft: Jeong Hyun Lee.
- Writing–review & editing: Jaeseung Shin.
Funding Statement: None
References
- 1. Park SH, Suh CH. Reporting guidelines for artificial intelligence studies in healthcare (for both conventional and large language models): what’s new in 2024. Korean J Radiol 2024;25:687–690. doi: 10.3348/kjr.2024.0598
- 2. Gu K, Lee JH, Shin J, Hwang JA, Min JH, Jeong WK, et al. Using GPT-4 for LI-RADS feature extraction and categorization with multilingual free-text reports. Liver Int 2024;44:1578–1587. doi: 10.1111/liv.15891
- 3. Schulhoff S, Ilie M, Balepur N, Kahadze K, Liu A, Si C, et al. The prompt report: a systematic survey of prompting techniques. arXiv [Preprint]. 2024 [accessed on July 14, 2024]. doi: 10.48550/arXiv.2406.06608
- 4. Kaddour J, Harris J, Mozes M, Bradley H, Raileanu R, McHardy R. Challenges and applications of large language models. arXiv [Preprint]. 2023 [accessed on July 14, 2024]. doi: 10.48550/arXiv.2307.10169
- 5. Khattab O, Singhvi A, Maheshwari P, Zhang Z, Santhanam K, Vardhamanan S, et al. DSPy: compiling declarative language model calls into self-improving pipelines. arXiv [Preprint]. 2023 [accessed on July 14, 2024]. doi: 10.48550/arXiv.2310.03714
- 6. Wang B, Yue X, Su Y, Sun H. Grokked transformers are implicit reasoners: a mechanistic journey to the edge of generalization. arXiv [Preprint]. 2024 [accessed on July 14, 2024]. doi: 10.48550/arXiv.2405.15071
- 7. Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text degeneration. arXiv [Preprint]. 2019 [accessed on July 14, 2024]. doi: 10.48550/arXiv.1904.09751
- 8. Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology 2024;310:e232756. doi: 10.1148/radiol.232756
- 9. Anthropic. Create a text completion [accessed on July 14, 2024]. Available at: https://docs.anthropic.com/en/api/complete
- 10. OpenAI. Reproducible outputs [accessed on July 14, 2024]. Available at: https://platform.openai.com/docs/guides/text-generation/reproducible-outputs
- 11. Hugging Face. Text generation strategies [accessed on July 14, 2024]. Available at: https://huggingface.co/docs/transformers/main/en/generation_strategies#greedy-search
- 12. Taori R, Gulrajani I, Zhang T, Dubois Y, Li X, Guestrin C, et al. Alpaca: a strong, replicable instruction-following model [accessed on July 14, 2024]. Available at: https://crfm.stanford.edu/2023/03/13/alpaca.html
- 13. Abdin M, Jacobs SA, Awan AA, Aneja J, Awadallah A, Awadalla H, et al. Phi-3 technical report: a highly capable language model locally on your phone. arXiv [Preprint]. 2024 [accessed on July 14, 2024]. doi: 10.48550/arXiv.2404.14219