Editor—We read with interest the study by Mija and colleagues1 on whether ChatGPT can assist anaesthesiologists in making evidence-based clinical decisions regarding postoperative pain treatment for a selection of specific surgical procedures. The authors commendably compared the output of two GPT models (GPT-3.5 and GPT-4) with the evidence-based recommendations provided in the procedure-specific postoperative pain management (PROSPECT) guidelines. They found that, irrespective of the GPT version, GPT provided treatment recommendations that deviated from PROSPECT, either by advising treatments that PROSPECT does not recommend or by omitting recommended options. Interestingly, GPT-4 did not always outperform the earlier model GPT-3.5.1 We recently found similar results with regard to pain management recommendations when asking GPT-4 to generate preanaesthetic plans for a variety of patients.2 Both studies conclude that GPT-4-guided recommendations should be interpreted cautiously.
It is important to explore which factors influence the quality of ChatGPT's output. As GPT models evolve rapidly, responses vary over time. In addition, prompt optimisation can significantly improve the quality of GPT's output. These aspects are often overlooked in the existing literature, where many publications focus on general performance metrics or static evaluations of large language models. The current study was designed to investigate specific factors that influence the dynamics of GPT's output. To examine this, we selected a patient from the paper by Mija and colleagues1 (video-assisted thoracoscopic surgery [VATS] lobectomy) and applied the same prompt to ChatGPT model GPT-4 (gpt-4-turbo-2024-04-09) and its newer models GPT-4o (gpt-4o-2024-11-20) and the preview version of GPT-o1 (gpt-o1-2024-09-12) on November 20, 2024. Next, we investigated whether refinement of the prompt would increase the quality of the output of GPT-4o. To this end, we followed the recommendations made by OpenAI for prompt engineering3 and applied them to the prompt used by Mija and colleagues1 for the VATS lobectomy patient. Specifically, we added a role context, provided more detail on the procedure, stated that we wanted to comply with evidence-based medicine when caring for the patient, and asked only for recommendations that follow international guidelines. Session history and learning mode were disabled.
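By way of illustration, the sketch below shows how these prompt-engineering steps (role context, procedural detail, and a guideline constraint) can be expressed as a chat-style prompt. The wording is a hypothetical example of ours, not the actual optimised prompt, which is provided in the supplemental materials.

```python
# Hypothetical illustration of the prompt refinements described above;
# the actual optimised prompt is available in the supplemental materials.
messages = [
    {
        # Role context: frame the model as a perioperative specialist
        "role": "system",
        "content": (
            "You are an experienced anaesthesiologist. Base all advice on "
            "evidence-based medicine and recommend only treatments supported "
            "by international guidelines."
        ),
    },
    {
        # Procedural detail: name the exact operation instead of a generic query
        "role": "user",
        "content": (
            "Provide a postoperative pain management plan for an adult patient "
            "undergoing an elective video-assisted thoracoscopic surgery (VATS) "
            "lobectomy. List only interventions recommended by international "
            "guidelines."
        ),
    },
]
```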
The table in the supplemental materials shows the results obtained by inserting the prompt constructed by Mija and colleagues1 into the various GPT models. The newer GPT-4o model showed lower agreement with the PROSPECT guidelines than the results obtained by Mija and colleagues1 using the older models GPT-3.5 and GPT-4 (agreement with PROSPECT items, GPT-4o vs GPT-3.5 and GPT-4: 61% vs 67% and 72%; kappa statistic 0.24 vs 0.30 and 0.46, respectively). However, the newest available model, GPT-o1, outperformed all previous GPT models (GPT-o1 agreement with PROSPECT 78%; kappa statistic 0.53). Interestingly, the GPT-4 model at the moment of prompting by Mija and colleagues1 (model build version unknown) performed considerably better than the GPT-4 model (gpt-4-turbo-2024-04-09) at the moment of our prompting (agreement with PROSPECT items 72% vs 61%; kappa statistic 0.46 vs 0.33, respectively; see supplemental table).
Table 1.
Interventions for VATS lobectomy from PROSPECT and recommendations given by GPT-4o (gpt-4o-2024-08-06), obtained using the prompt by Mija and colleagues1 and with the optimised prompt. Deviations from PROSPECT are marked with an asterisk (*). COX-2, cyclo-oxygenase-2; NSAID, nonsteroidal anti-inflammatory drug; PCA, patient-controlled analgesia; PROSPECT, procedure-specific postoperative pain management; TENS, transcutaneous electric nerve stimulation; VATS, video-assisted thoracoscopic surgery.
| Number | Items | PROSPECT | GPT-4o (prompt by Mija1) | GPT-4o (optimised prompt) |
|---|---|---|---|---|
| 1 | Paracetamol | Yes | Yes | Yes |
| 2 | NSAID/COX-2-specific inhibitors | Yes | Yes | Yes |
| 3 | Opioids (PCA)/rescue | Yes | Yes | Yes |
| 4 | Gabapentinoids | No | Yes* | No |
| 5 | Neuraxial (epidural) analgesia | No | Yes* | No |
| 6 | Cryotherapy | No | No | No |
| 7 | Local infiltration (single) | No | Yes* | No |
| 8 | Lidocaine infusion | No | Yes* | No |
| 9 | Continuous local infiltration or repeat injection | No | No | No |
| 10 | Paravertebral block | Yes | Yes | Yes |
| 11 | Erector spinae block | Yes | Yes | Yes |
| 12 | Serratus anterior plane block | Yes | No* | Yes |
| 13 | Intercostal nerve block | No | Yes* | No |
| 14 | Intrapleural analgesia | No | No | No |
| 15 | Dexamethasone | No | No | No |
| 16 | Transcutaneous electric nerve stimulation (TENS) | No | No | No |
| 17 | Magnesium sulfate | No | No | No |
| 18 | Dexmedetomidine | Yes | No* | Yes |
| | Agreement, GPT vs PROSPECT (% of items) | – | 61 | 100 |
| | Kappa statistic, GPT vs PROSPECT | – | 0.24 | 1.0 |
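The agreement percentages and kappa statistics reported in Table 1 can be reproduced directly from the Yes/No columns. A minimal sketch in plain Python, applying the standard formula for Cohen's kappa to the two-rater, two-category data above:

```python
# "Yes"/"No" judgements for the 18 interventions in Table 1 (True = "Yes").
prospect = [True, True, True, False, False, False, False, False, False,
            True, True, True, False, False, False, False, False, True]
gpt4o_mija = [True, True, True, True, True, False, True, True, False,
              True, True, False, True, False, False, False, False, False]

n = len(prospect)

# Observed agreement: fraction of items on which both sources agree.
p_o = sum(a == b for a, b in zip(prospect, gpt4o_mija)) / n

# Chance agreement expected from the marginal "Yes" rates of each rater.
p_yes_prospect = sum(prospect) / n
p_yes_gpt = sum(gpt4o_mija) / n
p_e = p_yes_prospect * p_yes_gpt + (1 - p_yes_prospect) * (1 - p_yes_gpt)

# Cohen's kappa corrects observed agreement for agreement expected by chance.
kappa = (p_o - p_e) / (1 - p_e)

print(f"agreement = {p_o:.0%}, kappa = {kappa:.2f}")
# Output: agreement = 61%, kappa = 0.24 (as in Table 1)
```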
The second finding was that applying the optimised prompt to GPT-4o significantly improved the quality of the output. The optimised prompt is provided in the supplemental materials. We found that with the updated prompt, GPT-4o referred to the PROSPECT guidelines and followed these with perfect agreement (agreement with PROSPECT 100%, kappa statistic 1.0) compared with the fair agreement seen when using the original prompt by Mija and colleagues (agreement 61%; kappa statistic 0.24; see Table 1).
These findings have important implications for future studies using OpenAI's GPT models and for the interpretation of prior results obtained with GPT. The first implication is that newer GPT models do not necessarily outperform older models. In addition, output can differ even when the same prompt is used with the same GPT model at a different time. Several factors might account for this variable output. The default temperature setting of 0.7 and the top-p parameter in GPT introduce variability in model responses, resulting in differing outputs for the same prompt. Both settings can be adjusted to reduce variability, but only through application programming interfaces (APIs), such as the one provided by OpenAI, which requires specialised knowledge and programming skills. In addition, OpenAI frequently deploys updates to its models. Although major updates are marked with a deployment date (e.g. gpt-4o-2024-08-06), it is conceivable that minor updates are implemented continuously without notification, making output generation potentially time-sensitive. This lack of model versioning transparency is exacerbated by the fact that OpenAI has stopped publishing version numbers of major model versions. Lastly, when GPT models are accessed through ChatGPT, previous session context might be retained as memory and influence subsequent outputs if session history and learning mode are not explicitly disabled.
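To make the API route concrete, the sketch below shows how the sampling parameters can be pinned using OpenAI's Python SDK. The parameter values and the placeholder prompt are our illustrative choices, not settings taken from either study.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # pin a dated snapshot rather than a floating alias
    temperature=0,              # minimise sampling variability
    top_p=1,                    # disable nucleus-sampling truncation
    messages=[{"role": "user", "content": "..."}],  # insert the study prompt here
)
print(response.choices[0].message.content)
```

A further advantage of this route is that API calls carry no session memory by default, so the context-carryover issue described above does not arise.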
The second implication is that GPT users must be aware that a high-quality prompt is of vital importance. We noted that prompt optimisation significantly improved the quality of GPT's output. Users can refer to prompt-engineering guides provided by OpenAI3 or others4, 5, 6 and, if possible, apply these to their query. Researchers can also use so-called 'prompt engineering playgrounds' to test their prompts before applying them in their study.
In conclusion, this study highlights that results obtained from GPT models are highly sensitive to the model version, the time of prompting, and the quality of the prompt used. Optimising prompts and reducing the temperature parameter might significantly enhance the quality of GPT's output to (clinical) queries. Future research should prioritise prompt engineering and use APIs to control model selection and settings. Additionally, caution is advised when using ChatGPT for research purposes, as variability introduced by unversioned updates can undermine research reproducibility and ChatGPT's reliability as a clinical decision support tool.
Authors’ contributions
Study design: MB, NF, MAM
Literature search: MAM, NF, MB
Data extraction and analysis: MAM, NF, MB
Statistical analysis: MAM, MB
Writing manuscript: MB, MAM
Revision of manuscript: MB, MAM, NF
All authors approved the final version for submission and agreed to be accountable for all aspects of the work, thereby ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Declaration of interest
MAM is the founder of the medical data platform Delphyr, which is unrelated to this work. All other authors declare that they have no conflict of interest.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.bja.2024.12.034.
References
- 1. Mija D., Kehlet H., Rosero E.B., Joshi G.P. Evaluating the role of ChatGPT in perioperative pain management versus procedure-specific postoperative pain management (PROSPECT) recommendations. Br J Anaesth. 2024;133:1318–1320. doi: 10.1016/j.bja.2024.09.010.
- 2. Abdel Malek M., van Velzen M., Dahan A., et al. Generation of preoperative anaesthetic plans by ChatGPT-4.0: a mixed-method study. Br J Anaesth. 2024 Nov 14. doi: 10.1016/j.bja.2024.08.038. Epub ahead of print.
- 3. OpenAI. Prompt engineering. 2024. Available from: https://platform.openai.com/docs/guides/prompt-engineering
- 4. Alam S., Rahman A., Sohail S.S. Optimizing ChatGPT-4's radiology performance with scale-invariant feature transform and advanced prompt engineering. Clin Imaging. 2024;118. doi: 10.1016/j.clinimag.2024.110368.
- 5. Lee J., Park S., Shin J., Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak. 2024;24:366. doi: 10.1186/s12911-024-02709-7.
- 6. Savage T., Nayak A., Gallo R., et al. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. arXiv:2308.06834.