Editor—We read with interest the study by Mija and colleagues1 on whether ChatGPT can assist anaesthesiologists in making evidence-based clinical decisions regarding postoperative pain treatment for a selection of specific surgical procedures. The authors commendably compared the output of two GPT models (GPT-3.5 and GPT-4) with the evidence-based recommendations provided in the procedure-specific postoperative pain management (PROSPECT) guidelines. They found that, irrespective of the GPT version, GPT provided treatment recommendations that deviated from PROSPECT, either by advising treatments that PROSPECT does not recommend or by omitting recommended options. Interestingly, GPT-4 did not always outperform the earlier model GPT-3.5.1 We recently found similar results with regard to pain management recommendations when asking GPT-4 to generate preanaesthetic plans for a variety of patients.2 Both studies conclude that GPT-4-guided recommendations should be interpreted cautiously.
It is important to explore which factors influence the quality of ChatGPT's output. As GPT models evolve rapidly, responses vary over time. In addition, prompt optimisation can significantly improve the quality of GPT's output. These aspects are often overlooked in the existing literature, where many publications focus on general performance metrics or static evaluations of large language models. The current study was designed to investigate specific factors that influence the dynamics of GPT's output. To examine this, we selected a patient from the paper by Mija and colleagues1 (video-assisted thoracoscopic surgery [VATS] lobectomy) and applied the same prompt to ChatGPT model GPT-4 (gpt-4-turbo-2024-04-09) and its newer models GPT-4o (gpt-4o-2024-11-20) and the preview version of GPT-o1 (gpt-o1-2024-09-12) on November 20, 2024. Next, we investigated whether refinement of the prompt would increase the quality of the output of GPT-4o. To this end, we followed the recommendations made by OpenAI for prompt engineering3 and applied them to the prompt used by Mija and colleagues1 for the VATS lobectomy patient. Specifically, we added a role context, provided more detail on the procedure, stated that we wanted to comply with evidence-based medicine when caring for the patient, and asked only for recommendations that follow international guidelines. Session history and learning mode were disabled.
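By way of illustration, the sketch below shows how these prompt-engineering steps (role context, procedural detail, and a guideline constraint) can be expressed as a chat-style prompt. The wording is a hypothetical example of ours, not the actual optimised prompt, which is provided in the supplemental materials.

```python
# Hypothetical illustration of the prompt refinements described above;
# the actual optimised prompt is available in the supplemental materials.
messages = [
    {
        # Role context: frame the model as a perioperative specialist
        "role": "system",
        "content": (
            "You are an experienced anaesthesiologist. Base all advice on "
            "evidence-based medicine and recommend only treatments supported "
            "by international guidelines."
        ),
    },
    {
        # Procedural detail: name the exact operation instead of a generic query
        "role": "user",
        "content": (
            "Provide a postoperative pain management plan for an adult patient "
            "undergoing an elective video-assisted thoracoscopic surgery (VATS) "
            "lobectomy. List only interventions recommended by international "
            "guidelines."
        ),
    },
]
```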
The table in the supplemental materials shows the results obtained by inserting the prompt constructed by Mija and colleagues1 into the various GPT models. The newer GPT-4o model showed lower agreement with the PROSPECT guidelines than the results obtained by Mija and colleagues1 using the older models GPT-3.5 and GPT-4 (agreement with PROSPECT items, GPT-4o vs GPT-3.5 and GPT-4: 61% vs 67% and 72%; kappa statistic 0.24 vs 0.30 and 0.46, respectively). However, the newest available model, GPT-o1, outperformed all previous GPT models (GPT-o1 agreement with PROSPECT 78%; kappa statistic 0.53). Interestingly, the GPT-4 model at the moment of prompting by Mija and colleagues1 (model build version unknown) performed considerably better than the GPT-4 model (gpt-4-turbo-2024-04-09) at the moment of our prompting (agreement with PROSPECT items 72% vs 61%; kappa statistic 0.46 vs 0.33, respectively; see supplemental table).
Table 1.
Interventions for VATS lobectomy from PROSPECT and recommendations given by GPT-4o (gpt-4o-2024-08-06), obtained using the prompt by Mija and colleagues1 and with the optimised prompt. Deviations from PROSPECT are marked with an asterisk (*). COX-2, cyclo-oxygenase-2; NSAID, nonsteroidal anti-inflammatory drug; PCA, patient-controlled analgesia; PROSPECT, procedure-specific postoperative pain management; TENS, transcutaneous electric nerve stimulation; VATS, video-assisted thoracoscopic surgery.
| Number | Items | PROSPECT | GPT-4o (prompt by Mija1) | GPT-4o (optimised prompt) |
|---|---|---|---|---|
| 1 | Paracetamol | Yes | Yes | Yes |
| 2 | NSAID/COX-2-specific inhibitors | Yes | Yes | Yes |
| 3 | Opioids (PCA)/rescue | Yes | Yes | Yes |
| 4 | Gabapentinoids | No | Yes* | No |
| 5 | Neuraxial (epidural) analgesia | No | Yes* | No |
| 6 | Cryotherapy | No | No | No |
| 7 | Local infiltration (single) | No | Yes* | No |
| 8 | Lidocaine infusion | No | Yes* | No |
| 9 | Continuous local infiltration or repeat injection | No | No | No |
| 10 | Paravertebral block | Yes | Yes | Yes |
| 11 | Erector spinae block | Yes | Yes | Yes |
| 12 | Serratus anterior plane block | Yes | No* | Yes |
| 13 | Intercostal nerve block | No | Yes* | No |
| 14 | Intrapleural analgesia | No | No | No |
| 15 | Dexamethasone | No | No | No |
| 16 | Transcutaneous electric nerve stimulation (TENS) | No | No | No |
| 17 | Magnesium sulfate | No | No | No |
| 18 | Dexmedetomidine | Yes | No* | Yes |
| | Agreement, GPT vs PROSPECT (% of items) | – | 61 | 100 |
| | Kappa statistic, GPT vs PROSPECT | – | 0.24 | 1.0 |
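The agreement percentages and kappa statistics reported in Table 1 can be reproduced directly from the Yes/No columns. A minimal sketch in plain Python, applying the standard formula for Cohen's kappa to the two-rater, two-category data above:

```python
# "Yes"/"No" judgements for the 18 interventions in Table 1 (True = "Yes").
prospect = [True, True, True, False, False, False, False, False, False,
            True, True, True, False, False, False, False, False, True]
gpt4o_mija = [True, True, True, True, True, False, True, True, False,
              True, True, False, True, False, False, False, False, False]

n = len(prospect)

# Observed agreement: fraction of items on which both sources agree.
p_o = sum(a == b for a, b in zip(prospect, gpt4o_mija)) / n

# Chance agreement expected from the marginal "Yes" rates of each rater.
p_yes_prospect = sum(prospect) / n
p_yes_gpt = sum(gpt4o_mija) / n
p_e = p_yes_prospect * p_yes_gpt + (1 - p_yes_prospect) * (1 - p_yes_gpt)

# Cohen's kappa corrects observed agreement for agreement expected by chance.
kappa = (p_o - p_e) / (1 - p_e)

print(f"agreement = {p_o:.0%}, kappa = {kappa:.2f}")
# Output: agreement = 61%, kappa = 0.24 (as in Table 1)
```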
The second finding was that applying the optimised prompt to GPT-4o significantly improved the quality of the output. The optimised prompt is provided in the supplemental materials. We found that with the updated prompt, GPT-4o referred to the PROSPECT guidelines and followed these with perfect agreement (agreement with PROSPECT 100%, kappa statistic 1.0) compared with the fair agreement seen when using the original prompt by Mija and colleagues (agreement 61%; kappa statistic 0.24; see Table 1).
These findings have important implications for future studies using OpenAI's GPT models and for the interpretation of prior results obtained with GPT. The first implication is that newer GPT models do not necessarily outperform older models. In addition, output can differ even when the same prompt is used with the same GPT model at a different time. Several factors might account for this variable output. The default temperature setting of 0.7 and the top-p parameter in GPT introduce variability in model responses, resulting in differing outputs for the same prompt. Both settings can be adjusted to reduce variability, but only through application programming interfaces (APIs), such as the one provided by OpenAI, which requires specialised knowledge and programming skills. In addition, OpenAI frequently deploys updates to its models. Although major updates are marked with a deployment date (e.g. gpt-4o-2024-08-06), it is conceivable that minor updates are implemented continuously without notification, making output generation potentially time-sensitive. This lack of model versioning transparency is exacerbated by the fact that OpenAI has stopped publishing version numbers of major model versions. Lastly, when GPT models are accessed through ChatGPT, previous session context might be retained as memory and influence subsequent outputs if session history and learning mode are not explicitly disabled.
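To make the API route concrete, the sketch below shows how the sampling parameters can be pinned using OpenAI's Python SDK. The parameter values and the placeholder prompt are our illustrative choices, not settings taken from either study.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # pin a dated snapshot rather than a floating alias
    temperature=0,              # minimise sampling variability
    top_p=1,                    # disable nucleus-sampling truncation
    messages=[{"role": "user", "content": "..."}],  # insert the study prompt here
)
print(response.choices[0].message.content)
```

A further advantage of this route is that API calls carry no session memory by default, so the context-carryover issue described above does not arise.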
The second implication is that GPT users must be aware that a high-quality prompt is of vital importance. We noted that prompt optimisation significantly improved the quality of GPT's output. Users can refer to prompt-engineering guides provided by OpenAI3 or others4, 5, 6 and, if possible, apply these to their query. Researchers can also use so-called 'prompt engineering playgrounds' to test their prompts before applying them in their study.
In conclusion, this study highlights that results obtained from GPT models are highly sensitive to the model version, the time of prompting, and the quality of the prompt used. Optimising prompts and reducing the temperature parameter might significantly enhance the quality of GPT's output to (clinical) queries. Future research should prioritise prompt engineering and use APIs to control model selection and settings. Additionally, caution is advised when using ChatGPT for research purposes, as variability introduced by unversioned updates can undermine research reproducibility and ChatGPT's reliability as a clinical decision support tool.
Authors’ contributions
Study design: MB, NF, MAM
Literature search: MAM, NF, MB
Data extraction and analysis: MAM, NF, MB
Statistical analysis: MAM, MB
Writing manuscript: MB, MAM
Revision of manuscript: MB, MAM, NF
All authors approved the final version for submission and agreed to be accountable for all aspects of the work, thereby ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Declaration of interest
MAM is the founder of the medical data platform Delphyr, which is unrelated to this work. All other authors declare that they have no conflict of interest.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.bja.2024.12.034.
References
- 1. Mija D., Kehlet H., Rosero E.B., Joshi G.P. Evaluating the role of ChatGPT in perioperative pain management versus procedure-specific postoperative pain management (PROSPECT) recommendations. Br J Anaesth. 2024;133:1318–1320. doi: 10.1016/j.bja.2024.09.010.
- 2. Abdel Malek M., van Velzen M., Dahan A., et al. Generation of preoperative anaesthetic plans by ChatGPT-4.0: a mixed-method study. Br J Anaesth. 2024 Nov 14. doi: 10.1016/j.bja.2024.08.038. Epub ahead of print.
- 3. OpenAI. Prompt engineering. 2024. Available from: https://platform.openai.com/docs/guides/prompt-engineering
- 4. Alam S., Rahman A., Sohail S.S. Optimizing ChatGPT-4's radiology performance with scale-invariant feature transform and advanced prompt engineering. Clin Imaging. 2024;118. doi: 10.1016/j.clinimag.2024.110368.
- 5. Lee J., Park S., Shin J., Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak. 2024;24:366. doi: 10.1186/s12911-024-02709-7.
- 6. Savage T., Nayak A., Gallo R., et al. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. arXiv:2308.06834.