Author manuscript; available in PMC: 2025 Sep 6.
Published in final edited form as: JOSPT Methods. 2025 Apr 28;1(2):56–60. doi: 10.2519/josptmethods.2025.0151

Improving ChatGPT’s Performance in Orthopedics: Opportunities Using the CRISPE Framework

Mark Vorensky 1,2, Daniel Peredo 2, Richard Ferraro 1, Emily Paris 2, Asma Mohammadi 2, Paul Spano 2, Smita Rao 3
PMCID: PMC12412748  NIHMSID: NIHMS2087913  PMID: 40919522

SYNOPSIS:

ChatGPT has been increasingly used in clinical practice, education, and research. In orthopedic research, ChatGPT’s accuracy in clinical decision-making has been a major concern, with results ranging from 33% to 80% accuracy. Inaccuracies from ChatGPT can be harmful to clinicians, trainees, or patients when responses appear plausible, are trusted, and acted upon. A critical limitation in orthopedic research is the lack of structured prompt engineering, which significantly impacts ChatGPT’s performance. The CRISPE (Capacity/Role, Insight, Statement, Personality, Experiment) framework offers a systematic approach to refining prompts and improving response accuracy. This Viewpoint applies the CRISPE framework to recent orthopedic research and highlights opportunities to optimize prompts in ChatGPT. While research is needed to validate and refine prompt engineering tools in orthopedics, these methods have the potential to enhance the accuracy and reliability of ChatGPT’s responses and serve as valuable tools in orthopedic practice, education, and research.

Keywords: artificial intelligence, ChatGPT, large language model, musculoskeletal, natural language processing, orthopedics

Graphical Abstract

[Figure: graphical abstract (nihms-2087913-f0001.jpg)]


Since its release in 2022, ChatGPT (Chat Generative Pre-trained Transformer), a chatbot from OpenAI, has rapidly become a prominent tool in education, clinical practice, and research. ChatGPT is a product of artificial intelligence (AI), the simulation of human intelligence, performing tasks such as natural language processing, data analysis, and predictive modeling. ChatGPT is built on a large language model, trained on vast data sets, and uses deep learning algorithms to generate human-like text based on input prompts. This model continuously evolves and is fine-tuned as it is exposed to more data, improving its ability to provide relevant and more human-like responses.

Rightful debate surrounds the use of ChatGPT for research, education, and clinical practice. Its quick responses and interactive nature can enhance efficiency; however, concerns include confidentiality, data security, response bias, and diminishing creativity and originality. A major concern surrounds the prevalence of inaccurate and misleading responses, especially when these appear plausible, are trusted, and acted upon.

Orthopedic research has tested ChatGPT’s accuracy in clinical decision-making.1,2,4,6,8 The accuracy of these decisions is often assessed by comparing ChatGPT’s responses to clinical practice guidelines.2,4,6,8 This has produced a range of results. The accuracy of clinical decisions for low back pain from ChatGPT-3.5 compared with North American Spine Society clinical practice guidelines was 72%.8 For lumbosacral radicular pain, ChatGPT-3.5 showed 33% accuracy compared with multiple clinical practice guidelines.2 ChatGPT-4 demonstrated 79% concordance with American Academy of Orthopaedic Surgeons clinical practice guidelines for rotator cuff tears and anterior cruciate ligament injuries.6 Similarly, ChatGPT-4 demonstrated 80% accuracy relative to physical therapy clinical practice guidelines across orthopedic and sports conditions (100% for upper extremity, 60% for spine, 87% for lower extremity conditions).4 When assessed by licensed physical therapists, ChatGPT-3.5 did not include elements of physical therapy reassessment and subjective examination 30% and 40% of the time, respectively.1 With a wide range of accuracies reported, conclusions are appropriately cautious, such as, “Patients and consumers, who may not have the training to critically evaluate clinical guidelines, should avoid relying on chatbots for musculoskeletal health advice.”2(p227) However, key limitations in current research lead to the question: Is the problem with ChatGPT or the way we use it?

The Primary Limitation of Orthopedic Research Assessing ChatGPT

Single-line prompts are often used when assessing ChatGPT’s accuracy. For example, “Should electrotherapies (such as TENS/PENS/interferential therapy) be used in the management of nonspecific low back pain and sciatica?”2(p224) and “Which are the main steps for a completed physiotherapy assessment?”1(p2945) This highlights a consistent limitation of orthopedic research assessing ChatGPT’s accuracy: insufficient prompt engineering.

Prompt engineering “focuses on developing and optimizing prompts to effectively utilize large language models.”3(p2629) Although prompt engineering is a relatively new area of study, several techniques and frameworks have been proposed.7,10 Current research evaluating ChatGPT’s accuracy in providing orthopedic recommendations often does not systematically apply these prompting strategies, potentially leading to variable results.1,2,4,6,8 To address these inconsistencies, use of ChatGPT can be optimized through structured prompt frameworks. A promising framework, previously examined in education research, is abbreviated as CRISPE (Capacity/Role, Insight, Statement, Personality, Experiment).5,9 The CRISPE framework offers a systematic approach to crafting prompts by defining the following:

  • CR: Capacity/Role – specify the expertise and role(s).

  • I: Insight – provide background information and context.

  • S: Statement – articulate what you would like the chatbot to do.

  • P: Personality – indicate the style of the response.

  • E: Experiment – retrieve multiple examples.
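The 5 CRISPE components lend themselves to a simple programmatic structure, which can help keep prompts consistent across repeated trials. A minimal Python sketch follows; the `CrispePrompt` class and its field names are our own illustration, not part of any published tool:

```python
from dataclasses import dataclass


@dataclass
class CrispePrompt:
    """Container for the five CRISPE components of a prompt."""
    capacity_role: str  # CR: expertise/role the chatbot should assume
    insight: str        # I: background information and context
    statement: str      # S: what you would like the chatbot to do
    personality: str    # P: desired style of the response
    n_trials: int = 3   # E: number of repeated runs of the same prompt

    def build(self) -> str:
        """Join the four text components into a single prompt string."""
        return " ".join(
            [self.capacity_role, self.insight, self.statement, self.personality]
        )


prompt = CrispePrompt(
    capacity_role="You are an expert physical therapist.",
    insight="Use clinical practice guidelines.",
    statement=(
        "Should electrotherapies (such as TENS/PENS/interferential therapy) "
        "be used in the management of nonspecific low back pain and sciatica?"
    ),
    personality="Respond as if you are training a physical therapist.",
)
print(prompt.build())
```

The assembled string would then be submitted to the chatbot `n_trials` times, with chat history cleared between runs.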

Examining prompt frameworks, such as CRISPE, has the potential to help researchers understand whether ChatGPT lacks accuracy, if the user is inadequately engineering their prompts, or if there is a combination of both issues. For example, the study by Shrestha et al showed an accuracy for low back pain recommendations that was substantially higher than that of the study by Gianola et al on lumbosacral radicular pain published the same year (72% versus 33%, respectively).2,8 This is likely due to additional prompting provided by Shrestha et al, “Imagine you are an experienced orthopedics spine surgeon with a knowledgeable background in the latest research in the field of low back pain. Answer the following prompts based on evidence-based research studies and point out any lack of evidence to support a point (i.e. point out if there is a lack of study to support or refute your answer).”8(p645) Adding this information to the prompt significantly reduced the frequency of insufficient and conflicting responses compared to the question alone.8

Excluding the study by Shrestha et al, there has been minimal to no focus on prompt engineering when assessing ChatGPT’s accuracy for orthopedic recommendations. Even in the study by Shrestha et al, it is unclear which specific parts of the additional prompt reduced insufficient or conflicting responses. Was it the request for “evidence-based research” or was it the role in the prompt, “an experienced orthopedics spine surgeon”?8(p645) Systematic approaches that rigorously evaluate details of a prompt are needed to fully understand how to retrieve accurate and reliable information. This limitation opens a wide window of opportunity for orthopedic research, spanning from how prompts are effectively constructed to the ways patients and providers use prompting strategies. Next, we will share an example of how orthopedic researchers may consider prompt engineering in research on the accuracy of ChatGPT.

Systematically Assessing the Accuracy of ChatGPT

For this example, we will revisit the question from Gianola et al, “Should electrotherapies (such as TENS/PENS/interferential therapy) be used in the management of nonspecific low back pain and sciatica?”2(p224) Gianola et al found that ChatGPT was inaccurate in its recommendations for this question; however, prompt engineering can be used to understand how to advance ChatGPT’s accuracy. First, the prompt above could be varied across the Capacity/Role, Insight, Statement, Personality, and/or Experiment:

  • CR: You are an expert physical therapist.

  • I: Use clinical practice guidelines.

  • S: Should electrotherapies (such as transcutaneous electrical nerve stimulation/percutaneous electrical nerve stimulation/interferential therapy) be used in the management of nonspecific low back pain and sciatica?

  • P: Respond as if you are training a physical therapist.

  • E: Run the search multiple times.

For a structured and systematic approach, it is recommended that researchers modify 1 factor at a time to examine what element(s) alter ChatGPT’s accuracy. For example, the TABLE and the attached infographic show 2×2 factorial designs that have the potential to determine if explicitly defining the role of ChatGPT and/or adding insight to clinical practice guidelines improves accuracy. There are a range of research opportunities using this methodology, including evaluating how different factors within a prompt affect ChatGPT’s responses and how factors within a prompt interact with each other.

TABLE.

A 2×2 Factorial Design Assessing the Impact of Capacity/Role and Insight on ChatGPT’s Accuracy

No Insight × No Capacity/Role:

  • No capacity/role is added to the prompt.

  • No insight is added to the prompt.

Prompt: “Should electrotherapies (such as TENS/PENS/interferential therapy) be used in the management of nonspecific low back pain and sciatica?”2(p224)

No Insight × Added Capacity/Role:

  • “You are an expert physical therapist” is added to the prompt.

  • No insight is added to the prompt.

Prompt: “You are an expert physical therapist. Should electrotherapies (such as TENS/PENS/interferential therapy) be used in the management of nonspecific low back pain and sciatica?”2

Added Insight × No Capacity/Role:

  • No capacity/role is added to the prompt.

  • “Use clinical practice guidelines” is added to the prompt.

Prompt: “Should electrotherapies (such as TENS/PENS/interferential therapy) be used in the management of nonspecific low back pain and sciatica? Use clinical practice guidelines.”2

Added Insight × Added Capacity/Role:

  • “You are an expert physical therapist” is added to the prompt.

  • “Use clinical practice guidelines” is added to the prompt.

Prompt: “You are an expert physical therapist. Should electrotherapies (such as TENS/PENS/interferential therapy) be used in the management of nonspecific low back pain and sciatica? Use clinical practice guidelines.”2

Abbreviations: PENS, percutaneous electrical nerve stimulation; TENS, transcutaneous electrical nerve stimulation.
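The 4 cells of a 2×2 design like this can be generated programmatically so that only 1 factor varies at a time and every cell shares identical wording for the unchanged components. A minimal Python sketch follows; the `build_variant` helper and variable names are our own illustration:

```python
from itertools import product

ROLE = "You are an expert physical therapist."
INSIGHT = "Use clinical practice guidelines."
QUESTION = (
    "Should electrotherapies (such as TENS/PENS/interferential therapy) "
    "be used in the management of nonspecific low back pain and sciatica?"
)


def build_variant(add_role: bool, add_insight: bool) -> str:
    """Assemble one cell of the 2x2 design: optional role prefix,
    the fixed question, and an optional insight suffix."""
    parts = []
    if add_role:
        parts.append(ROLE)
    parts.append(QUESTION)
    if add_insight:
        parts.append(INSIGHT)
    return " ".join(parts)


# The four cells of the factorial design, keyed by (role, insight) flags.
variants = {
    (role, insight): build_variant(role, insight)
    for role, insight in product([False, True], repeat=2)
}
for (role, insight), text in variants.items():
    print(f"role={role}, insight={insight}: {text}")
```

Each of the 4 strings would then be submitted in its own fresh chat session and scored against the reference clinical practice guideline.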

An important consideration is testing prompts multiple times, the “E” of CRISPE. Because ChatGPT considers prior prompts, we recommend that between each search, all prior chats are deleted (in General Settings) and ChatGPT’s memory is cleared (in Personalization Settings). Conducting repeated trials and analyzing variability in responses can help assess inherent model limitations and/or inconsistencies.
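Once repeated trials are collected and each response has been classified (eg, by blinded reviewers as accurate, insufficient, or conflicting), summarizing variability across runs is straightforward. A minimal Python sketch follows; the `trial_summary` function, category labels, and ratings are our own hypothetical illustration:

```python
from collections import Counter


def trial_summary(classifications: list) -> dict:
    """Convert a list of per-trial classifications into the proportion
    of trials falling in each category, exposing response variability."""
    counts = Counter(classifications)
    n = len(classifications)
    return {label: count / n for label, count in counts.items()}


# Hypothetical reviewer ratings of 5 repeated runs of the same prompt.
ratings = ["accurate", "accurate", "insufficient", "accurate", "conflicting"]
print(trial_summary(ratings))
```

A large spread across categories for an identical prompt would point to inherent model inconsistency rather than a prompting problem.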

Future Directions

Given the widespread use of ChatGPT among patients, clinicians, trainees, and researchers, a deeper understanding of prompt development and optimization is critical to promote accurate and reliable responses. Research should systematically assess the influence of individual prompt components, such as the role designation, the level of background detail, the phrasing of the core statement, and the personality of responses. Comparative analyses of different prompt engineering frameworks could also be performed (eg, CRISPE vs CREATE [Context, Result, Explanation, Audience, Tone, and Edit]).7 Additionally, there may be complementary prompt techniques, such as retrieval-augmented generation (ie, integrating external databases or evidence within prompts), that may enhance existing frameworks.10 Whether studies examine elements of a prompt or compare frameworks, researchers should document and standardize testing conditions, including the version of ChatGPT used and the timing of queries, and should use multiple investigators to promote reproducibility and minimize bias. Once optimal prompt engineering strategies are identified, efforts should be made to develop guidelines on the safe and effective use of ChatGPT as an orthopedic clinical support tool.

Lastly, additional research is needed to compare the accuracy and reliability of other large language model chatbots. Although recent research has shown higher performance from ChatGPT-4 compared to Gemini by Google, Mistral-7B by Mistral AI, and Claude-3 by Anthropic, these findings focused on medical and surgical care and were limited to recommendations surrounding rotator cuff and anterior cruciate ligament injuries.6 Additional research is needed to compare the performance of large language model chatbots across health conditions, asking questions specific to orthopedic and sports physical therapy.

Conclusion

This Viewpoint asked, “Is the problem with ChatGPT or the way we use it?” While several studies challenge the accuracy of ChatGPT,1,2,4,6,8 we argue that the answer to this question is still unknown. By systematically evaluating and refining prompting strategies, future research can contribute to more accurate and reliable AI-assisted decision-making in orthopedics, leveraging a technology that has the potential to enhance patient care and clinical outcomes.

Key Points.

  • Orthopedic research on ChatGPT often lacks structured prompt engineering, resulting in variable reports of accuracy.

  • Frameworks, like CRISPE, can help structure prompts for clinicians, researchers, trainees, and patients.

  • Systematic research on prompt frameworks may help to advance ChatGPT as an orthopedic clinical support tool.

Acknowledgments

Dr Vorensky’s effort was supported by the National Institutes of Neurological Disorders and Stroke of the National Institutes of Health through the University of Michigan HEAL National K12 Clinical Pain Career Development Award (K12NS130673). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

The authors certify that they have no affiliations with or financial involvement in any organization or entity with a direct financial interest in the subject matter or materials discussed in the article.

PATIENT AND PUBLIC INVOLVEMENT: Patients, athletes, or public partners were not involved in the current research.

DATA SHARING:

There are no data available for this Viewpoint.

REFERENCES

  • 1.Bilika P, Stefanouli V, Strimpakos N, Kapreli EV. Clinical reasoning using ChatGPT: is it beyond credibility for physiotherapists use? Physiother Theory Pract. 2024;40:2943–2962. 10.1080/09593985.2023.2291656 [DOI] [PubMed] [Google Scholar]
  • 2.Gianola S, Bargeri S, Castellini G, et al. Performance of ChatGPT compared to clinical practice guidelines in making informed decisions for lumbosacral radicular pain: a cross-sectional study. J Orthop Sports Phys Ther. 2024;54:1–7. 10.2519/jospt.2024.12151 [DOI] [PubMed] [Google Scholar]
  • 3.Giray L. Prompt engineering with ChatGPT: a guide for academic writers. Ann Biomed Eng. 2023;51:2629–2633. 10.1007/s10439-023-03272-4 [DOI] [PubMed] [Google Scholar]
  • 4.Hao J, Yao Z, Tang Y, Remis A, Wu K, Yu X. Artificial intelligence in physical therapy: evaluating ChatGPT's role in clinical decision support for musculoskeletal care. Ann Biomed Eng. 2025;53:9–13. 10.1007/s10439-025-03676-4 [DOI] [PubMed] [Google Scholar]
  • 5.Nigh M. ChatGPT3 prompt engineering. GitHub. Available at: https://github.com/mattnigh/ChatGPT3-Free-Prompt-List. Accessed October 15, 2024. [Google Scholar]
  • 6.Nwachukwu BU, Varady NH, Allen AA, et al. Currently available large language models do not provide musculoskeletal treatment recommendations that are concordant with evidence-based clinical practice guidelines. Arthroscopy. 2025;41:263–275.e6. 10.1016/j.arthro.2024.07.040 [DOI] [PubMed] [Google Scholar]
  • 7.Ross A, McGrow K, Zhi D, Rasmy L. Foundation models, generative AI, and large language models: essentials for nursing. Comput Inform Nurs. 2024;42:377–387. 10.1097/CIN.0000000000001149 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Shrestha N, Shen Z, Zaidat B, et al. Performance of ChatGPT on NASS clinical guidelines for the diagnosis and treatment of low back pain. Spine. 2024;49:640–651. 10.1097/BRS.0000000000004915 [DOI] [PubMed] [Google Scholar]
  • 9.Wang M, Wang M, Xu X, Yang L, Cai D, Yin M. Unleashing ChatGPT’s power: a case study on optimizing information retrieval in flipped classrooms via prompt engineering. IEEE Trans Learn Technol. 2024;17:629–641. 10.1109/tlt.2023.3324714 [DOI] [Google Scholar]
  • 10.Zaghir J, Naguib M, Bjelogrlic M, Neveol A, Tannier X, Lovis C. Prompt engineering paradigms for medical applications: scoping review. J Med Internet Res. 2024;26:e60501. 10.2196/60501 [DOI] [PMC free article] [PubMed] [Google Scholar]
