Abstract
Introduction
The rapid advancement of artificial intelligence (AI), particularly the development of Large Language Models (LLMs) such as Generative Pretrained Transformers (GPTs), has revolutionized numerous fields. The purpose of this study is to investigate the application of LLMs to orthopaedic in-training examinations.
Methods
Questions from the 2020–2022 Orthopaedic In-Training Examinations (OITEs) were given to OpenAI's GPT-3.5 Turbo and GPT-4 LLMs using a zero-shot inference approach. Each model was given each multiple-choice question without prior exposure to similar queries, and its generated response was compared to the correct answer within each OITE. The models were evaluated on overall accuracy, performance on questions with and without media, and performance on first- and higher-order questions.
Results
The GPT-4 model outperformed the GPT-3.5 Turbo model across all years and question categories (2022: 67.63% vs. 50.24%; 2021: 58.69% vs. 47.42%; 2020: 59.53% vs. 46.51%). Both models showcased better performance with questions devoid of associated media, with GPT-4 attaining accuracies of 68.80%, 65.14%, and 68.22% for 2022, 2021, and 2020, respectively. GPT-4 outscored GPT-3.5 Turbo on first-order questions across all years (2022: 63.83% vs. 38.30%; 2021: 57.45% vs. 50.00%; 2020: 65.74% vs. 53.70%). GPT-4 also outscored GPT-3.5 Turbo on higher-order questions across all years (2022: 68.75% vs. 53.75%; 2021: 59.66% vs. 45.38%; 2020: 53.27% vs. 39.25%).
Discussion
GPT-4 showed improved performance compared to GPT-3.5 Turbo in all tested categories. The results reflect the potential and limitations of AI in orthopaedics. GPT-4's performance is comparable to that of a second-to-third-year resident, and GPT-3.5 Turbo's performance is comparable to that of a first-year resident, suggesting that current LLMs can neither pass the OITE nor substitute for orthopaedic training. This study sets a precedent for future endeavors integrating GPT models into orthopaedic education and underlines the necessity of specialized training of these models for specific medical domains.
Keywords: Artificial intelligence, Resident education, Orthopaedics, Orthopaedic in-service training exams
1. Introduction
The rapid advancement of artificial intelligence (AI), especially the development of Large Language Models (LLMs) such as Generative Pretrained Transformers (GPTs), has revolutionized numerous industries, including medical education.1,2 In particular, Chat Generative Pre-trained Transformer (ChatGPT) is an LLM developed by OpenAI to understand natural language, and it has exploded in popularity due to its accessibility and human-like responses.3 ChatGPT was initially released in the fall of 2022 with the GPT-3.5 model, which has 175 billion parameters. The size of an LLM, often measured by its number of parameters, significantly influences its performance. The parameters are the learned aspects of the model, helping it understand the nuances of language, context, and information representation. With a greater number of parameters, a model has more "learning capacity", allowing it to capture and reproduce the complexities of human language more effectively. This can lead to more accurate predictions and responses, making the model more useful across a broader range of tasks and industries. In comparison to earlier models, the parameter count of GPT-3.5 is significantly higher. For instance, GPT-2, the predecessor to GPT-3, had only 1.5 billion parameters. Each increase in parameters typically brings an improvement in performance, albeit with diminishing returns and increased computational and energy costs. OpenAI recently released GPT-4, which offers significant performance improvements over GPT-3.5. While an exact parameter count was not officially released, it is believed to have more than one trillion parameters,4,5 with recent rumors suggesting a collective parameter size of 1.76 trillion.6 The increased abilities of GPT-4 allow it to interpret images and solve more complex problems with greater accuracy and reasoning.4 Both GPT-3.5 and GPT-4 have a training cutoff of September 2021, meaning they have not been exposed to information after that date.
The advancement and popularity of ChatGPT have led investigators to explore its potential in education. Both GPT-4 and GPT-3.5 have demonstrated proficiency on a variety of exams such as Advanced Placement (AP) exams, the SAT, and the Bar.7 Recently, these models have also been applied to medical education and board examinations, which require not only advanced reasoning skills but also the application of domain-specific knowledge.8, 9, 10 Kung et al. demonstrated that GPT-3.5 nearly passed the United States Medical Licensing Examination (USMLE) Steps 1, 2, and 3.1 Further studies using GPT-4 demonstrated a 20% increase in scores across the three USMLE examinations.11 The improvement across versions can also be seen in fields outside of medicine. For example, GPT-4 successfully passed the Bar exam, whereas GPT-3.5 was unable to pass.12,13 With the knowledge that ChatGPT was able to pass graduate-level standardized exams, several medical specialties have tested ChatGPT on their specialty-specific exams with varying success.1,14,15 In a comparison of GPT-4 and GPT-3.5, researchers discovered that both models were able to pass the neurosurgery written board examination, while only GPT-4 was able to pass the ophthalmology written board examination.16,17
To our knowledge, ChatGPT's performance on the written Orthopaedic Boards Examination (American Board of Orthopaedic Surgery (ABOS) Part I) or the Orthopaedic In-Training Examination (OITE) has not been documented in the literature. The OITE is a standardized test administered by the American Academy of Orthopaedic Surgeons (AAOS) to all orthopaedic surgery residents in the United States. Its purpose is to assess the scope of knowledge of orthopaedic surgery residents throughout their training.18 The purpose of this study is to assess and compare the performance of the two latest GPT models across several years of the OITE.
2. Methods
Our dataset comprised questions from the OITE for the years 2020, 2021, and 2022, accessed through the AAOS ResStudy question bank.19 The test administrator classified each question into one of ten orthopaedic categories: Adult Reconstruction, Basic Science, Foot and Ankle, Hand and Wrist, Musculoskeletal Tumors and Diseases, Pediatric Orthopaedics, Shoulder and Elbow, Spine, Sports Medicine, and Trauma.
We employed two transformer-based language models developed by OpenAI, GPT-3.5-turbo-0301 (GPT-3.5 Turbo) and GPT-4-0314 (GPT-4), accessed through the OpenAI Application Programming Interface (API). GPT-3.5 Turbo is a version of GPT-3.5 optimized for cost-effectiveness.3 These models were given a sequence of OITE questions and tasked to select the correct answer from a list of choices. Each API call consisted of a system message and a user message. The system message sets the context for the conversation and guides the response behavior.20 Our system message was "You are a knowledgeable orthopaedic AI trained to answer multiple-choice questions accurately. Here is a question for you. Respond with just the answer choice and nothing else." Each question was then given to each respective model as a separate call (with no knowledge of prior questions), as sketched below.
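A minimal sketch of one such zero-shot call is shown below, assuming the openai Python package's v0.x ChatCompletion interface that was current at the time of the study; the `ask_model` helper and the question text are illustrative placeholders, not actual study code or OITE content:

```python
# Minimal sketch of a single zero-shot API call (assumes the openai
# Python package's v0.x ChatCompletion interface).
import openai

openai.api_key = "sk-..."  # placeholder; supply your own API key

SYSTEM_MESSAGE = (
    "You are a knowledgeable orthopaedic AI trained to answer "
    "multiple-choice questions accurately. Here is a question for you. "
    "Respond with just the answer choice and nothing else."
)

def ask_model(model: str, question_text: str) -> str:
    """Send one multiple-choice question to the model and return its chosen answer."""
    response = openai.ChatCompletion.create(
        model=model,  # "gpt-3.5-turbo-0301" or "gpt-4-0314"
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": question_text},
        ],
    )
    # The generated text is nested under choices[0].message.content.
    return response["choices"][0]["message"]["content"].strip()
```

Because every question is sent as its own call, the model retains no memory of earlier questions; grading then reduces to comparing the returned string with the answer key (e.g., `ask_model("gpt-4-0314", question) == correct_choice`).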
Each API call returned a response object containing the model's generated answer. The choice selected by the model was then compared with the correct answer to determine whether the prediction was accurate. The performance of the models was evaluated across the different categories of questions and for each year. Images or media associated with the questions were not included, as the publicly released iterations of the models at the time did not support image input, and any textual description of the images would likely unfairly bias the AI in its answer selection.
The overall performance of each model was evaluated by computing the proportion of correctly answered questions out of the total number of questions. This computation was performed for each year and each category of questions. Furthermore, two authors (NC and MR) manually and independently categorized all questions into first-order or higher-order questions. Conflicting classifications were resolved by a third independent author (DC). First-order questions were defined as those involving factual recall. Higher-order questions were defined as those that required intermediary steps to arrive at the correct answer choice.
All analyses and API calls were conducted using Python 3.10.7 with libraries numpy 1.23.5 and pandas 2.0.0 for data management, and matplotlib 3.7.1 and seaborn 0.12.2 for data visualization.
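As an illustration of the scoring described above, the sketch below aggregates graded responses by year, category, media presence, and question order with pandas; the DataFrame columns and sample rows are hypothetical stand-ins, not the study's actual data:

```python
import pandas as pd

# Hypothetical grading records: one row per (question, model) pair.
# "correct" records whether the model's returned choice matched the answer key.
records = pd.DataFrame({
    "year":      [2022, 2022, 2021, 2021],
    "model":     ["gpt-4-0314", "gpt-3.5-turbo-0301", "gpt-4-0314", "gpt-3.5-turbo-0301"],
    "category":  ["Spine", "Spine", "Trauma", "Trauma"],
    "order":     ["higher", "higher", "first", "first"],
    "has_media": [False, False, True, True],
    "correct":   [True, False, True, False],
})

# Overall accuracy = proportion of correctly answered questions, per year and model.
overall = records.groupby(["year", "model"])["correct"].mean()

# The same proportion broken out by category, media presence, and question order.
by_category = records.groupby(["year", "model", "category"])["correct"].mean()
by_media    = records.groupby(["year", "model", "has_media"])["correct"].mean()
by_order    = records.groupby(["year", "model", "order"])["correct"].mean()

print(overall, by_media, by_order, sep="\n\n")
```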
3. Results
A total of 215, 213, and 207 questions from the 2020, 2021, and 2022 OITE, respectively, were administered to both GPT-3.5 Turbo and GPT-4. Each of the ten categories had a varying number of questions among different years (Table 1).
Table 1.
Question categories from the Orthopaedic In-Training Examination, 2020–2022.
| Category | 2022 | 2021 | 2020 |
|---|---|---|---|
| Adult Reconstruction | 34 | 25 | 23 |
| Basic Science | 28 | 21 | 28 |
| Foot and Ankle | 15 | 22 | 19 |
| Hand and Wrist | 20 | 21 | 13 |
| Musculoskeletal Tumors and Diseases | 14 | 19 | 21 |
| Pediatric Orthopaedics | 32 | 31 | 23 |
| Shoulder and Elbow | 14 | 20 | 14 |
| Spine | 23 | 16 | 31 |
| Sports Medicine | 5 | 10 | 17 |
| Trauma | 22 | 28 | 26 |
| Total | 207 | 213 | 215 |
Overall, GPT-4 outperformed GPT-3.5 Turbo across all years and question categories. However, the models' performances varied across question categories throughout the years (Fig. 1, Fig. 2, Fig. 3). In 2022, GPT-4 achieved an overall accuracy of 67.63%, contrasting with GPT-3.5 Turbo's 50.24%. Comparable patterns were observed in 2021 (GPT-4: 58.69%, GPT-3.5 Turbo: 47.42%) and 2020 (GPT-4: 59.53%, GPT-3.5 Turbo: 46.51%) (Fig. 4). Both models performed better on questions without associated media, with GPT-4 attaining higher accuracies across all three years tested: 2022 (GPT-4: 68.80%, GPT-3.5 Turbo: 52.80%), 2021 (GPT-4: 65.14%, GPT-3.5 Turbo: 52.29%), and 2020 (GPT-4: 68.22%, GPT-3.5 Turbo: 51.94%). With regard to questions with associated media, GPT-4 also maintained higher accuracies across all three years tested: 2022 (GPT-4: 65.85%, GPT-3.5 Turbo: 46.34%), 2021 (GPT-4: 51.92%, GPT-3.5 Turbo: 42.31%), and 2020 (GPT-4: 46.51%, GPT-3.5 Turbo: 38.37%) (Fig. 5).
Fig. 1.
Accuracy by category between GPT-4 and GPT-3.5 Turbo for 2022 questions.
Fig. 2.
Accuracy by category between GPT-4 and GPT-3.5 Turbo for 2021 questions.
Fig. 3.
Accuracy by category between GPT-4 and GPT-3.5 Turbo for 2020 questions.
Fig. 4.
Overall accuracy between GPT-4 and GPT-3.5 Turbo in all years tested.
Fig. 5.
Accuracy on questions without and with media between GPT-4 and GPT-3.5 Turbo in all years tested.
In terms of accuracy by question order, GPT-4 outscored GPT-3.5 Turbo on first-order questions in all three years: 2022 (GPT-4: 63.83%, GPT-3.5 Turbo: 38.30%), 2021 (GPT-4: 57.45%, GPT-3.5 Turbo: 50.00%), and 2020 (GPT-4: 65.74%, GPT-3.5 Turbo: 53.70%). GPT-4 also outscored GPT-3.5 Turbo on higher-order questions in all three years: 2022 (GPT-4: 68.75%, GPT-3.5 Turbo: 53.75%), 2021 (GPT-4: 59.66%, GPT-3.5 Turbo: 45.38%), and 2020 (GPT-4: 53.27%, GPT-3.5 Turbo: 39.25%). Neither model was consistently more accurate on first-order or higher-order questions across the three years (Table 2).
Table 2.
Performance of GPT-4 and GPT-3.5 Turbo on first- and higher-order questions.
| Year | Model | Overall Accuracy | First-Order Accuracy | Higher-Order Accuracy |
|---|---|---|---|---|
| 2022 | | n = 207 | n = 47 | n = 160 |
| | GPT-4 | 67.63% | 63.83% | 68.75% |
| | GPT-3.5 Turbo | 50.24% | 38.30% | 53.75% |
| 2021 | | n = 213 | n = 94 | n = 119 |
| | GPT-4 | 58.69% | 57.45% | 59.66% |
| | GPT-3.5 Turbo | 47.42% | 50.00% | 45.38% |
| 2020 | | n = 215 | n = 108 | n = 107 |
| | GPT-4 | 59.53% | 65.74% | 53.27% |
| | GPT-3.5 Turbo | 46.51% | 53.70% | 39.25% |
4. Discussion
Our study investigates and compares the accuracy of GPT-4 and GPT-3.5 Turbo on the OITE from 2020 to 2022. Notably, we found that GPT-4 exceeded GPT-3.5 Turbo in overall accuracy and on questions both with and without associated media. The difference between the two models' performance could be attributed to GPT-4's larger parameter count, given that both models were trained on an identical dataset. Moreover, the enhanced reasoning capabilities of GPT-4, stemming from its larger size and improved architecture, likely contribute to its superior performance. Interestingly, we noted that both GPT models exhibited poorer performance on questions accompanied by media compared to questions lacking any associated media. This outcome is to be expected, as orthopaedic knowledge relies heavily on the interpretation of imaging studies. While our study demonstrates that GPT-4 outperforms GPT-3.5 Turbo even without the aid of associated media, we hypothesize that GPT-4's potential to process image inputs might give it an additional edge over GPT-3.5 Turbo should associated media be incorporated. However, this hypothesis remains untested because the image-capable model of GPT-4 was not publicly accessible at the time of our study. This presents an intriguing direction for future research once such a model becomes publicly available.
In terms of accuracy, our results suggest that neither GPT-4 nor GPT-3.5 Turbo would have passed the OITE in these years. According to the OITE technical reports for 2022, 2021, and 2020, passing scores for the American Board of Orthopaedic Surgery Part I licensing exam were 68.6%, 69.2%, and 63.0%, respectively.21, 22, 23 Furthermore, we found that GPT-4 scored at a level between residents in their second and third years.21, 22, 23 In comparison, GPT-3.5 Turbo scored at the level of residents in their first year. These findings suggest that current LLMs, although impressive in their baseline medical knowledge, do not substitute for the rigorous and thorough training residents receive during orthopaedic residency.
Interestingly, other specialties have found different results when comparing the accuracy of GPT-4 and GPT-3.5 on their specialty-specific board examinations. GPT-4, but not GPT-3.5, passed the ophthalmology board exam without being provided any media.17 In addition, GPT-3.5 was unable to pass the Plastic Surgery In-Service Examination or the American Urological Association's Self-Assessment Study Program.14, 15 One might assume that GPT-3.5 is not well equipped for these exams because they are more specialized than an exam like the USMLE and GPT-3.5 is not as capable as GPT-4. However, it has been reported that both GPT-4 and GPT-3.5 were able to pass the neurosurgery written board examination, another highly subspecialized exam.16 A possible explanation is that the neurosurgery board exam may contain more questions with shorter prompts or more lower-order questions, as it has been demonstrated that GPT-3.5 struggles with longer questions and higher-order questions.16,17 We also demonstrate that GPT-4 is able to outscore GPT-3.5 Turbo on both first-order and higher-order questions, substantiating GPT-4's improved performance over GPT-3.5 Turbo on subspecialty exams.16,17 However, we found that GPT-3.5 Turbo is occasionally more accurate on higher-order questions than on first-order questions, and GPT-4 is occasionally more accurate on first-order questions, which conflicts with literature testing the performance of both GPTs on different classes of questions.16,17 The inconsistencies between the two GPT models in accuracy by question order, and their varying success on subspecialized exams, warrant further investigation before these models become further integrated into the medical field.
While these models demonstrated the capacity to understand the fundamentals of orthopaedics, their integration into the field may be limited because the publicly released versions do not support image processing. This is a significant limitation, as orthopaedics is a field that depends heavily on visualization and imaging to diagnose and treat patients. In addition, the overall accuracies achieved by both models in our study are not acceptable for patient care. Thus, orthopaedic-related questions from patients, or orthopaedic-related answers from ChatGPT, should be directed to an orthopaedic surgeon before any medical decision-making. Another limitation of this study is that AI models are continually updated to access more information and incorporate more parameters; if this study were repeated with the same methodology on updated models, the results might differ. We accounted for this variability by collecting all data over a single day.
5. Conclusion
In conclusion, GPT-4 exhibited superior performance to GPT-3.5 Turbo on the OITE, reflecting the potential and limitations of AI in orthopaedics. Both LLMs handled questions without associated media better, which is unsurprising, as a significant amount of orthopaedic information is contained within imaging studies and this additional information was not given to the models. This study sets a precedent for future endeavors integrating GPT models into orthopaedic education and underlines the necessity of specialized training of these models for specific medical domains.
Author contributions
All authors participated in the study and helped shape the research question, data, analysis, and manuscript.
Conflicts of interest and Source of funding
The authors, their immediate family, and any research foundation with which they are affiliated did not receive any financial payments or other benefits from any commercial entity related to the subject of this article. There are no relevant disclosures. We have no conflicts of interest. The manuscript submitted does not contain information about medical device(s)/drug(s). All authors significantly contributed to the document and have reviewed the final manuscript.
Declaration of competing interest
None.
Acknowledgements
None.
References
- 1. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2. doi:10.1371/journal.pdig.0000198.
- 2. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9. doi:10.2196/45312.
- 3. OpenAI. Introducing ChatGPT. Available at: https://openai.com/blog/chatgpt
- 4. OpenAI. GPT-4 technical report. Available at: https://openai.com/research/gpt-4
- 5. Lubbad M. The ultimate guide to GPT-4 parameters: everything you need to know about NLP's game-changer. Available at: https://medium.com/@mlubbad/the-ultimate-guide-to-gpt-4-parameters-everything-you-need-to-know-about-nlps-game-changer-109b8767855a
- 6. Wang S. Commoditizing the petaflop — with George Hotz of the tiny corp. Available at: https://www.latent.space/p/geohot#details
- 7. OpenAI. GPT-4. Available at: https://openai.com/research/gpt-4
- 8. Lee H. The rise of ChatGPT: exploring its potential in medical education. Anat Sci Educ. 2023. doi:10.1002/ase.2270.
- 9. Mogali SR. Initial impressions of ChatGPT for anatomy education. Anat Sci Educ. 2023. doi:10.1002/ase.2261.
- 10. Johnson D, Goodman R, Patrinely J, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. 2023.
- 11. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. 2023. arXiv:2303.13375.
- 12. Bommarito MJ, Katz DM. GPT takes the Bar exam. Social Science Research Network; 2023.
- 13. Katz DM, Bommarito MJ, Gao S, Arredondo P. GPT-4 passes the Bar exam. Social Science Research Network; 2023.
- 14. Humar P, Asaad M, Bengur FB, Nguyen V. ChatGPT is equivalent to first year plastic surgery residents: evaluation of ChatGPT on the plastic surgery in-service exam. Aesthetic Surg J. 2023. doi:10.1093/asj/sjad130.
- 15. Skalidis I, Cagnina A, Luangphiphat W, et al. ChatGPT takes on the European Exam in Core Cardiology: an artificial intelligence success story? Eur Heart J Digit Health. 2023;4:279–281. doi:10.1093/ehjdh/ztad029.
- 16. Ali R, Tang OY, Connolly ID, et al. Performance of ChatGPT and GPT-4 on neurosurgery written board examinations. medRxiv. 2023:2023.03.25.23287743. doi:10.1227/neu.0000000000002632.
- 17. Lin JC, Younessi DN, Kurapati SS, Tang OY, Scott IU. Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination. Eye. 2023. doi:10.1038/s41433-023-02564-2.
- 18. AAOS. Orthopaedic In-Training Examination (OITE). Available at: https://www.aaos.org/education/about-aaos-products/orthopaedic-in-training-examination-oite/
- 19. AAOS. ResStudy - Orthopaedic Exam Question Bank. Available at: https://www.aaos.org/education/examinations/ResStudy/
- 20. OpenAI. Chat completions API. Available at: https://platform.openai.com/docs/guides/gpt/chat-completions-api
- 21. AAOS. Orthopaedic In-Training Examination (OITE) technical report 2022. Available at: https://www.aaos.org/globalassets/education/product-pages/oite/oite-2022-technical-report-20230125.pdf
- 22. AAOS. Orthopaedic In-Training Examination (OITE) technical report 2021. Available at: https://www.aaos.org/globalassets/education/product-pages/oite/oite-2021-technical-report.pdf
- 23. AAOS. Orthopaedic In-Training Examination (OITE) technical report 2020. Available at: https://www.aaos.org/globalassets/education/product-pages/oite/oite-2020-technical-report_website.pdf





