Can generative artificial intelligence pass the orthopaedic board examination?

Ula N Isleem; Bashar Zaidat; Renee Ren; Eric A Geng; Aonnicha Burapachaisri; Justin E Tang; Jun S Kim; Samuel K Cho

doi:10.1016/j.jor.2023.10.026

2023 Nov 5;53:27–33. doi: 10.1016/j.jor.2023.10.026

PMCID: PMC10912220 PMID: 38450060

Abstract

Background

Resident training programs in the US use the Orthopaedic In-Training Examination (OITE) developed by the American Academy of Orthopaedic Surgeons (AAOS) to assess the current knowledge of their residents and to identify the residents at risk of failing the American Board of Orthopaedic Surgery (ABOS) examination. Optimal strategies for OITE preparation are constantly being explored. There may be a role for Large Language Models (LLMs) in orthopaedic resident education. ChatGPT, an LLM launched in late 2022, has demonstrated the ability to produce accurate, detailed answers, potentially enabling it to aid in medical education and clinical decision-making. The purpose of this study is to evaluate the performance of ChatGPT on Orthopaedic In-Training Examinations using Self-Assessment Exams from the AAOS database and approved literature as a proxy for the Orthopaedic Board Examination.

Methods

301 SAE questions from the AAOS database and associated AAOS literature were input into ChatGPT's interface in a question and multiple-choice format and the answers were then analyzed to determine which answer choice was selected. A new chat was used for every question. All answers were recorded, categorized, and compared to the answer given by the OITE and SAE exams, noting whether the answer was right or wrong.

Results

Of the 301 questions asked, ChatGPT was able to correctly answer 183 (60.8%) of them. The subjects with the highest percentage of correct questions were basic science (81%), oncology (72.7%), shoulder and elbow (71.9%), and sports (71.4%). The questions were further subdivided into 3 groups: those about management, diagnosis, or knowledge recall. There were 86 management questions and 47 were correct (54.7%), 45 diagnosis questions with 32 correct (71.1%), and 168 knowledge recall questions with 102 correct (60.7%).

Conclusions

ChatGPT has the potential to provide orthopedic educators and trainees with accurate clinical conclusions for the majority of board-style questions, although its reasoning should be carefully analyzed for accuracy and clinical validity. As such, its usefulness in a clinical educational context is currently limited but rapidly evolving.

Clinical relevance

ChatGPT can access a multitude of medical data and may help provide accurate answers to clinical questions.

1. Introduction

ChatGPT is a large language model (LLM) that was launched as a prototype on November 30, 2022.¹ The model was developed using a large amount of text data from the internet and is unique in that it was trained to produce answers that are both accurate and natural sounding. This language model has gained mainstream popularity for its detailed and seemingly “human-like” answers. The chatbot also recalls previous prompts in the same chat, allowing for a more detailed “conversation” on a certain topic. As a result, the potential for these artificial intelligence models to assist in medical education, and possibly clinical decision-making, has attracted attention.

The performance of LLMs in the medical field has been explored for improving aspects of patient care, such as diagnosis, imaging analysis, and individualized treatment methods, as well as patient education.²,³ However, the role of LLMs, particularly ChatGPT, in medical education and in the assessment of medical students and residents is a topic of debate. The performance of ChatGPT on the United States Medical Licensing Exam (USMLE) Step 1 and Step 2 has recently been evaluated; the model achieved the 60% threshold on a commonly utilized dataset, demonstrating its potential to ‘pass’ the USMLE at the level of a third-year medical student.⁴ Furthermore, ChatGPT was tested on the Ophthalmic Knowledge Assessment Program, which is administered to test the knowledge of ophthalmology residents, and performed at the level of an average first-year resident.⁵

The medical knowledge of residents enrolled in orthopaedic residency training programs is conventionally assessed by standardized examinations. The American Board of Orthopaedic Surgery (ABOS) written Part I exam and oral Part II exam must be passed for an orthopaedic surgeon to be certified. Resident training programs in the US use the Orthopaedic In-Training Examination (OITE) developed by the American Academy of Orthopaedic Surgeons (AAOS) to assess the current knowledge of their residents and to identify the residents at risk of failing the ABOS examinations. For these reasons, identifying the optimal strategies to prepare for the OITE is of the utmost importance to orthopaedic residents.

This study aims to evaluate the performance of ChatGPT, a generative AI model, on Orthopaedic In-Training Examination (OITE) and Self-Assessment Examination (SAE) questions developed by the American Academy of Orthopaedic Surgeons (AAOS), as a proxy for the American Board of Orthopaedic Surgery (ABOS) examination.

2. Methods

A total of 301 orthopedic board exam-style questions were obtained from past Orthopaedic In-Training Examinations (OITEs) and Self-Assessment Exams (SAEs) dedicated to preparation for the ABOS Part I and II. Questions with associated images necessary for answering them were omitted, as ChatGPT does not have the capability of analyzing images. Questions were then entered into ChatGPT's interface in a question and multiple-choice format, and the answers were analyzed to determine which answer choice was selected (see Fig. 1, Fig. 2). A new chat was used for every question because ChatGPT can alter its answers based on previous questions in the same chat. All answers were recorded, categorized, and compared to the answer given by the OITE and SAE exams, noting whether the answer was right or wrong.

Fig. 1 — ChatGPT responding to a question correctly and justification of response.

Fig. 2 — ChatGPT responding to a question incorrectly and justification of response.
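The workflow described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' procedure: the study entered questions into the ChatGPT interface and read the answers by hand. The names `format_prompt`, `extract_choice`, `grade`, and `stub_model` are hypothetical, the letter-matching rule is an assumed heuristic, and `ask_model` stands in for whatever opens a fresh chat per question.

```python
import re
from typing import Callable, Dict, List, Optional


def format_prompt(stem: str, choices: Dict[str, str]) -> str:
    """Render one item in a question-plus-multiple-choice format."""
    lines = [stem.strip()]
    for letter in sorted(choices):
        lines.append(f"{letter}. {choices[letter]}")
    return "\n".join(lines)


def extract_choice(response: str, letters: str = "ABCDE") -> Optional[str]:
    """Return the first standalone choice letter followed by '.' or ')'.

    Assumed heuristic for illustration; free-text answers may need manual review.
    """
    match = re.search(rf"\b([{letters}])[.)]", response)
    return match.group(1) if match else None


def grade(questions: List[dict], ask_model: Callable[[str], str]) -> List[dict]:
    """Ask each question independently and compare with the answer key."""
    records = []
    for q in questions:
        # ask_model is assumed to start a new chat on every call, so that
        # earlier questions cannot influence later answers.
        response = ask_model(format_prompt(q["stem"], q["choices"]))
        chosen = extract_choice(response)
        records.append({"id": q["id"], "chosen": chosen,
                        "correct": chosen == q["answer"]})
    return records


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for the chatbot; always gives the same reply."""
    return "The most likely condition is C. Spondylolysis."


demo_questions = [{
    "id": 1,
    "stem": "The symptoms are most likely related to which condition?",
    "choices": {"A": "Lumbar herniated disc", "B": "Spinal stenosis",
                "C": "Spondylolysis", "D": "Degenerative spondylolisthesis"},
    "answer": "C",
}]
demo_records = grade(demo_questions, stub_model)
print(demo_records)
```

Keeping the model call behind a plain callable keeps the grading logic independent of any particular chatbot interface.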

3. Results

Of the 301 questions asked, ChatGPT was able to correctly answer 183 (60.8%) of them. 53 questions were related to spine with 33 (62.3%) of them correctly answered. There were 34 hip and knee questions (18 correct, 52.9%), 32 shoulder and elbow questions (23 correct, 71.9%), 32 trauma questions (17 correct, 53.1%), 31 foot and ankle questions (19 correct, 61.3%), 29 pediatrics questions (12 correct, 41.4%), 22 oncology questions (16 correct, 72.7%), 21 basic science questions (17 correct, 81.0%), 21 sports questions (15 correct, 71.4%), 16 hand questions (8 correct, 50.0%), and 10 other questions (5 correct, 50.0%) (Table 1). The accuracy of ChatGPT based on the subspecialty is demonstrated in Fig. 3.

Table 1.

ChatGPT performance by subject.

Subject, n	Correct, n (%)	Incorrect, n (%)
All (n=301)	183 (60.8)	118 (39.2)
Spine (n=53)	33 (62.3)	20 (37.7)
Hip and Knee (n=34)	18 (52.9)	16 (47.1)
Shoulder and Elbow (n=32)	23 (71.9)	9 (28.1)
Trauma (n=32)	17 (53.1)	15 (46.9)
Foot and Ankle (n=31)	19 (61.3)	12 (38.7)
Pediatrics (n=29)	12 (41.4)	17 (58.6)
Oncology (n=22)	16 (72.7)	6 (27.3)
Basic Science (n=21)	17 (81.0)	4 (19.0)
Sports (n=21)	15 (71.4)	6 (28.6)
Hand (n=16)	8 (50.0)	8 (50.0)
Other (n=10)	5 (50.0)	5 (50.0)
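The percentages in Table 1 follow directly from the counts. As a minimal sketch, the snippet below recomputes them from the table's numbers; the names `TABLE_1` and `accuracy_pct` are illustrative and not part of the study.

```python
# (questions asked, answered correctly) per subject, transcribed from Table 1.
TABLE_1 = {
    "Spine": (53, 33),
    "Hip and Knee": (34, 18),
    "Shoulder and Elbow": (32, 23),
    "Trauma": (32, 17),
    "Foot and Ankle": (31, 19),
    "Pediatrics": (29, 12),
    "Oncology": (22, 16),
    "Basic Science": (21, 17),
    "Sports": (21, 15),
    "Hand": (16, 8),
    "Other": (10, 5),
}


def accuracy_pct(correct: int, asked: int) -> float:
    """Percentage of correct answers, rounded to one decimal place."""
    return round(100 * correct / asked, 1)


total_asked = sum(asked for asked, _ in TABLE_1.values())
total_correct = sum(correct for _, correct in TABLE_1.values())
overall = accuracy_pct(total_correct, total_asked)

# Subjects ordered from highest to lowest accuracy.
ranked = sorted(TABLE_1, key=lambda s: TABLE_1[s][1] / TABLE_1[s][0], reverse=True)

for subject in ranked:
    asked, correct = TABLE_1[subject]
    print(f"{subject:<20}{correct:>3}/{asked:<3} {accuracy_pct(correct, asked):5.1f}%")
print(f"{'All':<20}{total_correct:>3}/{total_asked:<3} {overall:5.1f}%")
```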

Fig. 3 — Accuracy of ChatGPT answers based on question subject

The questions were further subdivided into 3 groups: those about management, diagnosis, or knowledge recall. There were 86 management questions and 47 were correct (54.7%), 45 diagnosis questions with 32 correct (71.1%), and 168 knowledge recall questions with 102 correct (60.7%) (Table 2).

Table 2.

Breakdown by type of question.

Question type, n	Correct, n (%)	Incorrect, n (%)
Management (n=86)	47 (54.7%)	39 (45.3%)
Diagnosis (n=45)	32 (71.1%)	13 (28.9%)
Knowledge Recall (n=168)	102 (60.7%)	66 (39.3%)
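Each percentage in Table 2 is the number of correct answers divided by the number of questions of that type; for example, 32 of 45 diagnosis questions is 71.1%. The short sketch below recomputes the three values from the table's counts. The names `TABLE_2` and `type_pct` are illustrative only.

```python
# (questions asked, answered correctly) per question type, from Table 2.
TABLE_2 = {
    "Management": (86, 47),
    "Diagnosis": (45, 32),
    "Knowledge Recall": (168, 102),
}

# Accuracy per question type, rounded to one decimal place.
type_pct = {
    name: round(100 * correct / asked, 1)
    for name, (asked, correct) in TABLE_2.items()
}

for name, pct in type_pct.items():
    print(f"{name}: {pct}%")
```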

For each of the answers, the language model provided its reasoning, whether or not the answer was correct. A sample of questions, answers, and bot reasoning is shown in Table 3. For the incorrectly answered question, the bot assumes that a leg length discrepancy in a 6-year-old boy with a history of femoral osteomyelitis is most likely due to a postinfectious growth pattern disruption. However, according to the answer key, the most predictable growth pattern disruption is posttraumatic.

Table 3.

Questions and their correct answers vs the answers and reasoning of ChatGPT.

Specialty	Question	Choices	Correct Answer	Bot Answer	Bot Reasoning
Hip and Knee	Of all the clinical scenarios listed below, which one provides an absolute diagnosis of a periprosthetic joint infection?	A) A painful TKA with a serum CRP of 1.8 (0–0.8 normal range) and an ESR of 24 (0–20 normal range) B) A swollen TKA with a mild effusion with growth of Staph Epi on Broth Culture medium C) A draining sinus located at the posterior corner of the knee which upon probing communicates to the knee prosthesis D) A stiff, painful TKA with a positive Technetium bone scan that shows activity around the femoral and tibial components on all 3 phases of the scan E) A painful TKA with an aspiration WBC of 2400 WBC/dl with 60% Neutrophils, a serum CRP of 1.0 (0–0.8 normal range), and an ESR of 18 (0–20 normal range)	C) A draining sinus located at the posterior corner of the knee which upon probing communicates to the knee prosthesis	C) A draining sinus located at the posterior corner of the knee which upon probing communicates to the knee prosthesis provides an absolute diagnosis of a periprosthetic joint infection.	The presence of a draining sinus that communicates with the knee prosthesis is a strong indicator of periprosthetic joint infection. This is known as a sinus tract or fistula, and it is often associated with an infection in the joint. It allows for a direct pathway for infection to exit the joint and is a definitive sign of infection in this context.
Spine	A 15-year-old high school football linebacker is seen in your office for the first time. He has been complaining of pain in his lower back for the past 6 months. Initially it was mild and only present at the end of games. Recently it has begun occurring more frequently. He denies weakness or numbness in his legs but has experienced pain in the left posterior buttock when his pain is most intense. He has not had any treatment to date. The symptoms experienced by this patient are most likely related to which of the following conditions?	A. Lumbar herniated disc B. Spinal stenosis C. Spondylolysis D. Degenerative spondylolisthesis	C. Spondylolysis.	Based on the information provided, the most likely condition that this 15-year-old high school football linebacker is experiencing is C. Spondylolysis.	Spondylolysis is a condition characterized by a stress fracture or defect in the pars interarticularis of the vertebral arch, most commonly occurring in the lower lumbar spine. It often presents with lower back pain, which can be exacerbated during physical activities, especially those that involve extension and rotation of the spine, such as football. The pain may radiate to the buttock or leg, and it tends to worsen with activity. The other conditions mentioned (A. Lumbar herniated disc, B. Spinal stenosis, and D. Degenerative spondylolisthesis) may also cause lower back pain, but given the patient's age, activity level, and the nature of the pain, spondylolysis is the most likely cause. However, a comprehensive evaluation by a healthcare professional, including imaging studies, would be needed to confirm the diagnosis and rule out other potential causes.
Pediatrics	A 6-year-old boy with leg length difference presents to your clinic. He has a history of right distal femoral osteomyelitis at the age of 3. Full length radiographs demonstrate the left femur is 1.5 cm longer than the right. What is the most predictable growth pattern disruption?	A. Postinfectious B. Hemihypertrophy C. Proximal femoral focal deficiency D. Fibular hemimelia E. PostTraumatic	E. PostTraumatic	A. Postinfectious	The history of right distal femoral osteomyelitis at the age of 3 suggests that the growth plate (physis) of the right distal femur may have been affected during the infection. This can lead to postinfectious growth plate disturbance, resulting in a leg length difference where one limb is shorter than the other. The full-length radiograph demonstrates that the left femur is longer than the right, which is consistent with the concept of growth plate disruption causing relative limb length inequality. The other conditions listed (B. Hemihypertrophy, C. Proximal femoral focal deficiency, D. Fibular hemimelia, and E. Post-traumatic) can also lead to limb length discrepancies, but in this case, the history of osteomyelitis suggests that postinfectious growth plate disruption is the most likely cause.

4. Discussion

The past decade has seen an exponential increase in the usage of AI in medicine. Models have been developed to interpret chest radiographs, aid in the diagnosis of colon polyps, and predict adverse outcomes.⁶⁻⁹ Given the nature of orthopedic surgery, computer vision has garnered great interest. Examples include automated lumbar lordosis measurement, fracture detection, and tissue segmentation.¹⁰⁻¹² Another popular domain has been risk prediction, such as predicting complications and mortality after knee or hip replacement or postoperative delirium.¹³,¹⁴ Textual analysis has been less common. Karhade et al. developed a natural language processing model to detect incidental durotomy from operative notes,¹⁵ while Tang et al. developed a sentiment analysis model for spine surgeon reviews.¹⁶

ChatGPT is a powerful generative text AI that has garnered widespread attention in the scientific community and mainstream media. The software has shown impressive results in generating responses and documentation on a seemingly infinite number of topics, to such an extent that its output may be indistinguishable from human work. Few studies have attempted to examine the performance of ChatGPT in orthopedic disciplines. Kim et al. assessed the ability of the algorithm to provide basic information on the presentation and management of shoulder impingement syndrome (SIS).¹⁷ Bernstein et al. posed the following question to ChatGPT: “What might I ask my patients to confirm that they truly understand their decision to have a total hip arthroplasty?” He found the algorithm provided a better answer than most students. However, he also noted that while the algorithm may provide factual information, it has yet to prove the ability to apply and manipulate information to the degree of an orthopedic surgeon.¹⁸ The current literature on the utility of ChatGPT for the orthopedic surgeon is lacking, and most studies have been qualitative discussions. There is a clear necessity for further validation in the orthopedic specialty.

In the current study, the model performed with an overall accuracy of 60.8%, which is approximately the level of an intern to junior resident. Among sub-specialties, the model performed best in basic science, oncology, shoulder and elbow, and sports. The worst performing specialties included pediatrics and hand. There is no direct method to definitively ascertain reasons for better or worse performance in each of the subspecialties. Previous studies have noted that resident performance on OITE shoulder and elbow questions was minimally affected by question type.¹⁹ However, questions that require higher level cognition, such as those testing diagnosis and interpretation, have increased in fields including pediatric orthopedics, whereas questions involving simple recall have decreased.²⁰ Murphy et al. found that the pediatric orthopedic section of the OITE has significantly increased the number of questions requiring advanced problem-solving and treatment protocols, and reduced the number testing simple knowledge.²¹ These types of complex questions may require more intuitive clinical knowledge than knowledge of searchable journal or textbook materials alone, which may explain the poor performance of ChatGPT.

Similarly, for adult reconstruction and hand questions, prior analyses found significantly more complex multistep questions regarding treatment and clinical management, and fewer one-step knowledge recall questions.²²,²³ Specifically within hand surgery board questions, there has been an increase in clinical management scenarios and questions requiring a higher level of evidence.²⁴

On the other hand, basic science had the highest percentage of knowledge recall questions, at 89.7%,²⁵ which matches the high rate of ChatGPT's correct answers in our study. It is important to consider that clinical decision-making is not always black and white. Standardized exams attempt to capture universal guidelines. However, considerable variation in medical management may occur depending on specific patient factors, resource availability, or institutional policy. Therefore, ChatGPT may perform better on questions requiring objective recall, but struggle in areas requiring intuition or application. Given that ChatGPT is trained on a large corpus of text data, bias may occur due to variable representation of orthopedic subspecialties. This factor may also in part explain the variation in performance across sub-specialties.

A major challenge to the application of ChatGPT and other LLMs is that they often seek to answer questions in a definitive way even when it might be more correct to report that there is no consensus. This phenomenon has been described as artificial hallucination, or the tendency for LLMs to state plausible phrases without actual reference to their training sources, which limits their ability to provide accurate medical diagnoses or treatments. Additionally, medicine is a continuously evolving field that requires adaptation of treatments to a patient's individual needs. Vaishya et al. found that ChatGPT's knowledge was limited to data up to September 2021, and that it struggled to understand or explain complex medical concepts with precise language and accurate terminology.²⁶ The authors also raise concerns that ChatGPT can produce biased content, as some demographic groups can be underrepresented in the literature. Thus, the current ChatGPT version has several limitations in its usage for healthcare providers.

While ChatGPT is a breakthrough AI model, several limitations exist for orthopedic surgeons to consider. The most concerning is that ChatGPT may provide fabricated or false information, and this is explicitly stated by OpenAI: “ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.”¹,²⁷ Thus, orthopedic surgeons should be cautious with information provided by ChatGPT and verify it against existing literature before making clinical decisions. It is important to note that ChatGPT does not itself have direct access to databases, such as PubMed. Additionally, we did not include image analysis in our study, because ChatGPT is unable to analyze radiographs and MRI fluorographs. As such, our study is only representative of OITE exam questions that do not involve clinical photographs or imaging studies.

Although ChatGPT is not currently sufficient for clinical application in orthopaedic education, its performance is remarkable, considering it was released nearly 6 months before the writing of this manuscript. The global natural language processing market, which was valued at 26.42 billion US dollars in 2022, is expected to increase to over 160 billion within the next 6 years. As the market for these language models expands, the size of LLMs has increased exponentially over the past few years, with nearly a 10x increase from year to year starting in 2017.²⁸ Regarding OpenAI's GPT models, the recently released GPT-4 is reportedly 6x larger than GPT-3, with one trillion parameters. In addition, GPT-4 is a multimodal model, meaning it can accept both image and text inputs, a feature not included in GPT-3. This new LLM also scored in the top 10% of test takers on a simulated bar exam, compared to GPT-3.5, which scored among the bottom 10%.¹ Based on the increasing investment in and size of more advanced models, we can expect that newer models will be trained on larger datasets and will be exposed to more diverse information.

5. Conclusion

In this study, we assessed the ability of ChatGPT to complete orthopedic board exam questions in a multiple-choice format, from SAE and OITE question banks. ChatGPT was able to answer the majority of questions correctly, particularly within the fields of shoulder and elbow, sports, oncology, and basic science. However, its ability to provide logical and informational context should be scrutinized, as we found ChatGPT can fabricate research papers and provide unclear or erroneous details to support its conclusions. ChatGPT may have the potential to provide patients with answers to clinical questions; however, the reasoning behind these answers should be analyzed for accuracy and clinical validity. Its usefulness in a clinical educational context is currently limited.

Sources of support

N/A.

Ethical statement

The study does not involve any humans or animal research.

Guardian/patient consent

Guardian/Patient consent is not applicable in this study.

Funding declaration

We have not received any funding/financial aid/research grant for this manuscript.

Declaration of competing interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

CRediT authorship contribution statement

Ula N. Isleem: Conceptualization, Data curation, Methodology, Writing – original draft, Writing – review & editing. Bashar Zaidat: Data curation, Formal analysis, Writing – original draft. Renee Ren: Data curation, Writing – original draft, Writing – review & editing. Eric A. Geng: Data curation, Writing – original draft. Aonnicha Burapachaisri: Data curation. Justin E. Tang: Writing – original draft. Jun S. Kim: Supervision, Writing – review & editing. Samuel K. Cho: Conceptualization, Supervision, Writing – review & editing.

Acknowledgements

There are no acknowledgements to disclose.

References

1. OpenAI. ChatGPT. 2022. https://openai.com/blog/chatgpt/
2. Arora A., Arora A. Generative adversarial networks and synthetic patient data: current challenges and future perspectives. Future Healthcare Journal. 2022 Jul;9(2):190. doi: 10.7861/fhj.2022-0013.
3. Straw I., Callison-Burch C. Artificial Intelligence in mental health and the biases of language based models. PLoS One. 2020 Dec 17;15(12). doi: 10.1371/journal.pone.0240376.
4. Gilson A., Safranek C.W., Huang T., et al. How does CHATGPT perform on the United States Medical Licensing Examination? the implications of large language models for medical education and knowledge assessment. JMIR Medical Education. 2023;9(1). doi: 10.2196/45312.
5. Antaki F., Touma S., Milad D., El-Khoury J., Duval R. Evaluating the performance of chatgpt in ophthalmology: an analysis of its successes and shortcomings. medRxiv. 2023;3(4). doi: 10.1016/j.xops.2023.100324. Jan 24.
6. Kallianos K., Mongan J., Antani S., et al. How far have we come? Artificial intelligence for chest radiograph interpretation. Clin Radiol. 2019 May 1;74(5):338–345. doi: 10.1016/j.crad.2018.12.015.
7. Byrne M.F., Chapados N., Soudan F., et al. Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model. Gut. 2019 Jan 1;68(1):94–100. doi: 10.1136/gutjnl-2017-314547.
8. Bertini A., Salas R., Chabert S., Sobrevia L., Pardo F. Using Machine learning to predict complications in pregnancy: a systematic review. Front Bioeng Biotechnol. 2022 Jan 19;9:1385. doi: 10.3389/fbioe.2021.780389.
9. Ljubic B., Hai A.A., Stanojevic M., et al. Predicting complications of diabetes mellitus using advanced machine learning algorithms. J Am Med Inf Assoc. 2020 Sep 1;27(9):1343–1351. doi: 10.1093/jamia/ocaa120.
10. Cho B.H., Kaji D., Cheung Z.B., et al. Automated measurement of lumbar lordosis on radiographs using machine learning and computer vision. Global Spine J. 2020 Aug;10(5):611–618. doi: 10.1177/2192568219868190.
11. Muehlematter U.J., Mannil M., Becker A.S., et al. Vertebral body insufficiency fractures: detection of vertebrae at risk on standard CT images using texture analysis and machine learning. Eur Radiol. 2019 May 1;29:2207–2217. doi: 10.1007/s00330-018-5846-8.
12. Liu F., Zhou Z., Jang H., Samsonov A., Zhao G., Kijowski R. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magn Reson Med. 2018 Apr;79(4):2379–2391. doi: 10.1002/mrm.26841.
13. Harris A.H., Kuo A.C., Weng Y., Trickey A.W., Bowe T., Giori N.J. Can machine learning methods produce accurate and easy-to-use prediction models of 30-day complications and mortality after knee or hip arthroplasty? Clin Orthop Relat Res. 2019 Feb;477(2):452. doi: 10.1097/CORR.0000000000000601.
14. Oosterhoff J.H., Karhade A.V., Oberai T., Franco-Garcia E., Doornberg J.N., Schwab J.H. Prediction of postoperative delirium in geriatric hip fracture patients: a clinical prediction model using machine learning algorithms. Geriatr Orthop Surg Rehabil. 2021 Dec 12;12. doi: 10.1177/21514593211062277.
15. Karhade A.V., Bongers M.E., Groot O.Q., et al. Natural language processing for automated detection of incidental durotomy. Spine J. 2020 May 1;20(5):695–700. doi: 10.1016/j.spinee.2019.12.006.
16. Tang J.E., Arvind V., White C.A., Dominy C., Kim J.S., Cho S.K. What are patients saying about you online? A sentiment analysis of online written reviews on Scoliosis Research Society surgeons. Spine Deformity. 2021 Oct 2:1–6. doi: 10.1007/s43390-021-00419-y.
17. Kim J.H. Search for medical information and treatment options for musculoskeletal disorders through an artificial intelligence chatbot: focusing on shoulder impingement syndrome. medRxiv. 2022 Dec 18 2022-12.
18. Bernstein J. Not the last word: ChatGPT can't perform orthopaedic surgery. Clin Orthop Relat Res. 2022 May 10:10–97. doi: 10.1097/CORR.0000000000002619.
19. Osbahr D.C., Cross M.B., Taylor S.A., Bedi A., Dines D.M., Dines J.S. An analysis of the shoulder and elbow section of the orthopedic in-training examination. Am J Orthoped. 2012 Feb 1;41(2):63–68.
20. Ellsworth B.K., Premkumar A., Shen T., Lebrun D.G., Cross M.B., Widmann R.F. An updated analysis of the pediatric section of the orthopaedic in-training examination. J Pediatr Orthop. 2020 Nov 9;40(10):e1017–e1021. doi: 10.1097/BPO.0000000000001663.
21. Murphy R.F., Nunez L., Barfield W.R., Mooney J.F. 3rd. Evaluation of pediatric questions on the orthopaedic in-training examination-an update. J Pediatr Orthop. 2017 Sep;37(6):e394–e397. doi: 10.1097/BPO.0000000000000913. PMID: 27977498.
22. Premkumar A., Lebrun D.G., Shen T.S., Ellsworth B.K., Bostrom M.P., Cross M.B. Analysis of hip and knee reconstruction questions on the Orthopedic In-Training Examination. J Arthroplasty. 2021 Mar 1;36(3):1156–1159. doi: 10.1016/j.arth.2020.09.018.
23. LeBrun D.G., Premkumar A., Ellsworth B., Shen T.S., Cross M.B., Fufa D.T. Analysis of hand Surgery questions on orthopedic in-training examination from 2014 to 2019. Hand. 2022 Sep;17(5):975–982. doi: 10.1177/1558944720964960.
24. Grandizio L.C., Huston J.C., Shim S.S., Graham J., Klena J.C. Levels of evidence for hand questions on the orthopaedic in-training examination. Hand. 2016;11(4):484–488. doi: 10.1177/1558944715620793.
25. Shen T.S., Driscoll D.A., Ellsworth B.K., et al. Analysis of the basic science questions on the Orthopaedic In-Training Examination from 2014 to 2019. J Am Acad Orthop Surg. 2021 Dec 1;29(23):e1225–e1231. doi: 10.5435/JAAOS-D-20-00862.
26. Vaishya R., Misra A., Vaish A. ChatGPT: is this version good for healthcare and research? Diabetes Metabol Syndr. 2023 Apr;17(4). doi: 10.1016/j.dsx.2023.102744. Epub 2023 Mar 15. PMID: 36989584.
27. van Dis E.A., Bollen J., Zuidema W., van Rooij R., Bockting C.L. ChatGPT: five priorities for research. Nature. 2023 Feb 9;614(7947):224–226. doi: 10.1038/d41586-023-00288-7.
28. Wang H., Zhang Z., Wu Z., et al. Efficient Natural Language Processing [Internet]. MIT-IBM Watson AI Lab; [cited 2023 Mar 28]. Available from: https://hanlab.mit.edu/projects/efficientnlp_old/


[bib16] 16.Tang J.E., Arvind V., White C.A., Dominy C., Kim J.S., Cho S.K. What are patients saying about you online? A sentiment analysis of online written reviews on Scoliosis Research Society surgeons. Spine Deformity. 2021 Oct 2:1–6. doi: 10.1007/s43390-021-00419-y. [DOI] [PubMed] [Google Scholar]

[bib17] 17.Kim J.H. Search for medical information and treatment options for musculoskeletal disorders through an artificial intelligence chatbot: focusing on shoulder impingement syndrome. medRxiv. 2022 Dec 18 2022-12. [Google Scholar]

[bib18] 18.Bernstein J. Not the last word: ChatGPT can't perform orthopaedic surgery. Clin Orthop Relat Res. 2022 May 10:10–97. doi: 10.1097/CORR.0000000000002619. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19.Osbahr D.C., Cross M.B., Taylor S.A., Bedi A., Dines D.M., Dines J.S. An analysis of the shoulder and elbow section of the orthopedic in-training examination. Am J Orthoped. 2012 Feb 1;41(2):63–68. [PubMed] [Google Scholar]

[bib20] 20.Ellsworth B.K., Premkumar A., Shen T., Lebrun D.G., Cross M.B., Widmann R.F. An updated analysis of the pediatric section of the orthopaedic in-training examination. J Pediatr Orthop. 2020 Nov 9;40(10):e1017–e1021. doi: 10.1097/BPO.0000000000001663. [DOI] [PubMed] [Google Scholar]

[bib21] 21.Murphy R.F., Nunez L., Barfield W.R., Mooney J.F., 3rd Evaluation of pediatric questions on the orthopaedic in-training examination-an update. J Pediatr Orthop. 2017 Sep;37(6):e394–e397. doi: 10.1097/BPO.0000000000000913. PMID: 27977498. [DOI] [PubMed] [Google Scholar]

[bib22] 22.Premkumar A., Lebrun D.G., Shen T.S., Ellsworth B.K., Bostrom M.P., Cross M.B. Analysis of hip and knee reconstruction questions on the Orthopedic In-Training Examination. J Arthroplasty. 2021 Mar 1;36(3):1156–1159. doi: 10.1016/j.arth.2020.09.018. [DOI] [PubMed] [Google Scholar]

[bib23] 23.LeBrun D.G., Premkumar A., Ellsworth B., Shen T.S., Cross M.B., Fufa D.T. Analysis of hand Surgery questions on orthopedic in-training examination from 2014 to 2019. Hand. 2022 Sep;17(5):975–982. doi: 10.1177/1558944720964960. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] 24.Grandizio L.C., Huston J.C., Shim S.S., Graham J., Klena J.C. Levels of evidence for hand questions on the orthopaedic in-training examination. Hand. 2016;11(4):484–488. doi: 10.1177/1558944715620793. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] 25.Shen T.S., Driscoll D.A., Ellsworth B.K., et al. Analysis of the basic science questions on the Orthopaedic In-Training Examination from 2014 to 2019. J Am Acad Orthop Surg. 2021 Dec 1;29(23):e1225–e1231. doi: 10.5435/JAAOS-D-20-00862. [DOI] [PubMed] [Google Scholar]

[bib26] 26.Vaishya R., Misra A., Vaish A. ChatGPT: is this version good for healthcare and research? Diabetes Metabol Syndr. 2023 Apr;17(4) doi: 10.1016/j.dsx.2023.102744. Epub 2023 Mar 15. PMID: 36989584. [DOI] [PubMed] [Google Scholar]

[bib27] 27.van Dis E.A., Bollen J., Zuidema W., van Rooij R., Bockting C.L. ChatGPT: five priorities for research. Nature. 2023 Feb 9;614(7947):224–226. doi: 10.1038/d41586-023-00288-7. [DOI] [PubMed] [Google Scholar]

[bib28] 28.Wang H., Zhang Z., Wu Z., et al. Efficient Natural Language Processing. MIT-IBM Watson AI Lab; [cited 2023Mar28] https://hanlab.mit.edu/projects/efficientnlp_old/ [Internet], Available from:
