Abstract
Background:
Generative artificial intelligence (AI) large language model (LLM) chatbots, such as ChatGPT, are increasingly used to answer medical questions. This study sought to assess the accuracy and quality of evidence cited in ChatGPT-4o mini responses to questions pertaining to hip fracture care.
Methods:
Prompt questions regarding hip fracture management that aligned with each of the 19 recommendations published in the American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guideline (CPG) for Management of Hip Fractures in Older Adults were posed to the ChatGPT-4o mini LLM asynchronously by 4 independent medical student graders. Three prompt variations were applied for each recommendation, reflecting the perspectives of a physician, a patient, and a general information seeker. Graders then requested from the LLM a reference list with PubMed Identifier (PMID) numbers supporting each recommendation. Response accuracy and clarity were assessed using a standardized rubric, and reference lists were evaluated for overlap with CPG citations, fabricated references, and citation inaccuracies.
Results:
ChatGPT-4o mini returned 228 responses to prompts seeking advice on AAOS CPG hip fracture management recommendations. Of these, 76.3% were “accurate” relative to the CPG recommendation, and 88.2% received a clarity rating of “excellent.” When prompted for supporting evidence, ChatGPT-4o mini provided 228 responses citing 2,556 publications, of which 1.1% overlapped with AAOS CPG references and 7.9% were fabricated. Of the cited publications that exist in the PubMed index, 91.7% were given with incorrect authors, 91.5% with incorrect titles, 91.4% with incorrect pages, 91.0% with incorrect PMIDs, 90.9% with incorrect journals, 90.3% with incorrect journal volumes, and 20.0% with incorrect publication years. Responses addressing CPG recommendations of strong strength were significantly more likely to be “accurate” (p < 0.001), and responses addressing recommendations of limited strength were significantly more likely to be “unsupported” (p < 0.001).
Conclusions:
ChatGPT-4o mini provided clear, moderately accurate responses with pervasively erroneous and occasionally fabricated citations to queries about hip fracture care derived from the AAOS Clinical Practice Guideline on Management of Hip Fractures in Older Adults.
Level of Evidence:
Level V Therapeutic. See Instructions for Authors for a complete description of levels of evidence.
Introduction
Clinical practice guidelines (CPGs), including those published by the American Academy of Orthopaedic Surgeons (AAOS), synthesize published research and provide evidence-based recommendations to help patients, caregivers, and physicians navigate medical decisions1. The current AAOS CPG on Management of Hip Fractures in Older Adults provides 19 recommendations for treatment choices in hip fracture care, each with a strength rating based on the quality of available evidence1. While this document stands as a comprehensive reference, patients, physicians, and other stakeholders faced with healthcare decisions are increasingly querying large language models (LLMs) for answers to questions about treatment choices2.
The capacity of transformer-based LLMs such as the Chat Generative Pretrained Transformer (ChatGPT, OpenAI) to process large amounts of data and interact conversationally with users has led to increasing LLM use across medical specialties3,4. Public use of the ChatGPT LLM is particularly widespread2. ChatGPT can interpret and navigate the nuances of medical literature to synthesize data and generate coherent, referenced responses to questions commonly asked by patients undergoing orthopaedic procedures2,4. Specifically for patient questions about hip fracture care, a previous version of ChatGPT provided unbiased and evidence-based answers that could be understood by most orthopaedic patients5. However, ChatGPT 4.0 responses to hip fracture care queries were not consistent with the AAOS Appropriate Use Criteria (AUC), and the LLM was inadequate at selecting the treatment deemed acceptable, most reasonable, and most likely to improve patient outcomes6. ChatGPT citation fabrication has also been documented7. Upon its release on July 18, 2024, ChatGPT-4o mini was noted to be OpenAI's most cost-efficient model, surpassing prior releases in both textual intelligence and reasoning tasks8. However, it is not known whether ChatGPT-4o mini provides accurate recommendations based on the AAOS CPG on Management of Hip Fractures in Older Adults or fabricates evidence to support its responses.
The purpose of this study was to assess the accuracy, scope, and fabrication of supporting evidence in responses provided by the presently available version of ChatGPT to queries derived from the AAOS CPG on Management of Hip Fractures in Older Adults. We hypothesized that ChatGPT would provide accurate guidance within the scope of the CPG but would fabricate references to support its responses.
Materials and Methods
Study Design
An in silico observational study was performed by prospectively prompting a commercially available LLM. This investigation used a single LLM, ChatGPT (Version 4o mini, free public version, OpenAI), as this product accounted for nearly 50% of AI chatbot traffic at the time of investigation9. Each of the 19 recommendations published in the AAOS Clinical Practice Guideline on Management of Hip Fractures in Older Adults (updated December 3, 2021) was structured as a query to ChatGPT-4o mini requesting a treatment recommendation for each clinical dilemma. Three prompts were then engineered for each recommendation query, posing the query from 3 perspectives: a physician requesting treatment guidance, a patient seeking information about treatment choices, and a general member of the public seeking health information. Prompt engineering and a data collection protocol were developed by 4 independent graders using published recommendations and a Delphi Method approach, with the senior author acting as facilitator7,10. Medical student graders with content expertise and prior experience with LLM prompt engineering for research purposes were trained with the AAOS Management of Hip Fractures in Older Adults CPG and were considered experts capable of contributing to consensus establishment via the Delphi Method. Institutional Review Board approval was not required for this investigation, which did not include protected health information.
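For illustration, this design is a full factorial of recommendations, perspectives, and graders. The following R sketch (R being the environment used for the statistical analyses described below) enumerates that matrix; the recommendation labels are placeholders, as the actual prompt wording followed the engineered protocol described above:

```r
# Enumerate the prompt matrix: 19 CPG recommendations x 3 perspectives
# x 4 graders = 228 prompt submissions (each later paired with a second
# query requesting supporting references, for 456 total queries).
recommendations <- paste("CPG recommendation", 1:19)  # placeholder labels
perspectives <- c("physician", "patient", "general public")
graders <- paste0("grader_", 1:4)

prompt_matrix <- expand.grid(
  recommendation = recommendations,
  perspective    = perspectives,
  grader         = graders,
  stringsAsFactors = FALSE
)
nrow(prompt_matrix)  # 228
```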
Data Collection
Prompts were submitted to ChatGPT independently and asynchronously by the 4 graders between October and November 2024. After each prompt response, each grader queried the LLM with a second, standardized prompt requesting references for supporting evidence. Each pair of prompt and evidence request was submitted in a newly opened private browsing window to eliminate prior conversation history. PubMed Identifiers (PMIDs) were collected for each citation provided in the ChatGPT-4o mini responses.
Response Assessment
A rubric was developed, based on previous LLM investigations2,7,11,12, to grade responses to prompts across 5 domains: accuracy, CPG support for the recommendation, inclusion of supplemental information, completeness, and clarity. Responses were independently assessed by the 4 medical student graders using the rubric. A response was considered accurate if its recommendation aligned with the CPG recommendation. If the CPG indicated insufficient evidence to provide a recommendation and ChatGPT nonetheless provided one, the response was considered unsupported. Supplemental information was defined as additional treatment recommendations not specified by the CPG. A response was considered complete if the recommendation did not omit relevant information specified by the CPG. Clarity was graded as “excellent” if no clarification was needed to understand and implement the recommendations provided in the response, “satisfactory” if minimal clarification was required, “unclear” if substantial clarification was required, or “unsatisfactory” if the response could not be implemented. An example of the response evaluation rubric is provided in the supplemental material (Supplemental Table 1).
Responses to requests for supporting evidence were graded across 3 domains: overlap with the references cited for each recommendation in the CPG, fabrication of references, and reference inaccuracies. Overlap was defined as the percentage of ChatGPT-provided references that matched the CPG references supporting each published recommendation. A fabricated reference was defined as a reference cited by ChatGPT but not indexed in PubMed. Inaccuracies were defined as discrepancies between references provided by ChatGPT and references indexed in PubMed with regard to title, authors, journal, publication year, page number, volume, and PMID.
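As an illustration of how such reference checks can be automated, the sketch below queries the NCBI E-utilities ESummary endpoint for a given PMID and returns the indexed bibliographic fields for comparison. This is a minimal sketch of the verification logic, not necessarily the graders' actual workflow, and the field names follow the ESummary JSON schema:

```r
# Minimal sketch: look up a PMID in the PubMed index via NCBI E-utilities.
# A missing or errored record flags a candidate fabricated reference; the
# returned fields can be compared against the ChatGPT-provided citation.
library(jsonlite)

check_pmid <- function(pmid) {
  url <- paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
    "?db=pubmed&retmode=json&id=", pmid
  )
  res <- fromJSON(url)
  rec <- res$result[[as.character(pmid)]]
  if (is.null(rec) || !is.null(rec$error)) {
    return(NULL)  # PMID not retrievable: candidate fabrication
  }
  list(                     # fields to compare against the cited reference
    title   = rec$title,
    journal = rec$fulljournalname,
    year    = substr(rec$pubdate, 1, 4),
    volume  = rec$volume,
    pages   = rec$pages,
    authors = rec$authors$name
  )
}
# Example: check_pmid("<PMID cited by ChatGPT>")
```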
Statistical Analysis
Prompt grading was summarized as raw counts and percentages. Response domains for prompts were compared across the perspective groups using chi-square tests for categorical data and Kruskal-Wallis tests for continuous data. Response domains for supporting evidence were compared across CPG “limited,” “moderate,” and “strong” grades of recommendation strength using the same tests. Non-normal distribution was confirmed with the Shapiro-Wilk test. Statistical significance was set at p < 0.05 without correction for multiple hypothesis testing. No data were imputed because no data were missing. Statistical analyses were performed with R Version 4.4.1 (R Foundation).
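As a concrete illustration of this analysis plan, a minimal R sketch using the accuracy counts reported in Table I compares accuracy across recommendation strength groups; the Kruskal-Wallis and Shapiro-Wilk calls are shown schematically with hypothetical vector names:

```r
# Chi-square test of response accuracy by CPG recommendation strength,
# using the counts reported in Table I.
accurate <- c(limited = 20, moderate = 42, strong = 112)
total    <- c(limited = 36, moderate = 60, strong = 132)

tab <- rbind(accurate = accurate, not_accurate = total - accurate)
chisq.test(tab)  # X-squared on 2 df; p < 0.001, matching the reported result

# Kruskal-Wallis test for a continuous domain, e.g., references per
# response across strength groups (`refs`, `strength` are hypothetical):
# kruskal.test(refs ~ strength)
# Normality assessment preceding the choice of nonparametric tests:
# shapiro.test(refs)
```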
Results
Graders assessed 456 unique responses from 456 ChatGPT queries, including 228 responses to prompts seeking advice for hip fracture management and 228 responses to requests for supporting evidence, which yielded 2,556 references. Responses to prompts for treatment guidance were graded 76.3% accurate, 7.0% unsupported, 61.8% supplemental, and 9.7% incomplete. Stratified by strength of AAOS CPG recommendation, significantly more responses to queries reflecting strong recommendations were graded as accurate compared with limited strength recommendations (112/132 [84.8%] versus 20/36 [55.6%], p < 0.001). Significantly more responses to limited strength guideline queries were rated as unsupported compared with both moderate and strong strength guideline queries (11/36 [30.6%] versus 2/60 [3.3%] and 3/132 [2.3%], respectively, p < 0.001). In terms of response clarity, 88.2% of responses were rated excellent, 10.5% satisfactory, and 1.3% unclear; none were unsatisfactory. No significant differences in clarity were observed when stratified by CPG recommendation strength (Table I).
TABLE I.
Subgroup Analysis of Accuracy and Clarity by Recommendation Strength of AAOS Guideline
| | Limited (%) (N = 36) | Moderate (%) (N = 60) | Strong (%) (N = 132) | p |
|---|---|---|---|---|
| Accurate | 20/36 (55.6) | 42/60 (70.0) | 112/132 (84.8) | **<0.001** |
| Unsupported | 11/36 (30.6) | 2/60 (3.3) | 3/132 (2.3) | **<0.001** |
| Supplemental | 21/36 (58.3) | 40/60 (66.7) | 80/132 (60.6) | 0.649 |
| Incomplete | 6/36 (16.7) | 4/60 (6.7) | 12/132 (9.1) | 0.260 |
| Clarity | | | | 0.124 |
| Excellent | 30/36 (83.3) | 52/60 (86.7) | 120/132 (90.9) | |
| Satisfactory | 4/36 (11.1) | 7/60 (11.7) | 12/132 (9.1) | |
| Unclear | 2/36 (5.6) | 1/60 (1.7) | 0/132 | |
| Unsatisfactory | 0/36 | 0/60 | 0/132 | |
AAOS = American Academy of Orthopaedic Surgeons. Bolded values indicate statistical significance at p < 0.05.
Stratified by prompt perspective, no significant differences were observed across the general, physician, and patient perspectives with regard to accuracy, support, supplemental information, or completeness (Table II). In contrast to the quality of the responses themselves, only 1.1% of references provided by ChatGPT exactly matched references cited in the AAOS CPG, and ChatGPT completely fabricated 7.9% of references. Of the references provided by ChatGPT and indexed in PubMed, 91.7% were cited with incorrect authors, 91.5% with an incorrect title, 91.4% with incorrect page numbers, 91.0% with an incorrect PMID, 90.9% with an incorrect journal title, 90.3% with an incorrect journal volume or issue, and 20.0% with an incorrect publication year (Table III). No significant differences in reference overlap, fabricated citations, total references, or inaccuracies in indexable sources were observed when stratified by CPG recommendation strength (Supplemental Table 2) or prompt perspective (Supplemental Table 3).
TABLE II.
Subgroup Analysis of Accuracy and Clarity by Prompt Type
| | General (N = 76) | Physician (N = 76) | Patient (N = 76) | p |
|---|---|---|---|---|
| Accurate | 59/76 (77.6) | 56/76 (73.7) | 59/76 (77.6) | 0.804 |
| Unsupported | 8/76 (10.5) | 7/76 (9.2) | 1/76 (1.3) | 0.056 |
| Supplemental | 48/76 (63.2) | 48/76 (63.2) | 45/76 (59.2) | 0.846 |
| Incomplete | 6/76 (7.9) | 7/76 (9.2) | 9/76 (11.8) | 0.703 |
| Clarity | | | | 0.432 |
| Excellent | 66/76 (86.8) | 70/76 (92.1) | 66/76 (86.8) | |
| Satisfactory | 10/76 (13.2) | 5/76 (6.6) | 8/76 (10.5) | |
| Unclear | 0/76 | 1/76 (1.3) | 2/76 (2.6) | |
| Unsatisfactory | 0/76 | 0/76 | 0/76 | |
TABLE III.
Mean Counts of Citation Characteristics Per Single ChatGPT Response
| Response Domain | Mean ± SD |
|---|---|
| Citation overlap with guidelines | 0.12 ± 0.39 |
| Fabricated references | 0.59 ± 1.10 |
| Total references | 7.53 ± 2.28 |
| Title incorrect | 6.89 ± 2.61 |
| Author incorrect | 6.90 ± 2.62 |
| Journal incorrect | 6.84 ± 2.58 |
| Year incorrect | 1.50 ± 1.47 |
| Pages incorrect | 6.88 ± 2.65 |
| Volume incorrect | 6.80 ± 2.64 |
| PMID incorrect | 6.92 ± 2.58 |
PMID = PubMed Identifier.
Discussion
ChatGPT-4o mini provided very clear and moderately accurate responses to queries about hip fracture care derived from the AAOS Clinical Practice Guideline on Management of Hip Fractures in Older Adults. However, ChatGPT generated pervasively erroneous and occasionally fabricated citations in support of these recommendations. Citation quality was independent of both the strength of the particular recommendation and the perspective of the individual querying the large language model, although responses addressing strong recommendations were more likely to be accurate than those addressing limited strength recommendations.
Reference fabrication is an established shortcoming of transformer-based LLMs, including ChatGPT. Three studies of a previous version of ChatGPT (3.5) identified reference fabrication rates of 15.12%, 47%, and 60% in stem cell (reference n = 86), mixed subject (n = 115), and psychiatry research (n = 35), respectively7,13,14. Successive updates to ChatGPT have exhibited lower rates of reference fabrication: comparing version 3.5 with 4.0, Walters et al. found fabrication rates fell from 55% to 18%, and Chelli et al. from 39.6% to 28.6%12,15. Kim et al., in a recent study testing prompts built from the same AAOS hip fracture CPG analyzed in this work, found that compared with ChatGPT version 4.0, the currently available 4o generated more reliable information and had a reduced fabrication rate, although they did not explicitly quantify this frequency16. Our finding of a 7.9% reference fabrication rate across 2,556 references suggests continued improvement in fabrication frequency. However, even a single falsified reference undermines LLM credibility. Our finding that only 1.1% of ChatGPT's references fully matched AAOS CPG citations underscores the lack of fidelity that this LLM brings to urgent and clinically relevant questions faced by millions of adults annually. The root cause of reference fabrication has been attributed to inconsistencies, inaccuracies, and/or problematic patterns in the input data used for training, with resultant errors in gap-filling when the software attempts to synthesize a plausible response17-19. Lack of access to the real-time internet has also been cited as a limitation that may contribute to response errors, but reference fabrication by Bard, which operates with an internet connection, has been documented at rates as high as 91.4% in recent literature12,14. Thus, some experts believe that “hallucinations” are fundamental to LLMs, whose output generation seeks merely to satisfy statistical patterns of co-occurrence learned during training to probabilistically link pieces of text7,14.
Errors across individual elements within nonfabricated citations, while perhaps more forgivable than outright fabrications, remain a problematic barrier to implementing LLMs as clinical decision support tools. Studies quantifying reference inaccuracies from ChatGPT version 3.5 vary widely in the frequency of references with incorrect information (range: 9%-46%)7,13,14. Our findings illustrate far worse element-level accuracy among PubMed-indexable references, with inaccuracies in more than 90% of such citations. Bhattacharyya et al. found inaccurate PMIDs in 93% of references produced by ChatGPT 3.5, a phenomenon the authors postulated may be due to errors in handling digital object identifier (DOI) data7. These errors do not appear to have been adequately addressed in the 4o mini update.
Conversely, subjective evaluations of ChatGPT responses to orthopaedic queries indicate progress across version updates. Duey et al. compared recommendations made by ChatGPT 3.5 and 4.0 with North American Spine Society (NASS) clinical guidelines and found that significantly more version 4.0 responses were accurate to the guidelines (92%) and significantly fewer were overconclusive (8%)11. Ahmed et al., in a similar NASS guideline analysis, reported a response accuracy of 67.9% to guideline content in version 4.0 and a similar percentage in version 3.510. However, Gianola et al. reported poor accuracy, finding that only 33% of ChatGPT 3.5 responses were consistent with the lumbosacral radicular pain CPG20. The 76% response accuracy, low 7% rate of unsupported recommendations, and infrequent “overconclusive” responses in this study may be associated with the newer LLM version, the specificity of the individual CPG recommendations, the strength of the underlying evidence, and the sample size: Duey et al. reviewed 12 responses, Ahmed et al. 28, and Gianola et al. 9, compared with the 228 responses analyzed in the current study10,11,20.
In terms of response readability, Mika et al. found 70% (7/10) of ChatGPT 3.5 answers to 10 commonly asked total hip arthroplasty questions required only minimal or moderate clarification2. Our analysis of ChatGPT was more favorable, with 88.2% of responses graded as excellent (not requiring additional clarification). Our findings suggest that version updates have improved the accuracy and readability of responses that ChatGPT provides to questions seeking a medical recommendation.
There are several limitations to this investigation. The evolving nature of ChatGPT makes it difficult to generalize our findings beyond the version available at the time of study. In addition, we are not aware of a standardized tool for assessing LLM responses, and some of the criteria evaluated in this study were entirely subjective. We attempted to reduce bias by protocolizing the queries, grading responses with a rubric, and limiting interactions to 2 standardized queries per recommendation and perspective. It is possible that additional conversational queries could have clarified responses and yielded more accurate references. It is also possible that ChatGPT was trained on data that postdate the CPG's 2021 publication, and that this newer literature incorporates relevant findings not captured in the CPG. Finally, these findings are limited in scope to the AAOS 2021 CPG governing hip fracture management in older adults and are not necessarily generalizable to other areas of disease and intervention.
Future investigations of ChatGPT performance may build upon these combined analyses of response quality and supporting evidence that are central to our methodology. Increasingly sophisticated analyses of ChatGPT move closer to the goal of a standardized benchmarking tool that can comprehensively assess LLM performance and better evaluate the many functional domains that these tools must master before entry into the clinical space.
Conclusion
ChatGPT-4o mini provided clear, moderately accurate responses with pervasively erroneous and occasionally fabricated citations to queries about hip fracture care derived from the AAOS Clinical Practice Guideline on Management of Hip Fractures in Older Adults. ChatGPT demonstrates promise, but its performance is presently too unreliable to warrant use by patients, caregivers, and medical professionals as an adjunctive tool to existing guidelines for shared decision-making about hip fracture care.
Appendix
Supporting material provided by the authors is posted with the online version of this article as a data supplement at jbjs.org (http://links.lww.com/JBJSOA/B101). This content was not copyedited or verified by JBJS.
Footnotes
Investigation performed at Keck School of Medicine of the University of Southern California, Los Angeles, CA
Disclosure: The Disclosure of Potential Conflicts of Interest forms are provided with the online version of the article (http://links.lww.com/JBJSOA/B100).
Contributor Information
David McCavitt, Email: mccavitt@usc.edu.
Soroush Shabani, Email: sshabani@usc.edu.
Ashley Mulakaluri, Email: mulakalu@usc.edu.
Sahil Dhandi, Email: dhandi@usc.edu.
Andrew Duong, Email: amduong@usc.edu.
References
- 1. American Academy of Orthopaedic Surgeons. Management of Hip Fractures in Older Adults Evidence-Based Clinical Practice Guideline. American Academy of Orthopaedic Surgeons; 2021. Available at: https://www.aaos.org/hipfxcpg.
- 2. Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am. 2023;105(19):1519-26.
- 3. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-40.
- 4. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233-9.
- 5. Wrenn SP, Mika AP, Ponce RB, Mitchell PM. Evaluating ChatGPT's ability to answer common patient questions regarding hip fracture. J Am Acad Orthop Surg. 2024;32(14):656-9.
- 6. Nietsch KS, Shrestha N, Mazudie Ndjonko LC, Ahmed W, Mejia MR, Zaidat B, Ren R, Duey AH, Li SQ, Kim JS, Hidden KA, Cho SK. Can large language models (LLMs) predict the appropriate treatment of acute hip fractures in older adults? Comparing appropriate use criteria with recommendations from ChatGPT. J Am Acad Orthop Surg Glob Res Rev. 2024;8(8):e24.00206.
- 7. Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High rates of fabricated and inaccurate references in ChatGPT-generated medical content. Cureus. 2023;15(5):e39238.
- 8. OpenAI. GPT-4o mini: advancing cost-efficient intelligence. Available at: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed July 25, 2025.
- 9. Venditti B. The 10 most-used AI chatbots in 2025. Visual Capitalist; 2025. Available at: https://www.visualcapitalist.com/the-10-most-used-ai-chatbots-in-2025/. Accessed December 1, 2025.
- 10. Ahmed W, Saturno M, Rajjoub R, Duey AH, Zaidat B, Hoang T, Restrepo Mejia M, Gallate ZS, Shrestha N, Tang J, Zapolsky I, Kim JS, Cho SK. ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis. Eur Spine J. 2024;33(11):4182-203.
- 11. Duey AH, Nietsch KS, Zaidat B, Ren R, Ndjonko LCM, Shrestha N, Rajjoub R, Ahmed W, Hoang T, Saturno MP, Tang JE, Gallate ZS, Kim JS, Cho SK. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J. 2023;23(11):1684-91.
- 12. Chelli M, Lavoué V, Trojani C, Azar M, Deckert M, Raynier JL, Clowez G, Boileau P, Ruetsch-Chelli C. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. J Med Internet Res. 2024;26(3):e53164.
- 13. Sharun K, Banu SA, Pawde AM, Kumar R, Akash S, Dhama K, Pal A. ChatGPT and artificial hallucinations in stem cell research: assessing the accuracy of generated references - a preliminary study. Ann Med Surg. 2023;85(10):5275-8.
- 14. McGowan A, Gui Y, Dobbs M, Shuster S, Cotter M, Selloni A, Goodman M, Srivastava A, Cecchi GA, Corcoran CM. ChatGPT and Bard exhibit spontaneous citation fabrication during psychiatry literature search. Psychiatry Res. 2023;326:115334.
- 15. Walters WH, Wilder EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep. 2023;13(1):14045.
- 16. Kim HJ, Yoon PW, Yoon JY, Kim H, Choi YJ, Park S, Moon JK. Discrepancies in ChatGPT's hip fracture recommendations in older adults for 2021 AAOS evidence-based guidelines. J Clin Med. 2024;13(19):5971.
- 17. Sanchez-Ramos L, Lin L, Romero R. Beware of references when using ChatGPT as a source of information to write scientific articles. Am J Obstet Gynecol. 2023;229(3):356-7.
- 18. Athaluri SA, Manthena SV, Kesapragada VSRKM, Yarlagadda V, Dave T, Duddumpudi RTS. Exploring the boundaries of reality: investigating the phenomenon of artificial intelligence hallucination in scientific writing through ChatGPT references. Cureus. 2023;15(4):e37432.
- 19. Acut DP, Malabago NK, Malicoban EV, Galamiton NS, Garcia MB. “ChatGPT 4.0 ghosted us while conducting literature search”: modeling the chatbot's generated non-existent references using regression analysis. Internet Reference Serv Q. 2025;29(1):27-54.
- 20. Gianola S, Bargeri S, Castellini G, Cook C, Palese A, Pillastrini P, Salvalaggio S, Turolla A, Rossettini G. Performance of ChatGPT compared to clinical practice guidelines in making informed decisions for lumbosacral radicular pain: a cross-sectional study. J Orthop Sports Phys Ther. 2024;54(3):222-8.
