JAMA recently ran an op-ed on the importance (and challenges) of good medical communication [2], alongside a starry-eyed thesis about the promise of generative artificial intelligence—tools like ChatGPT—to serve as a transformative change agent in healthcare. For reasons that are no doubt obvious, I don’t usually spend a lot of time recommending that readers turn to other journals. And even if you’re inclined to check those out, I’d suggest you first read the Editor’s Spotlight/Take 5 article in this month’s Clinical Orthopaedics and Related Research® [10]. It will inform your interpretation of those other perspective pieces, and it will put you two jumps ahead of anyone who only reads JAMA.
The CORR® article I’m featuring here, “How Does ChatGPT Use Source Information Compared to Google? A Text Network Analysis of Online Health Information,” written by faculty members at Harvard University and The Chinese University of Hong Kong, is a don’t-miss cautionary tale on a key topic [10].
As you know, when patients (or we) do a Google search, Google reports where its answers came from. This allows the searcher at least some chance to interrogate that source. Is it likely to be trustworthy? What interests (besides those of the searcher) might it be advancing? You get the idea.
As you may or may not know, ChatGPT responds to queries using plain-language text, as though a friend were typing you a paragraph or two in response to a question you asked him or her. But no sources are provided, and it’s not clear how or whether multiple sources might have been used in combination to answer the question posed.
And ChatGPT may not be a friend. We learn in this month’s CORR that the kinds of sources ChatGPT uses vary based on the level of medical certainty that prevails on the topic being explored. In an ingenious experiment, Dr. Abhiram Bhashyam’s group used text network analysis to uncover ChatGPT’s likely sources [10], and they found that for searches on topics where answers are generally agreed-upon—their paradigm question for that was about the cause of carpal tunnel syndrome—ChatGPT tended to rely on generally reliable (academic) sources. By contrast, when asked about more-controversial topics, such as the use of platelet-rich plasma for thumb arthritis, ChatGPT leaned on a greater diversity of much lower-quality commercial sources. And none of this would be clear to, or even discoverable by, even the most-thoughtful consumer of web-based content. As you’ll see from reading Dr. Bhashyam’s paper, it requires a considerable amount of analytic oomph to sort out where ChatGPT found its answers.
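For readers curious about what the phrase "text network analysis" involves at a conceptual level, the sketch below is a minimal, purely illustrative example and is not the authors' actual pipeline; the document names and sample texts are hypothetical. The idea it shows is simple: documents become nodes, edges are weighted by shared vocabulary, and the candidate sources whose language most overlaps the AI-generated answer rank highest.

```python
# Illustrative sketch only (not the authors' method): one crude way to compare an
# AI-generated answer with candidate web sources using a term-overlap network.
# All names and example texts below are hypothetical.
import re
from itertools import combinations
import networkx as nx

def terms(text):
    """Lowercase word tokens, ignoring very short words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

ai_answer = "Carpal tunnel syndrome is caused by compression of the median nerve at the wrist..."
candidate_sources = {
    "academic_review": "Compression of the median nerve at the wrist causes carpal tunnel syndrome...",
    "commercial_site": "Our clinic treats wrist and thumb pain with platelet rich plasma injections...",
}

# Build a network: one node per document, edges weighted by shared vocabulary.
G = nx.Graph()
G.add_node("chatgpt_answer", terms=terms(ai_answer))
for name, text in candidate_sources.items():
    G.add_node(name, terms=terms(text))

for a, b in combinations(G.nodes, 2):
    shared = G.nodes[a]["terms"] & G.nodes[b]["terms"]
    union = G.nodes[a]["terms"] | G.nodes[b]["terms"]
    if shared:
        G.add_edge(a, b, weight=len(shared) / len(union))  # Jaccard similarity

# Sources whose vocabulary most overlaps the answer are the likeliest influences.
ranked = sorted(G["chatgpt_answer"].items(), key=lambda kv: kv[1]["weight"], reverse=True)
for source, attrs in ranked:
    print(f"{source}: overlap {attrs['weight']:.2f}")
```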
This is not another paper warning you that what you read on the internet may be unreliable. We all know that. It’s a package of keen insights about just when the search findings delivered by a widely used generative AI tool should not be trusted.
And while the JAMA pieces I mentioned earlier are likely to make a splash, if you read those—and other coverage of the medical applications of generative AI—without first checking out the insights in this month’s Editor’s Spotlight article [10] and in the Take 5 interview with Dr. Bhashyam that follows, you’re likely to be, well, all wet.
Take Five Interview with Abhiram R. Bhashyam MD, PhD, senior author of “How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information”
Seth S. Leopold MD: Congratulations on your thoughtful (and methodologically intense!) article. I found it eye-opening. Can you offer surgeons reading this a little advice based on your discoveries that they can share with patients who might use ChatGPT for medical searches, and perhaps a little more advice in case they might want to use it themselves for professional/medical purposes?
Abhiram R. Bhashyam MD, PhD: Thank you. This was a fun topic to investigate, and we hope that it’ll be clinically useful and helpful to patients and doctors, even if the methodology seems complicated.
Patients are going to use the internet to search for medical information. What we can do as surgeons is to help patients get a sense of context. We started this study when we realized that it’s hard to figure out what sources ChatGPT uses to create a response. Even if you ask directly, you may get an inaccurate or false response known as a “hallucination” [1, 6]. So other approaches are needed to get that information.
Here are some findings from our study that can help surgeons advise patients or use ChatGPT themselves:
In general, ChatGPT and Google likely use the same sources for a specific question. So, overall, the information provided is likely to be similar.
ChatGPT can draw upon multiple sources to create one aggregate response, while Google “shows” you its sources but the response isn’t as user-friendly [13]. And, again, trying to get source information for a ChatGPT response is nearly impossible regardless of whether a patient or surgeon is making the query, because if you query ChatGPT for sources, you are likely to encounter AI hallucinations [1, 6].
For topics with a low consensus and therefore a low number of quality sources, there is a much higher chance that ChatGPT will use less-reliable sources. In these settings, physicians should take some extra time to educate patients on the topic or provide resources that give more reliable information.
Dr. Leopold: I don’t doubt that AI is going to transform healthcare; a good argument can be made that it already has, in so many ways. I was surprised, though, that the recent JAMA essay was so all-in on generative AI specifically [13]. How should discoveries like yours inform our perspectives on the potential of generative AI to influence our profession and the people whom we treat?
Dr. Bhashyam: I think generative AI has a lot of promise, especially because it is easy to use and has direct-to-consumer delivery [13]. But most widely used large language models (LLMs), like ChatGPT, were not designed for medicine; rather, they were developed for general use [6]. When providing medical information to patients, we know that good communication about uncertainty is important because inaccurate information can have both immediate and long-term effects [2]. Because most widely used LLMs were developed for general use, in medical settings they may unintentionally make obvious or subtle mistakes [13].
In our study, we tried to address this topic systematically and in a data-driven way. At the big-picture level, ChatGPT responses are quite similar to what you’d find with a regular Google search. And in general, that information is reliable for topics with a high level of medical consensus. But for controversial or nuanced topics, our study reinforces the notion that medical information should probably be verified by clinicians with relevant expertise [2].
Dr. Leopold: The other essay that ran in that issue of JAMA was about medical communication, and it was full of great tips [2]. But it contained nothing about the elephant in the room: Most patients come to the office informed by internet health-information searches, and as often as not (and whether or not we realize it), those searches are the point of departure for any communication we may try to have. How have your findings changed the way you communicate with patients, in light of that?
Dr. Bhashyam: I loved the article by Cappola and Cohen [2] because it is so relevant to the emergence of generative AI in medicine and the potential role of medical chatbots used for consultation [2, 6]. Medical communication has historically been integral to the clinician-patient relationship, but we now have to consider the role of internet searches and social media, too. Cyberchondria is real. Repeated internet searches regarding medical information can result in increased health anxiety, functional impairment, and increased healthcare utilization [7]. Although they didn’t explicitly discuss it, the framework presented by Cappola and Cohen [2] is helpful even when patients get information from the internet. Here’s how I apply it when communicating with patients:
Message: I explain to patients that the information they get from ChatGPT or a Google search is likely to be the same. Not everything may be accurate, and I always offer to clarify.
Messenger: At this point, I assume patients are going to look up information on the internet. For nuanced or complicated topics, I explain that ChatGPT or Google Search can sometimes give imperfect answers and that medical jargon doesn’t translate well into plain English. I’ll usually direct the patients toward society-based recommendations or articles I find helpful. In settings of uncertainty, I try to become the “messenger” rather than leaving it to the internet to fill in gaps in understanding.
Social context: In a clinical context, I tend to warn people the most about social media—every patient and every injury is different. A patient with a simple olecranon fracture is unlikely to have the same care journey as a patient with a comminuted proximal ulna fracture-dislocation.
Dr. Leopold: What implications do your discoveries have for our professional societies? Can you make some specific, practical recommendations for the AAOS or our subspecialty associations?
Dr. Bhashyam: This is a great question. Individual practicing physicians and surgeons have so many daily demands, and it is undoubtedly difficult for individual practices to constantly review, prepare, and deliver patient information. As LLMs become more widely used tools for diagnosing and treating patients, we need to ensure that we aren’t spreading false information, inequity, or bias. Physician organizations can help in the responsible rollout of these tools because they often act as unified voices that connect physicians, patients, governments, and commercial interests [11]. Some individual institutions and organizations already are helping to drive the development and deployment of AI tools, but personally, I think our professional societies need to have a role as well because they are the outward representation of us as surgeons to the public [9].
Here are some practical recommendations for the AAOS and subspecialty associations based specifically on the results of our study:
Organize across medical societies to create more curated websites for orthopaedic conditions. Make it clear when the evidence is limited so that ChatGPT can reflect the lack of quality information or evidence.
Engage with technology companies through advisory boards to:
Create medicine- and subspecialty-specific LLMs, and
Help design how results are presented when LLMs are used for medical queries.
Engage with regulatory agencies to help protect patients and inform physicians as these technologies evolve [13].
Dr. Leopold: You’re not the only group to have raised concerns about problems with generative AI. There have been some fairly dramatic examples of some of these products leaping over ethical barriers, sometimes in ways I can only describe as creepy [8], and even making up not just answers but sources [3]. Conventional search engines have their shortcomings, to be sure, but in light of those problems and others [4], how can we trust generative AI for something as serious as medical searches, and if we can’t trust it now, when might we be able to?
Dr. Bhashyam: “Trust” in the setting of medical searches and information is a complex topic. On the one hand, there are obvious concerns around hallucinations. In medical scenarios, ChatGPT and other LLMs can often present errors or falsehoods (even subtle ones) in such a convincing manner that the person making the query accepts the result. Sometimes the LLM can catch its mistakes, but other times it takes a professional with specialized knowledge to do so [6]. From a social context, changing public perception away from false information is hard, especially because false information can spread farther and faster than the truth [2, 12].
But in our study, we looked at a much more subtle concern regarding trust—what happens when an LLM gives information that is helpful to patients or a health professional, but there’s no easy way to validate it? And how is a patient or healthcare professional supposed to figure out what primary sources were used (or not used)? Let’s use lateral epicondylitis as an example. If you ask, “How long should I wait before getting surgery for lateral epicondylitis?” ChatGPT 4.0 will answer that the decision to proceed with surgery is individualized, and that most healthcare providers recommend considering conservative measures for several months before proceeding to surgery. And while that is a reasonable answer, we don’t know what sources it used to formulate that response or whether it considered evidence from the latest systematic reviews, which have increasingly found that symptoms resolve naturally in almost all patients; it just takes a lot longer than we think [5]. Healthcare is dynamic, with shifts in understanding and practice that can make the output of AI algorithms outdated or incorrect until they are updated (which is a time- and resource-intensive process) [13]. So when we don’t have access to sources, we don’t know whether the AI output is “keeping up.”
I’m not sure when we’ll be able to “fully” trust generative AI searches for medical information, but I suspect it’ll be different for each person. I live in Boston, where Salesforce has these great ads all over the airport talking about ethical generative AI for the average person—and it’s a great reminder that there are lots of people with the same concerns working actively on these issues. Compared with its initial versions, the newest version of ChatGPT does a much better job of including disclaimers and advocating for responsible use of medical information, so I think we’re headed in the right direction.

Abhiram R. Bhashyam MD, PhD
Footnotes
A note from the Editor-In-Chief: In “Editor’s Spotlight,” one of our editors provides brief commentary on a paper we believe is especially important and worthy of general interest. Following the explanation of our choice, we present “Take 5,” in which the editor goes behind the discovery with a one-on-one interview with an author of the article featured in “Editor’s Spotlight.” We welcome reader feedback on all of our columns and articles; please send your comments to eic@clinorthop.org.
The author certifies that there are no funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article related to the author or any immediate family members.
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
The opinions expressed are those of the writers, and do not reflect the opinion or policy of CORR® or the Association of Bone and Joint Surgeons®.
This comment refers to the article available at: 10.1097/CORR.0000000000002995.
References
1. Brameier DT, Alnasser AA, Carnino JM, Bhashyam AR, von Keudell AG, Weaver MJ. Artificial intelligence in orthopaedic surgery: can a large language model “write” a believable orthopaedic journal article? J Bone Joint Surg Am. 2023;105:1388-1392.
2. Cappola AR, Cohen KS. Strategies to improve medical communication. JAMA. 2024;331:70-71.
3. Davis P. Did ChatGPT just lie to me? Available at: https://scholarlykitchen.sspnet.org/2023/01/13/did-chatgpt-just-lie-to-me/. Accessed January 2, 2024.
4. IBM. What are AI hallucinations? Available at: https://www.ibm.com/topics/ai-hallucinations. Accessed January 2, 2024.
5. Ikonen J, Lähdeoja T, Ardern CL, Buchbinder R, Reito A, Karjalainen T. Persistent tennis elbow symptoms have little prognostic value: a systematic review and meta-analysis. Clin Orthop Relat Res. 2022;480:647.
6. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233-1239.
7. Mathes BM, Norr AM, Allan NP, Albanese BJ, Schmidt NB. Cyberchondria: overlap with health anxiety and unique relations with impairment, quality of life, and service utilization. Psychiatry Res. 2018;261:204-211.
8. Roose K. A conversation with Bing’s chatbot left me deeply unsettled. Available at: https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html. Accessed January 2, 2024.
9. Ross C. Backed by Mayo Clinic and Microsoft, a nonprofit forms to test AI tools used in health care. Available at: https://www.statnews.com/2024/01/08/ai-tools-health-care-nonprofit-chai-artificial-intelligence/. Accessed January 10, 2024.
10. Shen OY, Pratap JS, Li X, Chen NC, Bhashyam AR. How does ChatGPT use source information compared to Google? A text network analysis of online health information. Clin Orthop Relat Res. 2024;482:578-588.
11. Toma A, Senkaiahliyan S, Lawler PR, Rubin B, Wang B. Generative AI could revolutionize health care — but not if control is ceded to big tech. Available at: https://www.nature.com/articles/d41586-023-03803-y. Accessed January 11, 2024.
12. Vosoughi S, Roy D, Aral S. The spread of true and false news online. Science. 2018;359:1146-1151.
13. Wachter RM, Brynjolfsson E. Will generative artificial intelligence deliver on its promise in health care? JAMA. 2024;331:65-69.
