Clinical Orthopaedics and Related Research
2024 Mar 1;482(4):578–588. doi: 10.1097/CORR.0000000000002995

How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information

Oscar Y Shen 1,2, Jayanth S Pratap 1,3, Xiang Li 4, Neal C Chen 1, Abhiram R Bhashyam 1
PMCID: PMC10936961; PMID: 38517757

Abstract

Background

The lay public is increasingly using ChatGPT (a large language model) as a source of medical information. Traditional search engines such as Google provide several distinct responses to each search query and indicate the source of each response, but ChatGPT provides responses as prose paragraphs without citing the sources it used, which makes it difficult or impossible to ascertain whether those sources are reliable. One practical method to infer the sources used by ChatGPT is text network analysis. By understanding how ChatGPT uses source information relative to traditional search engines, physicians and physician organizations can better counsel patients on the use of this new tool.

Questions/purposes

(1) In terms of key content words, how similar are ChatGPT and Google Search responses for queries related to topics in orthopaedic surgery? (2) Does the source distribution (academic, governmental, commercial, or form of a scientific manuscript) differ for Google Search responses based on the topic’s level of medical consensus, and how is this reflected in the text similarity between ChatGPT and Google Search responses? (3) Do these results vary between different versions of ChatGPT?

Methods

We evaluated three search queries relating to orthopaedic conditions: “What is the cause of carpal tunnel syndrome?,” “What is the cause of tennis elbow?,” and “Platelet-rich plasma for thumb arthritis?” These were selected because of their relatively high, medium, and low consensus in the medical evidence, respectively. Each question was posed to ChatGPT version 3.5 and version 4.0 20 times for a total of 120 responses. Text network analysis using term frequency–inverse document frequency (TF-IDF) was used to compare text similarity between responses from ChatGPT and Google Search. In the field of information retrieval, TF-IDF is a weighted statistical measure of the importance of a keyword to a document in a collection of documents. Higher TF-IDF scores indicate greater similarity between two sources. TF-IDF scores are most often used to compare and rank the text similarity of documents. Using this type of text network analysis, text similarity between ChatGPT and Google Search can be determined by calculating and summing the TF-IDF for all keywords in a ChatGPT response and comparing it with each Google Search result to assess their text similarity to each other. In this way, text similarity can be used to infer relative content similarity. To answer our first question, we characterized the text similarity between ChatGPT and Google Search responses by finding the TF-IDF scores of the ChatGPT response and each of the 20 Google Search results for each question. Using these scores, we could compare the similarity of each ChatGPT response to the Google Search results. To provide a reference point for interpreting TF-IDF values, we generated randomized text samples with the same term distribution as the Google Search results. By comparing ChatGPT TF-IDF scores with those of the random text samples, we could assess whether the observed TF-IDF values differed significantly from values obtained by random chance, and we could test whether text similarity was an appropriate quantitative statistical measure of relative content similarity. To answer our second question, we classified the Google Search results to better understand sourcing. Google Search provides 20 or more distinct sources of information, but ChatGPT gives only a single prose paragraph in response to each query. So, to answer this question, we used TF-IDF to ascertain whether the ChatGPT response was principally driven by one of four source categories: academic, government, commercial, or material that took the form of a scientific manuscript but was not peer-reviewed or indexed on a government site (such as PubMed). We then compared the TF-IDF similarity between ChatGPT responses and each source category. To answer our third research question, we repeated both analyses and compared the results when using ChatGPT 3.5 versus ChatGPT 4.0.

Results

The ChatGPT response was dominated by the top Google Search result. For example, for carpal tunnel syndrome, the top result was an academic website with a mean TF-IDF of 7.2. A similar result was observed for the other search topics. To provide a reference point for interpreting TF-IDF values, a randomly generated sample of text compared with Google Search would have a mean TF-IDF of 2.7 ± 1.9, controlling for text length and keyword distribution. The observed TF-IDF distribution was higher for ChatGPT responses than for random text samples, supporting the claim that keyword text similarity is a measure of relative content similarity. When comparing source distribution, the ChatGPT response was most similar to the most common source category from Google Search. For the subject where there was strong consensus (carpal tunnel syndrome), the ChatGPT response was most similar to high-quality academic sources rather than lower-quality commercial sources (TF-IDF 8.6 versus 2.2). For topics with low consensus, the ChatGPT response paralleled lower-quality commercial websites compared with higher-quality academic websites (TF-IDF 14.6 versus 0.2). ChatGPT 4.0 had higher text similarity to Google Search results than ChatGPT 3.5 (mean increase in TF-IDF similarity of 0.80 to 0.91; p < 0.001). The ChatGPT 4.0 response was still dominated by the top Google Search result and reflected the most common search category for all search topics.

Conclusion

ChatGPT responses are similar to individual Google Search results for queries related to orthopaedic surgery, but the distribution of source information can vary substantially based on the relative level of consensus on a topic. For example, for carpal tunnel syndrome, where there is widely accepted medical consensus, ChatGPT responses had higher similarity to academic sources and therefore used those sources more. When fewer academic or government sources are available, especially in our search related to platelet-rich plasma, ChatGPT appears to have relied more heavily on a small number of nonacademic sources. These findings persisted even as ChatGPT was updated from version 3.5 to version 4.0.

Clinical Relevance

Physicians should be aware that ChatGPT and Google likely use the same sources for a specific question. The main difference is that ChatGPT can draw on multiple sources to create one aggregate response, whereas Google keeps its results distinct by presenting multiple separate sources. For topics with low consensus, and therefore few quality sources, there is a much higher chance that ChatGPT will use less-reliable sources; in that case, physicians should take the time to educate patients on the topic or provide resources that give more reliable information. Physician organizations should make it clear when the evidence is limited so that ChatGPT can reflect the lack of quality information or evidence.

Introduction

ChatGPT (OpenAI) has multiple potential applications in medicine [8, 11, 18, 20] and is increasingly used as a source of medical information in addition to, or as a replacement for, traditional search engines [16]. Currently, the most widely used method of obtaining general medical information is through search engines such as Google Search. ChatGPT differs from Google in two distinct ways. First, unlike Google, the answer that ChatGPT provides differs each time, even if the question posed to it remains the same. Second, although Google retrieves information that is already available on the internet and provides a list of links and sources, ChatGPT generates a response to a query using a large language model (LLM). An LLM is an algorithm that predicts which words are most likely to appear in a response by using the query to direct attention across information that is already available on the internet [18, 21]. ChatGPT and Google both present information about existing data on the internet, but in different forms: ChatGPT creates a readable summary based on the body of information available on the internet, whereas Google Search directs readers to the most popular websites with content relevant to the search query, using that same body of internet data.

This raises a number of interesting questions. For example, how similar are responses from ChatGPT and Google Search, and can this similarity give some insight into sources of data, especially for topics where there is less information available? Google makes its sources explicit, but the structure of LLMs makes it impossible to determine what sources ChatGPT uses to formulate its responses. In addition, direct requests may be susceptible to artificial intelligence hallucinations [2]. Understanding the differences between ChatGPT and Google can help physicians explain the benefits and limitations of each search tool to patients who use them to obtain medical information. Furthermore, because ChatGPT is still in development, it would be helpful for physicians to anticipate how future versions might change by comparing an older version to a newer version. To answer these questions, we felt that text network analysis could provide insight where it is otherwise difficult or impossible to obtain.

Therefore, we asked: (1) In terms of key content words, how similar are ChatGPT and Google Search responses for queries related to topics in orthopaedic surgery? (2) Does the source distribution (academic, governmental, commercial, or form of a scientific manuscript) differ for Google Search responses based on the topic’s level of medical consensus, and how is this reflected in the text similarity between ChatGPT and Google Search responses? (3) Do these results vary between different versions of ChatGPT?

Materials and Methods

In this study, we used repeated, identical queries of ChatGPT and posed the same query in Google Search to explore the relationship of ChatGPT’s answers to the top 20 Google Search results using a text network analysis (Fig. 1).

Fig. 1. This is a summary diagram of the study from data collection to analysis. TF-IDF = term frequency–inverse document frequency. A color image accompanies the online version of this article.

The following three queries were used: “What is the cause of carpal tunnel syndrome?,” “What is the cause of tennis elbow?,” and “Platelet-rich plasma for thumb arthritis?” We chose these three queries because the cause of carpal tunnel syndrome has a strong consensus, the cause of tennis elbow has a medium consensus, and the use of platelet-rich plasma (PRP) for thumb arthritis has a low consensus in the orthopaedic community.

Data Collection

ChatGPT

Each question was entered verbatim into ChatGPT 3.5 a total of 20 times, each time in a new chat to simulate a new patient inquiry. The responses were recorded verbatim and labeled (Fig. 2). This was repeated for ChatGPT 4.0, the latest version at the time of this study [19]. We chose to examine ChatGPT 3.5 because it was the free version available to the public, whereas ChatGPT 4.0 required a subscription. We also felt that comparing version 3.5 and version 4.0 would allow us to assess the robustness of our results and determine whether the findings are intrinsic to the LLM approach rather than dependent on a particular “build” or version of ChatGPT.

Fig. 2. Pictured here is an example ChatGPT response for the query, “What is the cause of carpal tunnel syndrome?”

Google Search

The questions were also entered verbatim into Google without the user being logged in, to avoid personalization of results, and the top 20 search results were identified. To assess sourcing, the Google Search results were classified into four categories for comparison: academic (website from an academic institution or organization, including peer-reviewed PubMed articles), government (website sponsored or run by the government, including PubMed), commercial (any other type of website), or form of a scientific manuscript (defined here as nonindexed scholarly journals, preprint server content, or non-peer-reviewed articles or reviews; indexed journals were captured as academic content). Text from each website was included or excluded with the goal of imitating the structure of ChatGPT responses. To compare the website text with the ChatGPT response, any website sections containing an overview or explanation of the condition were included. For websites that provided content in the form of scientific manuscripts but were not indexed on PubMed, the discussion section was used if the article was freely available; otherwise, the abstract or background section was used. Of note, most Google Search results in the “form of scientific manuscript” category were from nonindexed journals, preprint servers, or non-peer-reviewed articles or reviews.

Text Network Analysis

Text network analysis was used to compare the text similarity between ChatGPT and Google responses. This began with calculating a term frequency–inverse document frequency (TF-IDF) score. Term frequency is a measure of how frequently a term occurs in a document. Inverse document frequency diminishes the weight of terms that occur very frequently in the dataset and increases the weight of rare terms. Multiplying TF by IDF generates the TF-IDF score—a measure of the importance of a word to a text, while ensuring that commonly used words in general language are not overrepresented in determining the similarity between texts.
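As a concrete illustration of this calculation, the following sketch scores two short texts by summing the TF-IDF weights of their shared keywords. It uses scikit-learn in Python for brevity, whereas the study used the textnets package; the example texts are hypothetical.

```python
# Minimal sketch: compute TF-IDF vectors for a pair of documents and score
# their similarity as the summed product of shared keyword weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    # Hypothetical ChatGPT response (abridged)
    "Carpal tunnel syndrome is caused by pressure on the median nerve in the wrist.",
    # Hypothetical Google Search result (abridged)
    "Compression of the median nerve as it passes through the wrist causes carpal tunnel syndrome.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # rows = documents, columns = keywords

# Unnormalized dot product: higher scores indicate more shared, distinctive vocabulary.
similarity = tfidf[0].multiply(tfidf[1]).sum()
print(f"TF-IDF similarity: {similarity:.3f}")
```

With only two documents, the IDF term is nearly uniform; the same code scales to a full collection of ChatGPT responses and Google Search results, where rare terms carry more weight.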

In a text network graph, each website or response is considered a node. A text network examines the connections between nodes. The strength of a connection between two nodes is calculated by taking their TF-IDF scores and computing a similarity metric (Fig. 3). A text network analysis provides insight into how similar the words used in two texts are. Consequently, if the strength of the text connection between a ChatGPT response and a Google Search source classification is high, the source of the ChatGPT response can be inferred. All websites and responses were included in our text network analysis, adjusting for the fact that ChatGPT was queried multiple times (Fig. 3).

Fig. 3. This is the aggregation method for calculating the average similarity for ChatGPT responses. Each ChatGPT response is represented as a node in the network. The text sample from each website is also represented as a node. The neighborhood of a ChatGPT node is defined as the set of all website nodes it is connected to. For each ChatGPT node, we calculated the similarity with each website in its neighborhood and took the average across all ChatGPT nodes. This results in a list of similarity scores for each website, averaged across all ChatGPT responses.

To provide a reference point for interpreting TF-IDF values, we generated randomized text samples following the same term distribution as the Google Search results (Supplemental Digital Content 1; http://links.lww.com/CORR/B271). By comparing ChatGPT TF-IDF scores with those of the random text samples, we could assess whether the observed TF-IDF values differed significantly from values obtained by random chance, and we could test whether text similarity was an appropriate quantitative statistical measure of relative content similarity [1].
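A sketch of how such a baseline might be generated follows. The sampling details (sample length, number of draws, and the pooled tokens) are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch: generate a random text sample that follows the same
# term distribution as a source corpus, to serve as a chance-level baseline.
import random
from collections import Counter

def random_text_like(corpus_tokens, length, seed=0):
    """Draw `length` tokens with probability proportional to corpus frequency."""
    rng = random.Random(seed)
    counts = Counter(corpus_tokens)
    words, weights = zip(*counts.items())
    return " ".join(rng.choices(words, weights=weights, k=length))

# Hypothetical pooled tokens from the Google Search results for one query
corpus_tokens = ("carpal tunnel syndrome median nerve wrist pressure "
                 "compression cause symptoms numbness").split()
baseline = random_text_like(corpus_tokens, length=60)
# Scoring `baseline` against each Google Search result (as in the TF-IDF
# sketch above) yields the null distribution of chance-level similarities.
```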

Text Normalization and Stop Words Removal

The standard steps for a text network analysis are text normalization, stop words removal, and text-to-network conversion. These steps are necessary when comparing text segments to identify the key content words that are most meaningful to a text. Each ChatGPT response and each Google Search result was prepared in the same fashion using a common text networking package in R. Text normalization cleans the text to remove differences in tense or phrasing. This is done by converting all words into their lemmas—the base form of a word that represents all of its possible forms. For example, “run” is the lemma of “ran,” “runs,” and “running.” This collapses variations of the same word, because they have the same meaning, and allows for more useful comparisons between texts. Syntax information was also removed because it is not used in a text network analysis. Stop words, which serve a grammatical function but do not contribute to the meaning of a sentence (such as “a,” “and,” and “the”), were removed. Numbers were removed from the text as well.
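The following sketch shows one way these normalization steps can be reproduced. It uses spaCy in Python as an illustrative equivalent of the R package the study used; exact lemmas may vary by model version.

```python
# Sketch of the normalization pipeline: lemmatize, drop stop words, drop
# numbers and punctuation. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def normalize(text):
    doc = nlp(text)
    return [
        tok.lemma_.lower()      # "ran", "runs", "running" -> "run"
        for tok in doc
        if not tok.is_stop      # drop "a", "and", "the", ...
        and not tok.like_num    # drop numbers
        and tok.is_alpha        # drop punctuation and symbols
    ]

print(normalize("The patient ran 5 miles, and running worsened the symptoms."))
# Approximately: ['patient', 'run', 'mile', 'run', 'worsen', 'symptom']
```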

Text-to-network Conversion

Text-to-network conversion is the process of converting the processed text into a visualized text network. The strength of association between texts is scored by calculating the TF-IDF for each website and response [22]. Then, the TF-IDF vectors are compared among texts using the sum of TF-IDF for all key content words. Finally, these scores are converted into a text network by representing each text sample (both the sources and the ChatGPT responses) as nodes and mapping the pairwise similarities as connections in a broader network.
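A minimal version of this conversion is sketched below with networkx and scikit-learn; the node texts are hypothetical stand-ins for real responses and websites, and the study itself used the textnets package.

```python
# Sketch of text-to-network conversion: each text sample becomes a node,
# and each pairwise TF-IDF similarity becomes a weighted edge.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

texts = {  # hypothetical, abridged node texts
    "chatgpt_1": "carpal tunnel syndrome is caused by median nerve compression",
    "mayo": "pressure on the median nerve causes carpal tunnel syndrome",
    "webmd": "carpal tunnel develops when the median nerve is squeezed at the wrist",
}
labels = list(texts)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts.values())

G = nx.Graph()
G.add_nodes_from(labels)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        weight = tfidf[i].multiply(tfidf[j]).sum()  # summed shared-keyword TF-IDF
        if weight > 0:
            G.add_edge(labels[i], labels[j], weight=weight)
```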

To remove weak connections and improve visualization, the network is then pruned (α = 0.25) and formatted using bold lines and color. Bolder lines are drawn between nodes with higher TF-IDF scores to indicate greater similarity. A preprogrammed function in R (Leiden community detection) is used to colorize nodes automatically based on TF-IDF similarity.
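One way to approximate the pruning and community detection in Python is sketched below with python-igraph. Here α = 0.25 is applied as a simple edge-weight threshold for illustration, which may differ from the backbone-extraction criterion the R package implements; the edge weights are hypothetical.

```python
# Sketch: prune weak edges, then detect communities with the Leiden algorithm.
import igraph as ig

edges = [("chatgpt_1", "mayo", 3.2), ("chatgpt_1", "webmd", 0.4), ("mayo", "webmd", 0.2)]
alpha = 0.25
kept = [(u, v, w) for u, v, w in edges if w >= alpha]  # prune weak connections

g = ig.Graph.TupleList(((u, v) for u, v, _ in kept), directed=False)
g.es["weight"] = [w for _, _, w in kept]

communities = g.community_leiden(objective_function="modularity", weights="weight")
for idx, members in enumerate(communities):
    print(idx, [g.vs[m]["name"] for m in members])  # one color per community
```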

Aggregate Response Analysis to Compare ChatGPT and Google Responses

We compared how similar the “average” ChatGPT response to a particular query was to each of the top 20 Google Search results. To account for variability in ChatGPT responses, we aggregated the responses for each query using the sum of the TF-IDF similarities between the ChatGPT responses and each Google Search result (Fig. 3). This allowed us to see how similar the average ChatGPT response was to each website.
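Under the same assumptions as the earlier sketches (hypothetical texts, scikit-learn in place of textnets), the aggregation might look like this:

```python
# Sketch of Figure 3's aggregation: average each website's TF-IDF similarity
# across all ChatGPT responses to the same query.
from sklearn.feature_extraction.text import TfidfVectorizer

chatgpt_responses = [  # 20 per query in the study; two hypothetical ones here
    "carpal tunnel syndrome results from compression of the median nerve",
    "pressure on the median nerve in the wrist causes carpal tunnel syndrome",
]
websites = {  # hypothetical, abridged website texts
    "mayo": "carpal tunnel syndrome is caused by pressure on the median nerve",
    "webmd": "the median nerve is squeezed at the wrist in carpal tunnel syndrome",
}

corpus = chatgpt_responses + list(websites.values())
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
n = len(chatgpt_responses)

avg_similarity = {
    site: sum(tfidf[i].multiply(tfidf[n + k]).sum() for i in range(n)) / n
    for k, site in enumerate(websites)
}
print(avg_similarity)  # one averaged similarity score per website
```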

Aggregate Response Analysis to Compare ChatGPT and Google Source Type

Similarly, we compared how similar the “average” ChatGPT response to a particular query was to each source category of the Google Search results (academic, governmental, commercial, or form of a scientific manuscript). To calculate the summed TF-IDF for each type of source, all websites of a single category were treated as a single text. The text network analysis was then repeated using the four categories rather than considering each Google result individually. The aggregate response analysis was the same, except that each ChatGPT response was compared with all of the websites in a specific category rather than with an individual website.
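A sketch of this category-level aggregation follows; the website texts and category assignments are illustrative.

```python
# Sketch: concatenate all websites in each source category into a single
# text, then rerun the same TF-IDF comparison against the category texts.
site_texts = {  # hypothetical, abridged website texts
    "mayo": "pressure on the median nerve causes carpal tunnel syndrome",
    "nih": "carpal tunnel syndrome results from median nerve compression",
    "darrow": "our clinic treats wrist pain with regenerative injections",
}
site_category = {"mayo": "academic", "nih": "government", "darrow": "commercial"}

category_texts = {}
for site, text in site_texts.items():
    cat = site_category[site]
    category_texts[cat] = (category_texts.get(cat, "") + " " + text).strip()

# category_texts now maps each source category to one combined document;
# the aggregate response analysis then proceeds exactly as before, with
# these category texts in place of the individual websites.
```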

Primary and Secondary Study Outcomes

Primary Study Outcome: ChatGPT 3.5 and Google Search Similarity

Text network analysis with TF-IDF was used to compare text similarity between responses from ChatGPT and Google Search for queries related to topics in orthopaedic surgery.

In the field of information retrieval, TF-IDF is a weighted statistical measure of the importance of a keyword to a document in a collection of documents. Higher TF-IDF scores indicate greater similarity between two sources. TF-IDF scores are most often used to compare and rank the text similarity of documents. Using this type of text network analysis, content word text similarity between ChatGPT and Google Search can be determined by calculating and summing the TF-IDF for all keywords in a ChatGPT response and comparing it with each Google Search result to assess their text similarity to each other. We then created a text network, which is a mathematical method to describe relationships between objects. Using TF-IDF similarity, we visualized these relationships and made qualitative insights (Fig. 3). These methods are described in detail above.

Secondary Study Outcomes

Our secondary question was to determine whether the distribution of sources for Google Search (academic, governmental, commercial, or form of a scientific manuscript) varied based on the topic’s level of medical consensus. Google Search provides 20 or more distinct sources of information, but ChatGPT gives only a single prose paragraph. So, to answer this question, we used TF-IDF to ascertain whether the ChatGPT response was principally driven by one of the four source categories. We then compared the TF-IDF similarity between ChatGPT responses and each source category.

To answer our third question, we repeated all of our analyses and compared the results when using ChatGPT version 3.5 versus version 4.0.

Ethical Approval

Our study did not involve human or animal data and required no ethical board approval.

Statistical Analysis

All text processing and network analysis was performed in R (version 4.2.2) on the RStudio platform (version 2022.12.0+353) and in Python (version 3.11.0) using the textnets library (versions 0.1.1 and 0.8.7 for R and Python, respectively) [3]. The statistical analysis was done in Microsoft Office Excel (2023, version 16.71).
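The study's statistics were computed in Excel. As an illustration of the kind of comparison reported in the Results (observed similarities versus the random-text baseline), a two-sample t-test in Python might look like the sketch below; the values are entirely hypothetical, and the specific test is an assumption, since the article does not name one.

```python
# Illustrative significance check: do observed ChatGPT-to-Google TF-IDF
# scores exceed those of the random-text baseline? Values are hypothetical.
from scipy import stats

observed = [7.2, 6.8, 7.9, 5.5, 8.1, 6.3]   # hypothetical observed TF-IDF scores
baseline = [2.7, 1.9, 3.1, 2.2, 2.5, 2.9]   # hypothetical random-text scores

t_stat, p_value = stats.ttest_ind(observed, baseline, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```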

Results

In Terms of Key Content Words, How Similar Are ChatGPT and Google Search Responses?

In general, ChatGPT appeared to use sources similar to those of Google Search. The ChatGPT response was dominated by the top Google Search result (Fig. 4). The response given by ChatGPT 3.5 for carpal tunnel syndrome was moderately to strongly similar to the Mayo Clinic website (TF-IDF 7.2), with relatively limited variation (standard deviation 1.6). A similar result was observed for the other topics. For tennis elbow, ChatGPT 3.5 was moderately similar to the Mayo Clinic website (TF-IDF 4.3 ± 1.3). For PRP, the answer from ChatGPT 3.5 strongly resembled the website of the Darrow Stem Cell Institute (a private clinic offering stem cell injections) (TF-IDF 11.5) but varied more from response to response than for the carpal tunnel syndrome or tennis elbow queries (SD 2.5). Of note, for PRP, the response from ChatGPT 3.5 did not strongly resemble the American Academy of Orthopaedic Surgeons website, the only source for this topic from a recognized orthopaedic organization (TF-IDF 2.7 ± 0.5).

Fig. 4. The TF-IDF similarity and text networks for individual websites and aggregate ChatGPT responses for (A-B) carpal tunnel syndrome, (C-D) tennis elbow, and (E-F) platelet-rich plasma are shown. The statistical baseline is calculated from text similarity to randomly generated text. AAOS = American Academy of Orthopaedic Surgeons; NIH = National Institutes of Health; AANS = American Association of Neurological Surgeons; ASSH = American Society for Surgery of the Hand; NHS = National Health Service; MNT = Medical News Today; MSK = Memorial Sloan Kettering; ACR = American College of Rheumatology; MSSPC = Michigan Surgery Specialists, P.C.; OIoP = Orthopaedic Institute of Pennsylvania; UConn = University of Connecticut; CCOHS = Canadian Centre for Occupational Health and Safety; HSE = Health and Safety Executive; ROSM = Regenerative Orthopedics & Sports Medicine. A color image accompanies the online version of this article.

To provide a reference point for interpreting TF-IDF values, a randomly generated sample of text compared with Google Search would have an average TF-IDF of 2.66 ± 1.90, controlling for text length and term distribution. The observed TF-IDF distribution was statistically significantly higher for ChatGPT responses than for random text samples—this finding supports the claim that keyword text similarity is a measure of relative content similarity (p < 0.001).

Does the Source Distribution Differ for Google Search Responses Based on the Topic’s Level of Medical Consensus?

For topics such as carpal tunnel syndrome and tennis elbow, which have been well studied and for which there is high or medium consensus in the medical community, a large number of academic and government websites rank highly in Google Search results (Table 1). For PRP, there are only limited data and limited consensus, which results in Google providing almost exclusively commercial sources. The only academic source for PRP was a summary of a recently published scientific article rather than a traditional explanation of the condition and treatment like those available for carpal tunnel syndrome and tennis elbow.

Table 1. Categories of the top 20 Google Search results for each query

                          Academic   Government   Commercial   Manuscript
Carpal tunnel syndrome        9           6            5            0
Tennis elbow                  8           4            8            0
Platelet-rich plasma          1           0           11            8

Academic: websites belonging to an academic institution or society (such as a university or the American Academy of Orthopaedic Surgeons). Government: websites belonging to a government entity (such as the National Institutes of Health or the United Kingdom National Health Service). Commercial: websites belonging to a private organization (such as a private practice or WebMD). Manuscript: websites of a published work of research.

When comparing source distribution, the ChatGPT response was most similar to the most common source category from Google Search (Fig. 5). For subjects where there was strong consensus (carpal tunnel syndrome), the ChatGPT response was most similar to high-quality academic sources rather than lower-quality commercial sources (TF-IDF 8.6 versus 2.2). For topics with low consensus, the ChatGPT response paralleled lower-quality commercial websites rather than higher-quality academic websites (TF-IDF 14.6 versus 0.2).

Fig. 5. The TF-IDF similarity and text networks for categorized websites and aggregate ChatGPT responses for (A-B) carpal tunnel syndrome, (C-D) tennis elbow, and (E-F) PRP are shown. The statistical baseline is calculated from text similarity to randomly generated text. A color image accompanies the online version of this article.

Do These Results Vary Across Different Versions of ChatGPT?

ChatGPT 4.0 had higher TF-IDF similarity than ChatGPT 3.5 for individual and categorized websites across all questions, with mean increases of 0.80 (95% confidence interval 0.42 to 1.18; p < 0.001), 0.91 (95% CI 0.41 to 1.40; p < 0.001), and 0.83 (95% CI 0.44 to 1.22; p < 0.001) for carpal tunnel syndrome, tennis elbow, and PRP, respectively. The ChatGPT 4.0 response was still dominated by the top Google Search result and reflected the most common search category for all search topics.

Discussion

As ChatGPT becomes more widely used as a source of medical information, physicians will need to develop a basic understanding of its advantages and limitations to guide patients. Although it is well known that ChatGPT responses can be variable and potentially unreliable, it is impossible to determine what sources ChatGPT relies on when generating responses. In fact, ChatGPT users have likely experienced false claims and fictitious references, called hallucinations, when asking for sources or reference material [2]. For this reason, we used text network analysis to better understand the potential sourcing of ChatGPT responses. We found that ChatGPT responses are similar to individual Google Search results for queries related to orthopaedic surgery, but the distribution of source information can vary substantially based on the relative level of consensus on a topic. For example, for conditions such as carpal tunnel syndrome, where there is widely accepted medical consensus, ChatGPT responses had higher similarity to academic sources and therefore used those sources more. When fewer academic or government sources are available, especially in our search related to PRP, ChatGPT appears to have relied more heavily on a small number of nonacademic sources. These findings persisted even as ChatGPT was updated from version 3.5 to version 4.0.

Limitations

This study has several limitations inherent to text network analysis using TF-IDF. Although TF-IDF is a generally accepted method of measuring text similarity as an inference about content similarity [7, 17], the nuance needed to understand medical texts may be lost in TF-IDF because it only identifies the presence of keywords without considering their context or meaning. Therefore, despite a high TF-IDF similarity to a given website, the message conveyed by ChatGPT could still be different. To offset this limitation, we compared our results with a distribution of randomly generated text samples and found a difference. This supports our use of text similarity as a marker of content similarity. Instead of TF-IDF, alternative analytic options could include Word2Vec or BERT vectorization. These methods may preserve semantic information and identify important relationships in medical statements, but this has not been conclusively validated [9, 12, 14]. Another limitation is that ChatGPT’s responses occasionally include tangential topics, such as treatment or management of the condition, even when not specifically queried. Because we did not include treatment or management information from the Google Search websites, the TF-IDF similarity might change with that added information. However, even when treatment and management information appeared in ChatGPT responses, those sections were short compared with the rest of the response, so we expect the impact on our results was limited.

In Terms of Key Content Words, How Similar Are ChatGPT and Google Search Responses?

In this study, we found that ChatGPT responses were similar to individual Google Search results for queries related to orthopaedic surgery. This implies that ChatGPT and Google likely use the same sources for a specific question. Moreover, the most popular Google Search results appeared to have the strongest influence on ChatGPT, suggesting that patients are likely to come away with a similar understanding of a given topic regardless of which modality they use. Some lower-ranked Google Search results had high text similarity, likely because those sources mimic the style and content of higher-ranked sources. One key difference, however, is that patients using Google are likely more aware that their information comes from different sources and can thereby judge the trustworthiness of those sources. Taken together, our results help to explain how ChatGPT may provide answers if used as an artificial intelligence–based chatbot for medicine, relative to a Google Search [6, 10, 13]. However, when used in that context, further investigation is needed to assess the readability and accuracy of ChatGPT compared with other websites to fully understand the quality of the presented information. If ChatGPT is inaccurate or hard to understand, it could have negative implications, especially for those with limited health literacy [15].

Does the Source Distribution Differ for Google Search Responses Based on the Topic's Level of Medical Consensus?

In this study, we found that the distribution of source information varied based on the relative level of medical consensus on a topic. For example, for conditions with a widely accepted medical consensus, ChatGPT responses were more similar to widely published academic and government sources. When fewer academic or government sources were available, especially for topics with lower medical consensus, ChatGPT responses were more likely to include information from the small number of nonacademic sources available. These are settings where it may be more helpful for physician organizations to provide summaries, if only to note that the evidence is limited, because this would give ChatGPT more reliable sources to anchor its responses. These findings reinforce the importance of physicians engaging in the development and oversight of this technology as it acquires more clinical applications, including engaging with technology companies through advisory boards, helping to create medicine-specific LLMs, and helping to design which results are presented when LLMs are used for medical queries [5].

Do These Results Vary Between Different Versions of ChatGPT?

ChatGPT is in constant development, and one concern could be that our results were an artifact of a particular ChatGPT version. This was not the case. Our findings persisted for all questions even as ChatGPT was updated from version 3.5 to version 4.0. Other studies comparing ChatGPT versions found similar results, and performance on surgery topics was noticeably inferior in quality and accuracy to that observed in other medical fields [4, 23]. The most important aspect of this finding is that it suggests our results are intrinsic to the use of LLMs for medical queries rather than dependent on a particular “build” or version of ChatGPT.

Conclusion

Physicians should be aware that ChatGPT and Google likely use the same sources for a specific question. The main difference is that ChatGPT can draw on multiple sources to create one aggregate response, whereas Google keeps its results distinct by presenting multiple separate sources. For topics with low consensus, and therefore few quality sources, there is a much higher chance that ChatGPT will use less-reliable sources; in that case, physicians should take the time to educate patients on the topic or provide resources that give more reliable information. Physician organizations should make it clear when the evidence is limited so that ChatGPT can reflect the lack of quality information or evidence.

Supplementary Material

abjs-482-578-s001.docx (21KB, docx)

Footnotes

The first two authors contributed equally to this manuscript.

Each author certifies that there are no funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article related to the author or any immediate family members.

All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.

Ethical approval was not sought for this study.

This work was performed at Massachusetts General Hospital, Boston, MA, USA.

Contributor Information

Oscar Y. Shen, Email: oscar.shen11@gmail.com.

Jayanth S. Pratap, Email: jaypratap@college.harvard.edu.

Xiang Li, Email: XLI60@mgh.harvard.edu.

Neal C. Chen, Email: NCHEN1@partners.org.

References

1. Aizawa A. An information theoretic perspective of TF-IDF measures. Information Processing & Management. 2003;39:45-65.
2. Ariyaratne S, Iyengar KP, Nischal N, Babu NC, Botchu R. A comparison of ChatGPT-generated articles with human-written articles. Skeletal Radiol. 2023;52:1755-1758.
3. Boy J. Textnets: a Python package for text analysis with networks. Journal of Open Source Software. 2020;5:2594.
4. Brin D, Sorin V, Vaid A, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13:16492.
5. Crossnohere NL, Elsaid M, Paskett J, Bose-Brill S, Bridges JFP. Guidelines for artificial intelligence in medicine: literature review and content analysis of frameworks. J Med Internet Res. 2022;24:e36823.
6. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595.
7. Garcelon N, Neuraz A, Benoit V, et al. Finding patients using similarity measures in a rare diseases-oriented clinical data warehouse: Dr. Warehouse and the needle in the needle stack. J Biomed Inform. 2017;73:51-61.
8. Grant N, Metz C. A new chat bot is a ‘code red’ for Google’s search business. The New York Times. Available at: https://www.nytimes.com/2022/12/21/technology/ai-chatgpt-google-search.html. Accessed January 8, 2024.
9. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2019;36:1234-1240.
10. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233-1239.
11. Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health. 2023;5:e105-e106.
12. Luo R, Sun L, Xia Y, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23:bbac409.
13. Menichetti J, Hillen MA, Papageorgiou A, et al. How can ChatGPT be used to support healthcare communication research? Patient Education and Counseling. 2023;115:107947.
14. Minarro-Giménez JA, Marín-Alonso O, Samwald M. Exploring the application of deep learning techniques on medical text corpora. Stud Health Technol Inform. 2014;205:584-588.
15. Momenaei B, Wakabayashi T, Shahlaee A, et al. Appropriateness and readability of ChatGPT-4-generated responses for surgical treatment of retinal diseases. Ophthalmol Retina. 2023;7:862-868.
16. Moons P, Van Bulck L. ChatGPT: can artificial intelligence language models be of value for cardiovascular nurses and allied health professionals. Eur J Cardiovasc Nurs. 2023;22:e55-e59.
17. Naderi H, Madani S, Kiani B, Etminani K. Similarity of medical concepts in question and answering of health communities. Health Informatics J. 2020;26:1443-1454.
18. OpenAI. ChatGPT: optimizing language models for dialogue. Available at: https://openai.com/blog/chatgpt/. Accessed March 2, 2023.
19. OpenAI. GPT-4 technical report. Available at: https://cdn.openai.com/papers/gpt-4.pdf. Accessed March 2, 2023.
20. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023;5:e107-e108.
21. Schade M. How ChatGPT and our language models are developed. Available at: https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-language-models-are-developed. Accessed March 2, 2023.
22. Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation. 1972;28:11-21.
23. Taloni A, Borselli M, Scarsi V, et al. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci Rep. 2023;13:18562.
