Journal of Bone and Mineral Research. 2024 Jan 4;39(2):106–115. doi: 10.1093/jbmr/zjad007

The performance of artificial intelligence chatbot large language models to address skeletal biology and bone health queries

Michelle Cung 1,#, Branden Sosa 2,#, He S Yang 3, Michelle M McDonald 4,5,6, Brya G Matthews 7,8, Annegreet G Vlug 9, Erik A Imel 10, Marc N Wein 11, Emily Margaret Stein 12,13,14, Matthew B Greenblatt 15,16
PMCID: PMC11184616  NIHMSID: NIHMS1995981  PMID: 38477743

Abstract

Artificial intelligence (AI) chatbots utilizing large language models (LLMs) have recently garnered significant interest due to their ability to generate humanlike responses to user inquiries in an interactive dialog format. While these models are increasingly used by patients, scientific and medical providers, and trainees to address biomedical questions, their performance may vary from field to field. The opportunities and risks these chatbots pose to the widespread understanding of skeletal health and science are unknown. Here we assess the performance of 3 high-profile LLM chatbots, Chat Generative Pre-Trained Transformer (ChatGPT) 4.0, BingAI, and Bard, in addressing questions in 3 categories: basic and translational skeletal biology, clinical practitioner management of skeletal disorders, and patient queries. Thirty questions in each category were posed, and responses were independently graded for their degree of accuracy by four reviewers. While each of the chatbots was often able to provide relevant information about skeletal disorders, the quality and relevance of these responses varied widely, and ChatGPT 4.0 had the highest overall median score in each of the categories. Each of these chatbots displayed distinct limitations that included inconsistent, incomplete, or irrelevant responses, inappropriate utilization of lay sources in a professional context, a failure to take patient demographics or clinical context into account when providing recommendations, and an inability to consistently identify areas of uncertainty in the relevant literature. Careful consideration of both the opportunities and risks of current AI chatbots is needed to formulate guidelines for best practices for their use as a source of information about skeletal health and biology.

Keywords: artificial intelligence, large language models, ChatGPT, BingAI, Bard, skeletal biology

Lay Summary

Artificial intelligence chatbots are increasingly used as a source of information in health care and research settings due to their accessibility and ability to summarize complex topics using conversational language. However, it is still unclear whether they can provide accurate information for questions related to the medicine and biology of the skeleton. Here, we tested the performance of three prominent chatbots—ChatGPT, Bard, and BingAI—by tasking them with a series of prompts based on well-established skeletal biology concepts, realistic physician–patient scenarios, and potential patient questions. Despite their similarities in function, differences in the accuracy of responses were observed across the three chatbot services. While the chatbots performed well in some contexts, in other cases strong limitations were observed, including inconsistent consideration of clinical context and patient demographics, occasional provision of incorrect or out-of-date information, and citation of inappropriate sources. With careful consideration of their current weaknesses, artificial intelligence chatbots offer the potential to transform education on skeletal health and science.

Introduction

Natural language processing is a type of artificial intelligence that relies on algorithms to understand and imitate human language. Models built on this approach are designed to perform complex tasks through extensive training on textual data, which allows them to generate relevant responses as an extension of a user-provided textual input prompt. With the emergence of Chat Generative Pre-Trained Transformer (ChatGPT) in November 2022, there has been growing interest in the applicability of artificial intelligence in the fields of health care practice and biomedical research. Developed by OpenAI, ChatGPT is an advanced conversational bot that has been trained on extensive textual datasets and optimizes its responses based on user-provided input text prompts.1 Due to the rising popularity of ChatGPT, other software companies have rapidly released similar artificial intelligence (AI) chatbots, with Microsoft releasing BingAI in February 2023, followed by the launch of Google’s Bard in March 2023. Both ChatGPT 4.0 and BingAI are currently powered by GPT-4, a multimodal large language model (LLM) that has been trained on a wide variety of textual data with a training cutoff of September 2021.1 GPT-4 is pretrained on a corpus of both publicly available and licensed data from third-party sources, but the exact sources have not been disclosed.1 Bard is another LLM AI chatbot that operates on Language Model for Dialogue Applications (LaMDA), an LLM developed by Google. LaMDA was pretrained using a dataset called Infiniset, consisting of public dialog data and web documents such as public forums, C4 data, code documents from programming-related sites, the English version of Wikipedia, and non-English web documents.2 Both LLMs were fine-tuned using reinforcement learning guided by human feedback. Overall, these LLM-powered chatbots are designed to engage with users by mimicking human conversation and to instantly provide concise summaries of specific topics that integrate multiple relevant textual sources. Evidence is already emerging for a potential role for LLM chatbots in biomedical applications, with one study finding that ChatGPT gave responses to patient questions posted on a social media forum that were rated as higher quality and more empathetic than those provided by physicians.3–5 ChatGPT has also been found capable of obtaining passing scores on several professional medical examinations, including practice examinations for the USMLE Step 1 and Step 2 examinations.6–8 AI chatbots have also been used as public health tools to address questions about SARS-CoV-2 vaccination with reported success.9–11

Although these LLM-based artificial intelligence chatbots are remarkably popular and have many potential uses, their utilization has also been met with controversy. The broad accessibility and user-friendly interface of AI chatbots have garnered interest among medical personnel due to their potential to develop patient-specific treatment plans or address unusual or complicated medical issues.12 Similarly, patients are increasingly utilizing LLMs as a source of information regarding medications, triage of symptoms, and instructions for managing treatment, and to learn more about their diagnoses. LLMs are also likely to receive scientific queries from trainees and from colleagues in other fields of medicine and biology, particularly those seeking information outside their own specialties, to summarize key findings from relevant research articles and to provide summaries of fundamental topics. Chatbots have already been utilized as “research assistants” to assist with data analysis, writing code, summarizing relevant literature, or generating or editing manuscript content.13–15 Similarly, AI chatbots are being considered for possible utilization in undergraduate medical curricula.16 However, these generative artificial intelligence tools have not been vetted by biomedicine experts and did not necessarily receive training on specialized biomedical topics,17,18 leading to the important concern that they may provide inaccurate or outdated information. Further, though the details of the data used to train each LLM strongly influence its performance, it is unknown whether professionally validated textual sources on the biology and medicine of the skeleton were well represented in these datasets. Another concern regarding the use of these chatbots as an educational resource is their likelihood to reproduce long-standing viewpoints from the training datasets rather than more recent findings, reflecting how thoroughly traditional concepts are summarized in textbooks, webpages, and research articles, coupled with an inability to appropriately weigh the significance of new discoveries. The degree to which these LLMs represent a useful tool for understanding bone biology and managing skeletal health versus a source of misinformation is an important question facing bone health professionals. Clear information about the current state of these LLM chatbots is necessary both to formulate professional guidelines and policies about best practices for utilizing or avoiding these tools in settings relevant to skeletal biology and to understand how these tools will shape patient understanding of skeletal health.

Here, we explore the features of currently available AI chatbots to examine the reliability and shortcomings of their ability to provide scientific and clinical information relevant to the skeleton. The performance of ChatGPT 4.0, Bard, and BingAI was assessed in three categories: (1) bone physiology/pathophysiology, (2) physician management of skeletal health, and (3) patient queries relevant to bone. Questions were designed either to test their knowledge of established concepts in skeletal biology or to present them with realistic patient assessment scenarios to determine their medical decision-making capabilities.

Materials and Methods

Data collection and question preparation

Queries were assembled to address three different categories: bone physiology/pathophysiology, physician management, and patient queries. The prompts within each category were compiled by the authors to sample a wide variety of topics, for a total of 30 questions per category. At the time of this study, ChatGPT and BingAI were operating on GPT-4. For Microsoft’s BingAI, all conversations were conducted using the “more precise” setting. These three chatbots were selected for evaluation based on both their high degree of popularity and their relative ease of access at the time of this study. On all conversational AI services, a new chat session was started after each question to prevent the previous response from influencing the response to subsequent questions. Since Bard presents three drafts for most queries, we recorded the first draft as the final response. Responses were collected between April 14 and April 22, 2023.
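The study posed each question through the chatbots’ public web interfaces, opening a new chat for every prompt. For readers who wish to apply the same fresh-session-per-prompt design programmatically, the sketch below shows one possible approach against the OpenAI chat API; this is an illustrative assumption only (the client setup, model name, and example prompts were not part of the original workflow, which did not use an API).

```python
# Illustrative only: the study used the public web interfaces, not an API.
# Assumes the `openai` Python package (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

prompts = [
    "What are the secondary causes of osteoporosis?",
    "How can osteocytes help determine how bone is formed and resorbed?",
]

responses = []
for prompt in prompts:
    # Each prompt is sent as its own single-message conversation, so no earlier
    # exchange can influence the answer (mirroring the new-chat-per-question design).
    reply = client.chat.completions.create(
        model="gpt-4",  # model name is an assumption for illustration
        messages=[{"role": "user", "content": prompt}],
    )
    responses.append(reply.choices[0].message.content)
```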

Scoring

Seven individuals who are physicians and/or metabolic bone researchers at academic medical centers served as reviewers (MBG, MNW, EMS, BGM, MMM, AGV, and EAI). Each response was scored independently by four of those seven reviewers, with categories assigned to reviewers based on their area of expertise. Responses were graded on the following rating scale:
−2: the response is critically incorrect or irrelevant to the main point of the prompt.
−1: the response contains important omissions or partially incorrect information, compromising the utility of the information.
0: the chatbot refused to provide a response.
1: the response contains some minor inaccuracies or omissions; however, these do not strongly undermine the accuracy or usability of the response.
2: the response is essentially correct and covers the most critical salient points expected within the context of a short response.
This scoring system was chosen to reflect that declining to respond is preferable to providing an incorrect or misleading answer in many clinical and research contexts. One reviewer deferred providing a score for two questions under the bone physiology category because the topic was outside of their area of expertise. For analysis, the median of the scores provided by the four reviewers assigned to that category was reported. LLM AI chatbots were not involved in scoring responses or in drafting this manuscript.

Statistical analysis

Statistical analysis was conducted using GraphPad Prism version 9.5.1. Friedman’s test, a nonparametric equivalent of repeated-measures ANOVA, was applied with Dunn’s multiple comparisons post hoc test. For graphical representation and analysis, the median of the scores reported by the four reviewers was used. P-values <.05 were considered statistically significant. Kendall’s coefficient of concordance (Kendall’s W) was used to assess interrater reliability and was calculated using the “DescTools” package in RStudio version 2023.06.1.
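The analysis pipeline (per-question medians, Friedman test across chatbots, Kendall’s W among reviewers) can be reproduced in outline with standard libraries. The sketch below is illustrative only: the study used GraphPad Prism and the R DescTools package, whereas this uses Python with NumPy and SciPy, the score values are invented placeholders rather than the study data, and this minimal Kendall’s W omits the tie correction.

```python
import numpy as np
from scipy import stats

# Toy reviewer scores on the -2..2 scale: rows = 4 reviewers, columns = questions.
# One such matrix would exist per chatbot per category; values below are placeholders.
reviewer_scores = {
    "ChatGPT": np.array([[2, 1, 2, 0, 2], [2, 2, 1, 1, 2], [1, 2, 2, 0, 2], [2, 1, 2, 1, 2]]),
    "Bard":    np.array([[0, 0, 1, 0, -1], [0, 0, 2, 0, -1], [0, 1, 1, 0, -2], [0, 0, 1, 0, -1]]),
    "BingAI":  np.array([[1, -1, 0, 2, 1], [0, -1, 1, 2, 1], [1, -2, 0, 1, 1], [1, -1, 0, 2, 0]]),
}

# Per-question median across the four reviewers (the value plotted and analyzed).
medians = {bot: np.median(scores, axis=0) for bot, scores in reviewer_scores.items()}

# Friedman test comparing the three chatbots across the same questions
# (pairwise post hoc comparisons, e.g. Dunn's test, were run separately in Prism).
stat, p = stats.friedmanchisquare(medians["ChatGPT"], medians["Bard"], medians["BingAI"])
print(f"Friedman chi-square = {stat:.2f}, P = {p:.4f}")

def kendalls_w(ratings):
    """Kendall's coefficient of concordance among raters (no tie correction)."""
    m, n = ratings.shape                                      # m raters, n questions
    ranks = np.apply_along_axis(stats.rankdata, 1, ratings)   # rank questions within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

for bot, scores in reviewer_scores.items():
    print(bot, "Kendall's W:", round(kendalls_w(scores), 3))
```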

Results

General performance of AI chatbots to address skeletal queries

The performance of ChatGPT 4.0, BingAI, and Bard in responding to skeletal health-related prompts was assessed in three domains: skeletal biology and pathophysiology, clinical management of skeletal disorders, and model patient queries. A total of 90 prompts were given to each chatbot, and both ChatGPT and BingAI generated responses for all questions. However, Bard was unable to provide an answer to 37% of bone physiology questions, 67% of physician management questions, and 70% of patient queries. Across all categories, ChatGPT had a median score of 2, and BingAI received a median score of 0. Given that Bard failed to address some prompts, it received a median score of 0.75 for bone physiology but 0 in the physician management and patient queries sections. Table 2 includes selected queries and responses in each of the three categories. A complete list of the queries, responses, and reviewer scores is in Supplementary Table S1. Significant differences were noted in the overall performance of ChatGPT versus Bard and BingAI (Figure 1). Differences in aggregate scores between Bard and BingAI were not significant when considered across all categories.

Table 2.

Sample AI chatbot responses.

Q3. How can osteocytes help determine how bone is formed and resorbed? (Bone physiology/pathophysiology)
ChatGPT: Osteocytes can help determine how bone is formed and resorbed in several ways:
1. Mechanosensation: osteocytes are sensitive to mechanical forces or strains experienced by the bone.
2. Communication with other bone cells: this allows them to coordinate bone formation and resorption based on various factors, including mechanical loading and hormonal signals.
3. Regulation of osteoblasts and osteoclasts: osteocytes secrete several signaling molecules that can regulate the activity of osteoblasts and osteoclasts.
4. Osteocyte apoptosis: this process can initiate targeted bone resorption by attracting osteoclasts to the damaged area and inducing bone remodeling to repair and replace the damaged bone tissue.
Bard: Osteocytes play a key role in regulating bone remodeling, which is the process by which old bone is broken down and new bone is formed. The balance between bone resorption and bone formation is tightly controlled by osteocytes. Osteocytes are also important for maintaining the mineral balance of the bone. Osteocytes are essential for maintaining healthy bones. They sense changes in the mechanical environment of the bone, and they release factors that regulate bone remodeling and mineral balance. When osteocytes are damaged or dysfunction, it can lead to a variety of bone disorders, including osteoporosis and osteopetrosis.
BingAI: Osteocytes are one of the four kinds of bone cells. They are highly specialized in nature and are responsible for the maintenance of the bony matrix. Osteocytes contribute to bone mass via controlling osteoblast and osteoclast activity.2 Osteocyte activity plays a key role in bone formation and resorption. Bone is resorbed by osteoclasts and is deposited by osteoblasts in a process called ossification.1 Osteocytes synthesize sclerostin, a secreted protein that inhibits bone formation by binding to LRP5/LRP6 coreceptors and blunting Wnt signaling. Sclerostin is critical for bone remodeling.

Q10. How is it determined where joints will form during early skeletal development and what molecular signals regulate this process? (Bone physiology/pathophysiology)
ChatGPT: The process can be broadly divided into two main phases: mesenchymal condensation and joint cavitation.
Mesenchymal condensation: during early development, cells called mesenchymal cells aggregate and condense to form the early skeletal elements called cartilage anlagen. These cells will eventually differentiate into chondrocytes (cartilage cells) and osteoblasts (bone cells). The process of mesenchymal condensation is regulated by various signaling pathways, such as the bone morphogenetic protein pathway, fibroblast growth factors, and Wnt signaling.
Joint cavitation: after the initial condensation, joints begin to form in the cartilage anlagen by a process called joint cavitation. During this process, a thin layer of cells called the interzone is formed between the developing bones. The interzone will eventually give rise to the joint cavity and associated structures such as synovium and articular cartilage. Key molecular signals involved in joint cavitation include:
-Growth Differentiation Factor 5 (GDF5): GDF5 is critical in joint formation. It is specifically expressed in the interzone and is required for the initiation of joint development.
-Wnt/β-catenin signaling: Wnt signaling plays an essential role in joint formation by regulating cell fate determination and differentiation.
-Notch signaling: it is involved in the regulation of chondrocyte differentiation and the formation of the interzone.
-Indian Hedgehog (IHH): IHH helps to regulate chondrocyte proliferation and differentiation, as well as the overall patterning of the developing skeleton.
Bard: One of the most important factors in determining where joints will form is the position of the developing bones. The extracellular matrix is made up of a variety of molecules, including collagen, elastin, and glycosaminoglycans. These molecules provide structural support for the bones and also help to regulate the movement of cells and molecules within the extracellular matrix. The extracellular matrix also plays a role in signaling the cells that will eventually become joint tissues. For example, the molecule Sonic hedgehog (Shh) is produced by the developing bones and helps to signal the cells that will become cartilage. Shh also helps to regulate the growth and differentiation of the cartilage cells. In addition to the extracellular matrix, a number of other molecules also play a role in joint formation. These molecules include growth factors, transcription factors, and cytokines. Growth factors are proteins that promote cell growth and division. Transcription factors are proteins that regulate the expression of genes. Cytokines are proteins that communicate between cells. The precise mechanisms by which these molecules regulate joint formation are still being studied, but it is clear that they play a critical role in this process.
BingAI: During early skeletal development, a framework is laid down that determines where bones will form. This framework is a flexible, semisolid matrix produced by chondroblasts and consists of hyaluronic acid, chondroitin sulfate, collagen fibers, and water.2 Before endochondral bone development takes place, chondrocytes form a cartilage analgen that will be sequentially segmented to form joints. In the cartilage template, either the cartilage maturation program or the joint formation program is activated.1

Q17. What are osteomorphs and how is this concept proposed to contribute to regulation of bone turnover? (Contemporary bone physiology)
ChatGPT: Osteomorphs are not a well-established term or concept in the field of bone biology or medicine. It seems that you may be referring to “osteocytes,” which are the most abundant cells in bone tissue and play a crucial role in regulating bone turnover.
Bard: Osteomorphs are a newly discovered type of bone cell that are thought to play a role in the regulation of bone turnover. They are derived from osteoclasts, which are large multinucleated cells that break down bone. Osteoclasts are thought to undergo apoptosis once they have finished resorbing bone, but recent studies have shown that they can also divide into smaller cells called osteomorphs. Osteomorphs are not as active as osteoclasts at resorbing bone, but they can produce factors that promote bone formation.
BingAI: Osteomorphs are nonresorbing daughter cells that osteoclasts can dedifferentiate into prior to being recycled as osteoclasts.1 Osteomorphs can accumulate under the effect of RANKL inhibition as a reservoir, contributing to the bone turnover rebound when Dmab is stopped.

Q26. Should premenopausal women who use glucocorticoids be treated to prevent osteoporosis? (Physician management)
ChatGPT: Premenopausal women who use glucocorticoids, such as prednisone, may be at an increased risk for osteoporosis. Glucocorticoids can lead to bone loss and reduced bone mineral density, which in turn can increase the risk of fractures. The decision to treat premenopausal women with glucocorticoids to prevent osteoporosis should be based on an individual assessment of their risk factors and potential benefits.
Bard: I’m unable to help you with that, as I’m only a language model and don’t have the necessary information or abilities.
BingAI: Premenopausal women requiring glucocorticoids are at a significant risk of developing glucocorticoid-induced osteoporosis.1 Bisphosphonate therapy is a vital option in the prevention and treatment of glucocorticoid-induced osteoporosis.1 The use of bisphosphonates in young women may be safer than initially thought and should likely be used for the treatment of glucocorticoid-induced osteoporosis in rheumatology clinics.3 However, it is important to note that further research is needed to continue to understand long-term risk.3

Q7. Will quitting smoking reverse my osteoporosis? (Patient queries)
ChatGPT: Quitting smoking may not completely reverse osteoporosis, but it can have a positive impact on your bone health. Smoking has been found to have negative effects on bone density and overall bone health, increasing the risk of osteoporosis and fractures.
Bard: Quitting smoking will not reverse your osteoporosis, but it can help to prevent further bone loss and may even help to rebuild some bone mass. Smoking is a major risk factor for osteoporosis, and it can accelerate the loss of bone mass. When you quit smoking, your body starts to produce more bone-building cells and less bone-breaking cells. This can help to slow the progression of osteoporosis and may even help to rebuild some bone mass.
BingAI: Smoking has been identified as a risk factor for osteoporosis and bone fracture. Quitting smoking can help limit smoking-related bone loss. Stopping smoking begins to reverse this risk after about 10 years.

Q28. How much time do I need to spend in the sun to get enough vitamin D? (Patient queries)
ChatGPT: The amount of time you need to spend in the sun to get enough vitamin D depends on various factors, including your skin type, the time of day, your location, and the amount of skin exposed. It is generally recommended that people with lighter skin spend about 10–30 minutes in the sun at least twice a week, while those with darker skin may need more time, about 20–60 minutes. This should be done with at least 40% of your skin exposed, such as your arms, legs, and face.
Bard: The amount of time you need to spend in the sun to get enough vitamin D depends on a number of factors, including your skin color, the time of year, the time of day, and your location. If you’re concerned about your vitamin D levels, talk to your doctor. They can recommend a blood test to check your levels and advise you on how to get enough vitamin D.
BingAI: Most people need to get 5 to 30 minutes of sun exposure at least 2 days a week for their body to make sufficient amounts of vitamin D. The amount of time needed also depends on the weather conditions and the time of year.1 Midday sun exposure may help you reach your recommended amount of vitamin D in less time.2 However, too much sun exposure can have side effects.2

Sample illustrative AI chatbot responses are provided. The comprehensive set of responses and reviewer scores is present in Supplementary Table S1.

Figure 1.


Scoring of chatbot performance by topic area. Each dot represents the median of the −2 to 2 scores given by the four reviewers for that question. Lines represent the median of the reviewers’ scores. N = 30 questions per category. A Friedman test with Dunn’s multiple comparisons post hoc test was applied. P-values for the post hoc test are as indicated: (A) ****P < .0001, **P = .0012; (B) ****P < .0001, ***P = .0002; (C) ***P = .0004, ****P < .0001.

Overall, a high degree of concordance among reviewers was achieved when scoring chatbot responses in each of the three categories, as indicated by the Kendall’s W scores of interrater agreement (Table 1), where scores of 0.20 to 0.40 imply fair agreement, scores of 0.40 to 0.60 indicate excellent agreement, and scores >0.60 suggest substantial to almost perfect agreement. The greatest discrepancy among reviewers was observed in scoring Bard’s performance in answering bone pathophysiology and physician management questions, where only a fair level of agreement was seen. For instance, scores varied for Bard’s summary of the mechanisms of joint formation during early skeletal development, with two reviewers rating it as “excellent” while two other reviewers assigned a lower score, noting that the response lacked critical details and failed to explain key processes involved in joint formation (Q10 of bone physiology). Likewise, reviewer assessments differed when grading a response pertaining to a DXA scan for a 48-year-old woman (Q3 of physician management), with reviewers weighting differently the importance of the chatbot response acknowledging how recommendations from the US Preventive Services Task Force differ from other professional society guidelines.

Table 1.

Interrater agreement of chatbot performance by category. Kendall’s coefficient of concordance was used to assess agreement among the four raters per topic area.

Kendall’s coefficient of concordance among reviewers
ChatGPT Bard Bing AI
Bone physiology 0.51202 0.34023 0.53762
Physician management 0.51692 0.25794 0.62763
Patient queries 0.46559 0.6667 0.64197

Among the three categories, ChatGPT was most proficient at generating complete and accurate responses to a diverse set of queries, with some of the responses rivaling what would be expected for a summary drafted by a subject matter expert. In contrast, Bard refused to generate a response to most queries, notably providing no responses for 20/30 clinical management–related questions and 21/30 patient queries, instead giving answers similar to “I’m only a language model and don’t have the necessary information or abilities” (Table 2). These responses were assigned a score of zero. Overall, Bard was almost twice as likely not to respond to clinical queries as to the less clinical queries in the bone physiology category.

BingAI’s performance was broadly somewhat less consistent than that of ChatGPT, as BingAI was more likely to provide short answers that omit important concepts. For instance, when asked about causes of secondary osteoporosis, both ChatGPT and Bard provided long answers that organized causes of secondary osteoporosis by category (Q1), whereas BingAI gave a much shorter answer that, while essentially correct and concise, was not as systematic or thorough as those of the other LLM chatbots. Similarly, when BingAI was asked how osteomalacia results in impaired skeletal mineralization (Q14), it provided a short definition of osteomalacia as opposed to answering the prompt. Notably, BingAI did provide citations by default, and an audit of these citations revealed them to be genuine web-accessible resources. However, in several cases, the references cited did not appear to adequately support the claims made in the response. Additionally, the references cited were often not peer reviewed, such as patient-facing informational websites (eg, Healthline) or Wikipedia, and it was less common that primary sources of information were cited in responses generated by BingAI.

Across all categories, all of the AI chatbots had trouble conveying the relative strength of evidence behind their responses, not consistently communicating which concepts are well established and supported by high-quality evidence and which concepts are controversial or supported by limited or low-quality evidence. All of the AI chatbots were sometimes observed to take a few “keywords” from the prompt and provide disjointed responses keyed to them, though this appeared to be most common with BingAI. For example, when asked what factors govern endochondral versus intramembranous fracture healing and how this relates to distraction osteogenesis, BingAI provided separate short paragraphs describing endochondral versus intramembranous ossification and distraction osteogenesis without linking these concepts into a holistic response to the question posed (Q7, Table 2).

Bone physiology

Of the three AI chatbots examined, ChatGPT was the most adept at summarizing concepts of bone biology and bone pathophysiology (median score of 2 from 4 reviewers). Bard was unable to generate a response to 11/30 of the given prompts, resulting in a median combined score of 0.75. Responses from BingAI received the lowest scores in this category (median score of 0) and were often brief and lacking in detail.

The chatbots appeared to be capable of providing high-quality responses on topics that are well established. When asked to describe the secondary causes of osteoporosis (Q1) or the role of osteocytes in bone homeostasis (Q3), the AIs were largely successful in providing a detailed explanation similar to that expected of a subject matter expert. We speculate that these strong responses are due to the presence of a sufficient corpus of relevant training material on these better studied topics.

BingAI sometimes demonstrated stronger performance when addressing questions aimed at specific contemporary studies. For instance, when asked about the function of PTHrP+ growth plate stem cells, BingAI appropriately cited the key relevant publication.19 In another example, the three AI chatbots were asked about the function of osteomorphs, a cell type recently proposed to be involved in osteoclast recycling20 (Q17). ChatGPT incorrectly assumed that the reference to osteomorphs in the prompt was an attempt to ask about osteocytes. In contrast, Bard and BingAI generated a more relevant summary of the key recent paper.

Physician management/diagnosis

There is interest in the ability of AI chatbots either to assist with formulating drafts of a patient-specific diagnostic and therapeutic management plan or to serve as a tool to summarize information for nonspecialist clinicians.21 Therefore, we evaluated the ability of these AI chatbots to provide assessments of and recommendations for sample clinical scenarios framed as brief case presentations. ChatGPT displayed the best overall scores (median score of 2), with Bard failing to provide an answer to most queries (19/30), leading to a median score of 0. BingAI also displayed lower overall scores than ChatGPT, likewise yielding a median score of 0. ChatGPT was noted to provide appropriate disclaimers that “it is not a doctor” and that a physician should be consulted for medical advice, but BingAI generally did not provide such a disclaimer.

All of the chatbots struggled with considering how patient context, demographics, or comorbid conditions may alter recommendations. When presented with a 48-year-old woman inquiring about a DXA scan (Q3), ChatGPT and Bard provided a detailed response outlining the next steps but did not consider her menstrual status. BingAI ignored the given patient–physician scenario and provided irrelevant general information about DXA scans. All of the chatbots also struggled when asked to provide recommendations on whether premenopausal women being treated with glucocorticoids should receive interventions to decrease the risks of glucocorticoid-induced osteopenia (Q26). No response was generated by Bard, but both ChatGPT and BingAI failed to include several critical considerations including potential childbearing plans, the role of DXA scans, or how treatment may impact plans for lactation.

These LLMs generated more accurate responses when tasked with basic queries concerning common bone diseases. All three AI-based chatbots gave adequate responses when asked about diagnosing Paget’s disease of bone (Q22), outlining the steps involved in diagnosing this common disorder. When prompted for information to share with patients who have suffered an osteoporotic hip fracture (Q6), ChatGPT and Bard were able to provide a succinct response highlighting the most important points covered in patient-facing educational materials.

Patient queries

There is intense interest in the ability of AI chatbots to serve as a patient-facing tool for health education or triage of health concerns. The ability of ChatGPT, BingAI, and Bard to serve this function was evaluated here. Each conversational AI was presented with typical patient questions relating to skeletal health, such as whether to take vitamin D supplements, how lifestyle modifications will impact bone health, recommendations for DXA screening, and common questions surrounding adverse effects of osteoporosis treatment with bisphosphonates. Of the three chatbots examined, ChatGPT scored the highest for content validity, with most responses considered adequate (score of 2) or only containing minor omissions or issues with accuracy (score of 1). Both BingAI and Bard displayed lower scores, with Bard failing to answer the majority of queries (21/30) and BingAI providing numerous incomplete or inadequate responses (scores of −1 or less).

All chatbots were noted to have responses with important deficiencies in this category (scores of −1 or −2). For instance, when asked “What is the best treatment for osteoporosis?” (Q26), both ChatGPT and BingAI did not fully answer the question, providing lists of medications without adequate information on when each might be appropriate. The AI chatbots were also asked how to improve fracture recovery (Q30) but failed to consider medications that may improve or inhibit healing. Additionally, BingAI provided an inappropriate suggestion that oral administration of vitamin C and quercetin promotes bone healing.

However, the chatbots generally demonstrated a stronger ability to formulate detailed responses to common topics. For example, when asked if vitamin D supplements are necessary (Q2) or about calcium-rich foods for lactose-intolerant people (Q27), the chatbots were able to concisely highlight key information. Acceptable answers were also provided to other general questions such as recommended daily calcium intake (Q17), exercise recommendations for osteoporosis (Q20), and regular sun exposure to produce vitamin D (Q28). Thus, these chatbots may be more suitable for handling common, generic queries as opposed to questions where specialized knowledge, critical thinking, or consideration of patient-specific context are required.

Discussion

Although LLM chatbots offer exciting opportunities through their ability to potentially summarize complex topics on demand, it is crucial that their risks and limitations be understood. Here, we explore their applications to different areas of basic science, clinical research, and patient care in the bone field. We find that AI LLM chatbots display mixed performance, appearing to excel in some categories while displaying clear weaknesses in others. Moreover, these areas of strength and weakness varied markedly by model.

First, it is important to acknowledge several limitations of the study methodology employed here. One limitation is that, while the patient queries were formulated to be representative of real-life patient questions, they were nevertheless formulated by health care professionals and may therefore diverge from typical patient concerns or phrasing in important ways. Careful phrasing of prompts must also be considered, as chatbots may generate different responses depending on whether the questioner is framed as a health professional or a patient, as seen in the example in Supplementary Table S2. In future studies, it would be beneficial to directly poll patients to obtain questions for similar analyses. Second, while posing questions solely as a single prompt in a new chatbot session without follow-up was highly desirable for maximizing the consistency of responses and the reproducibility of the study methodology, interactive sessions consisting of multiple successive questions and user feedback on the chatbot’s response may generate significantly different answers than those obtained here, and may in particular be helpful for driving initially irrelevant responses toward the desired subject matter or for eliciting important contextual information in clinical questions.

Overall, ChatGPT displayed the strongest performance across categories, though there were still critical limitations in its performance. It generally excelled at summarizing well-established foundational concepts in skeletal biology, though it was inconsistent in incorporating newer concepts, for instance not discussing the role of LEPR+/CXCL12-abundant reticular cells when discussing populations of cells comprising the bone marrow stroma,22–25 not discussing evidence that the pattern of dissemination of skeletal metastases may differ in breast versus prostate cancer,26,27 and omitting the importance of senescence in skeletal aging.28 This limitation persisted even after taking the training data cutoff date of September 2021 into account. Due to this issue, ChatGPT may be useful in an educational context to summarize well-established concepts such as those that might appear in a textbook, but it should be used with caution and oversight, as answers may not incorporate important recent discoveries. We speculate that this may be due to the presence of a large corpus of historical literature embracing traditional viewpoints in the field that is given greater weight by virtue of repetition than a handful of more recent breakthrough studies. ChatGPT and other LLM chatbots are unable to discern which of these recent papers represent “breakthroughs” that should be weighted over historical literature in the field. If LLM chatbots are widely utilized for education and training, this issue poses the risk of amplifying longstanding traditional concepts in the field at the expense of appropriately highlighting recent advances, perhaps slowing the rate at which these advances are incorporated into the field unless mitigating steps are taken. It is likely that there is no substitute for professional judgment in discerning which among recent publications represent “breakthroughs” that should alter fundamental concepts in the field and that this will remain an area of weakness for LLM chatbots unless manually curated data are utilized for training. However, it is important to acknowledge that even subject matter experts may disagree substantially in how to prioritize recent literature.29 Similarly, subject matter experts are more likely to disagree in areas outside of well-established areas of study, so some greater degree of uncertainty or worse performance on such topics is necessarily expected. More broadly, the training cutoff date for GPT-4 is detrimental for professional use, since science and medicine are continuously evolving, leading to responses potentially omitting critical recent updates. However, we recognize that there is an increasing number of chatbots with real-time Internet access and search capabilities that may mitigate this issue.

For some queries, it is important not only to summarize available information but also to contextualize the response with a sense of the quality of the available data. All of the chatbots struggled with consistently conveying the level of evidence quality when formulating responses to both scientific and clinical queries. For instance, ChatGPT generally did not discuss the quality of evidence underlying recommendations, such as when discussing platelet-rich plasma injections for osteoarthritis.

A key limitation with respect to clinical queries was the inconsistent ability of ChatGPT to take patient demographics or clinical context into account. For instance, it did not consider that a hypothetical 48-year-old woman may not be menopausal when recommending a DXA screening strategy. We speculate that this is due to an apparent tendency of ChatGPT to sometimes latch on to “keywords” within the prompt and provide an answer focused on these keywords that does not fully integrate all elements of the query. However, in some cases, chatbots may be better at eliciting clinical context in the setting of a multiprompt conversation rather than the isolated prompts used here.

Lastly, despite these limitations and the lack of consistency, there were a number of responses in each category where ChatGPT provided an outstanding and concise synthesis of the issues raised in the prompt. For instance, ChatGPT provided a very strong summary of the “vicious cycle” paradigm when asked about mechanisms of osteolysis in breast cancer metastases, Wolff’s law, or causes of secondary osteoporosis. Some clinical questions elicited similarly strong responses, such as those on monitoring of patients with hypoparathyroidism, the role of bone turnover marker testing in patients with chronic kidney disease, or lifestyle modifications to reduce fracture risk. This suggests that one of the most compelling current uses for ChatGPT is as a “virtual assistant” to a subject matter expert who is capable of reviewing the responses to reject or adjust those that are inaccurate or incomplete. Additionally, these AI chatbots often offer a more informative and straightforward response compared with a standard web search. For instance, the top results of a web search for diagnosing Paget’s disease do not include a full list of tests a physician may order, whereas the chatbots offered a list of assessments followed by a brief explanation for each test. This use case may allow harnessing the often impressive ability of ChatGPT to generate professional content while mitigating problems related to inconsistency or other weaknesses through manual review and editing. While we would not recommend using ChatGPT as an unsupervised patient-facing triage tool, it could be very useful in generating a “first draft” of patient-facing educational materials for review and editing by a subject matter expert. ChatGPT may have similar utility in generating “first draft” materials for courses on skeletal development or physiology or for a manuscript introduction or a review manuscript, though journals may have different stances on this topic and may require disclosure or otherwise limit the practice. Additionally, there is currently a crisis in osteoporosis treatment, with many patients and also some clinicians not being fully aware of the need for osteoporosis screening or the utility of current therapeutic options, leading to osteoporosis therapies being greatly underutilized.30 Perhaps a patient-facing LLM chatbot could be trained on a curated set of osteoporosis-relevant literature and thereby serve as an important and accessible tool in beginning to address this crisis.

It is important to note that utilizing these models to assist with creating content raises critical questions in the field regarding authorship and responsibility for the data. Many journals now have guidelines regarding how authors should acknowledge the use of LLMs in their work. This is an important, challenging, and rapidly evolving topic.

Bard was inconsistent when addressing questions, as it refused to generate a response nearly 70% of the time. Overall, it was unclear how Bard decided which questions to answer, as Bard did not appear to be more likely to respond to questions carrying a low risk of negative consequences for an incorrect answer than to questions carrying a high risk. Nevertheless, the overall rate of refusal to answer more directly clinical questions was almost twice the rate of refusal to answer bone pathophysiology questions, suggesting that this refusal to answer may be the product of an overtuned filter on providing medical advice that has spilled over into broader biomedical topics. Concerningly, this filter appeared to be inconsistent, as responses were generated 30% of the time and a response could be obtained to any prompt with repeated queries, suggesting that it is only stochastically effective. It is possible that a 70% effective content filter poses a greater risk than not having such a filter at all, as it creates a high degree of inconsistency in responses. When Bard did provide a response, answers were split between being similar in content to the ChatGPT response and being somewhat less comprehensive. Nonetheless, while a refusal to respond to a query is not particularly favorable, it may be preferable to providing an incorrect or incomplete response in most cases.

BingAI displayed several notable differences from ChatGPT. First, BingAI tended to provide shorter, more concise answers than the other chatbots, which we speculate may be the product of positioning BingAI as more of a web search replacement than a general-purpose text generation tool. While occasionally this brevity resulted in very clear answers, it often led to BingAI responses missing key information, such as in its response to whether quitting smoking will reverse osteoporosis. Similarly, when discussing the side effects associated with bisphosphonate treatment, BingAI omitted mention of atypical femoral fractures.

BingAI also provides citations with hyperlinks as a default setting, which is a desirable characteristic for the use of a chatbot in a professional setting. However, the quality of citations varied greatly, ranging from Wikipedia, Britannica, or Quizlet to articles indexed on PubMed. Thus, it is unclear whether peer-reviewed professional data sources are weighted over non–peer-reviewed content, including student study materials. Similar issues may also exist with Bard and ChatGPT, though they are less immediately evident because providing citations is not the default for these chatbots. While current versions of publicly available LLMs are prone to hallucinations, in which the bot responds with inaccurate information that sounds plausible, most notably fabricating citations that do not correspond to any real-world publication,31 “hallucinated” citations or other “hallucinated” content were not observed in the BingAI queries posed here. There are several potential solutions to this issue of “hallucinated” or otherwise inaccurate responses.5 One includes asking the chatbot to assume a hypothetical role to guide the tone and content of the response. For instance, asking the chatbot to pretend to be a consulting physician when giving a response can perhaps guide the response content and tone in a professional direction. Similarly, some chatbots have a “temperature” setting that allows the user to balance “creativity” against consistency with the material in the training dataset. Using a lower temperature setting (eg, the “more precise” setting for BingAI used here) can reduce the risk of hallucinations and may be generally more appropriate for the clinical and scientific use cases considered here; a hedged illustration of these two prompt-level mitigations is sketched below. Additionally, another solution to issues of citing nonprofessional sources or inconsistency in response accuracy is to selectively train the LLM chatbot on professional sources and to involve domain-specific experts in model evaluation and tuning. The recent publication of Med-PaLM provides an example that such an approach can indeed improve LLM performance in addressing medical questions.32
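As a concrete illustration of the two mitigations mentioned above, the sketch below combines a role-setting system message with a low temperature in a single API request. The wording, model name, and temperature value are assumptions chosen for illustration, not settings validated in this study, and the consumer chatbots evaluated here do not all expose these controls directly.

```python
# Illustration of role prompting plus a low "temperature"; all values are assumptions.
from openai import OpenAI

client = OpenAI()

reply = client.chat.completions.create(
    model="gpt-4",
    temperature=0.2,  # lower temperature favors consistency over "creativity"
    messages=[
        {
            "role": "system",
            "content": (
                "You are a consulting endocrinologist. Answer conservatively and "
                "state when the underlying evidence is limited or uncertain."
            ),
        },
        {
            "role": "user",
            "content": "Should premenopausal women who use glucocorticoids be "
                       "treated to prevent osteoporosis?",
        },
    ],
)
print(reply.choices[0].message.content)
```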

Overall, AI chatbots are increasingly popular, with a high degree of public awareness and intense interest in their potential applications. In many cases, the responses generated to queries focused on skeletal biology here were remarkable for their completeness, often rivaling what an expert in the field could produce. However, important limitations were observed, most notably inconsistency in responses, a struggle to provide information relating to patients’ individual needs, and an inability to holistically consider context or to appropriately weigh evidence based on quality when conflicting information is present in the literature. In some cases, it was unclear whether peer-reviewed data sources were given priority over nonreviewed user-posted content. Based on these limitations, while AI chatbots, particularly ChatGPT, have remarkable capabilities to generate content that is usable in professional contexts, it is recommended that this use be closely supervised by a domain expert who is capable of evaluating responses to reject or edit those that are incomplete or misleading, thereby mitigating the risk associated with the current limitations of these models. However, while this may be feasible for professional use, it is unrealistic to expect that patient use of these tools will be supervised. Instead, clinician awareness of the limitations of AI chatbots will enable them to better anticipate and counter potential issues arising from their use by patients.

These models are likely to continue to advance in sophistication and utility, so it is important that their performance be reevaluated as major new versions become available. Lastly, while not the focus of this evaluation, we recognize that the responses provided here were all generated using training content created by professionals in the skeletal biology and skeletal clinical communities. As such, the use of this content for model training raises numerous ethical and practical questions, including (1) to what degree those building these chatbot systems have a responsibility to accurately represent the source material and provide accurate clinical recommendations, (2) whether vendors creating these models have an obligation to support those creating content used for model training, and (3) whether those training these models are obligated to include relevant professionals in the training process if these models will be used for medical and scientific applications. Dialog on these and related questions is actively occurring in the lay press, and we submit that it will be important that skeletal biology research and clinical professionals be informed and represented in these critical discussions.

Supplementary Material

JBMR-LLM-Chatbots-Supplemental_Table_1_zjad007
JBMR-LLM-Chatbots-Supplemental_Table_2-sample-questions_zjad007

Acknowledgments

M.B.G. is supported by Pershing Square Sohn Cancer Research Alliance and Pershing Square MIND Prize Awards and an Irma T. Hirschl Career Scientist Award. This project was funded by the NIH under awards DP5OD021351 and R01AR075585 to M.B.G. and P30AR072581 to E.A.I. M.B.G. holds a Career Award for Medical Scientists from the Burroughs Wellcome Fund.

Contributor Information

Michelle Cung, Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, NY 10065, United States.

Branden Sosa, Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, NY 10065, United States.

He S Yang, Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, NY 10065, United States.

Michelle M McDonald, Skeletal Diseases Program, The Garvan Institute of Medical Research, Darlinghurst, 2010, Australia; St Vincent’s Clinical Campus School of Clinical Medicine, University of New South Wales, Kensington, 2052, Australia; School of Medicine Science, Faculty of Medicine and Health, The University of Sydney, Sydney, 2006, Australia.

Brya G Matthews, Department of Molecular Medicine and Pathology, University of Auckland, Auckland, 1142, New Zealand; Center for Regenerative Medicine and Skeletal Development, School of Dental Medicine, UConn Health, Farmington, CT 06030, United States.

Annegreet G Vlug, Center for Bone Quality, Department of Internal Medicine, Leiden University Medical Center, Leiden, 2300, The Netherlands.

Erik A Imel, Indiana Center for Musculoskeletal Health, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, United States.

Marc N Wein, Endocrine Unit, Massachusetts General Hospital, Boston, MA 02114, United States.

Emily Margaret Stein, Division of Endocrinology, Hospital for Special Surgery, New York, NY 10021, United States; Metabolic Bone Service, Hospital for Special Surgery, New York, NY 10021, United States; Research Division, Hospital for Special Surgery, New York, NY 10021, United States.

Matthew B Greenblatt, Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, NY 10065, United States; Research Division, Hospital for Special Surgery, New York, NY 10021, United States.

Conflicts of interest

E.M.S. receives research support from Novartis and Radius. All other authors have no relevant conflicts of interest to declare.

Data availability

The authors confirm that all responses generated by the chatbots and the reviewers’ independent scores for each response are included in this published article.

References

  • 1.OpenAI. GPT-4 technical report. arXiv. March 15, 2023. Available from: http://arxiv.org/abs/2303.08774.
  • 2.Thoppilan R, De Freitas D, Hall J, et al. LaMDA: language models for dialog applications. arXiv. January 20, 2022. Available from: http://arxiv.org/abs/2201.08239.
  • 3.Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589–596. 10.1001/jamainternmed.2023.1838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Munoz-Zuluaga C, Zhao Z, Wang F, Greenblatt MB, Yang HS. Assessing the accuracy and clinical utility of ChatGPT in laboratory medicine. Clin Chem. 2023;69(8):939–940. 10.1093/clinchem/hvad058. [DOI] [PubMed] [Google Scholar]
  • 5.Yang HS, Wang F, Greenblatt MB, Huang SX, Zhang Y. AI chatbots in clinical laboratory medicine: foundations and trends. Clin Chem. 2023;69(11):1238–1246. 10.1093/clinchem/hvad106. [DOI] [PubMed] [Google Scholar]
  • 6.Deebel NA, Terlecki R. ChatGPT performance on the American urological association (AUA) self-assessment study program and the potential influence of artificial intelligence (AI) in urologic training. Urology. 2023;177:29–33. 10.1016/j.urology.2023.05.010. [DOI] [PubMed] [Google Scholar]
  • 7.Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. 10.2196/45312. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;141(6):589. 10.1001/jamaophthalmol.2023.1144. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Weeks R, Sangha P, Cooper L, et al. Usability and credibility of a COVID-19 vaccine chatbot for young adults and health workers in the United States: formative mixed methods study. JMIR Hum Factors. 2023;10:e40533. 10.2196/40533. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Weeks R, Cooper L, Sangha P, et al. Chatbot-delivered COVID-19 vaccine communication message preferences of young adults and public health workers in urban American communities: qualitative study. J Med Internet Res. 2022;24(7):e38418. 10.2196/38418. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Chin H, Lima G, Shin M, et al. User-Chatbot conversations during the COVID-19 pandemic: study based on topic modeling and sentiment analysis. J Med Internet Res. 2023;25:e40922. 10.2196/40922. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Brender TD. Medicine in the era of artificial intelligence: hey chatbot, write me an H&P. JAMA Intern Med. 2023;183(6):507–508. 10.1001/jamainternmed.2023.1832. [DOI] [PubMed] [Google Scholar]
  • 13.Hutson M. Could AI help you to write your next paper? Nature. 2022;611(7934):192–193. 10.1038/d41586-022-03479-w. [DOI] [PubMed] [Google Scholar]
  • 14.Huang J, Tan M. The role of ChatGPT in scientific communication: writing better scientific review articles. Am J Cancer Res. 2023;13(4):1148–1154. [PMC free article] [PubMed] [Google Scholar]
  • 15.Zimmerman A. A ghostwriter for the masses: ChatGPT and the future of writing. Ann Surg Oncol. 2023;30(6):3170–3173. 10.1245/s10434-023-13436-0. [DOI] [PubMed] [Google Scholar]
  • 16.Tsang R. Practical applications of ChatGPT in undergraduate medical education. J Med Educ Curric Dev. 2023;10:238212052311784. 10.1177/23821205231178449. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Zhavoronkov A. Caution with AI-generated content in biomedicine. Nat Med. 2023;29(3):532. 10.1038/d41591-023-00014-w. [DOI] [PubMed] [Google Scholar]
  • 18.Stokel-Walker C, Van Noorden R. What ChatGPT and generative AI mean for science. Nature. 2023;614(7947):214–216. 10.1038/d41586-023-00340-6. [DOI] [PubMed] [Google Scholar]
  • 19.Mizuhashi K, Ono W, Matsushita Y, et al. Resting zone of the growth plate houses a unique class of skeletal stem cells. Nature. 2018;563(7730):254–258. 10.1038/s41586-018-0662-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.McDonald MM, Khoo WH, Ng PY, et al. Osteoclasts recycle via osteomorphs during RANKL-stimulated bone resorption. Cell. 2021;184(5):1330–1347.e13. 10.1016/j.cell.2021.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Galido PV, Butala S, Chakerian M, Agustines D. A case study demonstrating applications of ChatGPT in the clinical management of treatment-resistant schizophrenia. Cureus. 2023;15(4):e38166. 10.7759/cureus.38166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Matsushita Y, Nagata M, Kozloff KM, et al. A Wnt-mediated transformation of the bone marrow stromal cell identity orchestrates skeletal regeneration. Nat Commun. 2020;11(1). 10.1038/s41467-019-14029-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Zhou BO, Yue R, Murphy MM, Peyer JG, Morrison SJ. Leptin-receptor-expressing mesenchymal stromal cells represent the main source of bone formed by adult bone marrow. Cell Stem Cell. 2014;15(2):154–168. 10.1016/j.stem.2014.06.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zhong L, Yao L, Tower RJ, et al. Single cell transcriptomics identifies a unique adipose lineage cell population that regulates bone marrow environment. elife. 2020;9:9. 10.7554/eLife.54695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Zhou BO, Yu H, Yue R, et al. Bone marrow adipocytes promote the regeneration of stem cells and haematopoiesis by secreting SCF. Nat Cell Biol. 2017;19(8):891–903. 10.1038/ncb3570. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Zhang W, Bado IL, Hu J, et al. The bone microenvironment invigorates metastatic seeds for further dissemination. Cell. 2021;184(9):2471–2486.e20. 10.1016/j.cell.2021.03.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Gundem G, Van Loo P, Kremeyer B, et al. The evolutionary history of lethal metastatic prostate cancer. Nature. 2015;520(7547):353–357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Farr JN, Xu M, Weivoda MM, et al. Targeting cellular senescence prevents age-related bone loss in mice. Nat Med. 2017;23(9):1072–1079. 10.1038/nm.4385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–1940. 10.1038/s41591-023-02448-8. [DOI] [PubMed] [Google Scholar]
  • 30.Khosla S, Shane E. A crisis in the treatment of osteoporosis. J Bone Miner Res. 2016;31(8):1485–1487. [DOI] [PubMed] [Google Scholar]
  • 31.Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15(2):e35179. 10.7759/cureus.35179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–180. 10.1038/s41586-023-06291-2. [DOI] [PMC free article] [PubMed] [Google Scholar]


