Abstract
Currently, there are limited studies assessing ChatGPT's ability to provide appropriate responses to medical questions. Our study aims to evaluate the adequacy of ChatGPT's responses to questions regarding osteoporotic fracture prevention and medical science. We created a list of 25 questions based on the guidelines and our clinical experience, and additionally included 11 medical science questions from the journal Science. Three patients, 3 non-medical professionals, 3 specialist doctors, and 3 scientists were recruited to evaluate the accuracy and appropriateness of responses given by ChatGPT-3.5 on October 2, 2023. To simulate a consultation, an inquirer (either a patient or a non-medical professional) sent their questions to a consultant (specialist doctor or scientist) via a website. The consultant forwarded the questions to ChatGPT for answers, which were then evaluated for accuracy and appropriateness by the consultant before being sent back to the inquirer via the website for further review. The primary outcome was the rate of appropriate, inappropriate, and unreliable ChatGPT responses, as evaluated separately by the inquirer and consultant groups. Compared with the orthopedic clinicians, the patients rated the appropriateness of ChatGPT's responses to the questions about osteoporotic fracture prevention slightly higher, although the difference was not statistically significant (88% vs 80%, P = .70). For medical science questions, the non-medical professionals and medical scientists gave similar ratings. In addition, the experts' ratings of the appropriateness of ChatGPT's responses to osteoporotic fracture prevention questions and to medical science questions were comparable. On the other hand, the patients perceived the appropriateness of ChatGPT's responses to osteoporotic fracture prevention questions to be slightly higher than that of its responses to medical science questions (88% vs 72.7%, P = .34). ChatGPT is capable of providing comparable and appropriate responses to medical science questions as well as to fracture prevention-related issues, and both the inquirers seeking advice and the consultants providing advice recognized its competence in these areas.
Keywords: ChatGPT, fracture prevention, medical science, question response
1. Introduction
In November 2022, a research version of an artificial intelligence (AI) system called ChatGPT was released and quickly gained popularity, with media reports indicating over 1 million users within just a few days.[1] ChatGPT is capable of composing emails, writing computer code, and creating movie scripts in response to written prompts, and researchers have even demonstrated its ability to pass medical licensing exams.[2] In the healthcare field, ChatGPT has been tested for simplifying radiology reports for doctors[3] and generating patient discharge summaries from brief prompts.[3,4] While its performance was generally correct in both cases, noticeable errors were observed, such as omitting key medical findings from radiology reports and adding extraneous information to discharge summaries.[3,4] Currently, there are limited studies assessing ChatGPT's ability to provide appropriate responses to medical questions, and the appropriateness of its responses to medical issues has therefore drawn wide concern. In a test of 25 simple questions about cardiovascular disease prevention, ChatGPT provided largely appropriate answers for patients.[5] In this study, we conducted a qualitative assessment of ChatGPT's responses to questions concerning osteoporotic fracture prevention and general medical science. Our objective was to provide evidence supporting the potential application of AI models in the field of medicine.
2. Methods
This questionnaire survey was conducted in February 2023. The research report was structured following the principles of the Declaration of Helsinki, and the study was approved by the respective ethics committee. All participants provided written informed consent.
2.1. Ratings of ChatGPT responses to osteoporotic fracture prevention questions
Based on guidelines, clinical experience, and practical requirements, we designed 25 questions (eTable 1 in appendix; http://links.lww.com/MD/L860) covering fundamental preventive concepts for osteoporotic fractures, including risk factor counseling, test results, and medication information. To simulate real-world scenarios, we distributed these questions to 3 simulated patients (non-healthcare professionals), who were asked to consult 3 experienced orthopedic clinicians for answers through a medical advice-based information website. Each clinician then posed the 25 questions to ChatGPT-3.5 (https://chat.openai.com) on October 2, 2023 (i.e., each question was asked 3 times in total), and recorded and rated the responses as “appropriate,” “inappropriate,” or “unreliable” based on guidelines and clinical judgment.[6] One researcher collected the orthopedic clinicians’ evaluations for final analysis. A question was deemed appropriate if all 3 responses were rated appropriate and inappropriate if all 3 were rated inappropriate; if the 3 ratings for the same question were inconsistent, the question was labeled “unreliable,” and if any of the 3 responses contained inappropriate information, the set was deemed inappropriate. Subsequently, the orthopedic clinicians separately shared ChatGPT’s answers with the patients who had asked the questions, and the simulated patients further evaluated the responses as “appropriate,” “inappropriate,” or “unreliable” based on their own judgment. One researcher collected the patients’ evaluations for final analysis, and the same aggregation rule was applied to derive each question’s final rating.
2.2. Ratings of ChatGPT responses to medical science questions
Based on Medicine & Health questions proposed by the journal Science, we created 11 medical science questions (eTable 2 in appendix; http://links.lww.com/MD/L861).[7] To simulate real-world scenarios, we distributed the questions to 3 simulated inquirers (non-medical degree holders who have graduated from university), who were asked to consult 3 experienced medical scientists for answers through a medical information website-like interface. The medical scientists would then obtain responses from ChatGPT3.5 on October 2, 2023 (i.e., 3 responses for 1 question) and rated them as “appropriate,” “inappropriate,” or “unreliable” based on their judgment. One researcher collected their evaluations for final analysis. If all the 3 responses were rated as “appropriate,” the final result would be considered appropriate. If all the 3 responses were rated as “inappropriate,” the final result would be considered inappropriate. If ratings for the 3 responses were inconsistent, they would be labeled as “unreliable.” If any of the 3 responses contained inappropriate information, all of them would be considered inappropriate.
The medical scientists then sent ChatGPT’s answers to the inquirers, who further evaluated the answers as “appropriate,” “inappropriate,” or “unreliable” based on their own judgment. One researcher collected these evaluations for final analysis, again applying the same aggregation rule.
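For illustration only, the aggregation rule described above can be expressed as a minimal R sketch. This is our reading of the rule, not code used in the study; the function name and rating labels are assumptions, and the sketch covers the unanimity and disagreement clauses (the Methods additionally state that any response containing inappropriate information renders the whole set inappropriate).

```r
# Minimal sketch (assumed function name and labels): derive a question's
# final rating from the 3 ratings given to its 3 ChatGPT responses.
aggregate_ratings <- function(ratings) {
  stopifnot(length(ratings) == 3)
  if (all(ratings == "appropriate")) {
    "appropriate"    # all 3 responses rated appropriate
  } else if (all(ratings == "inappropriate")) {
    "inappropriate"  # all 3 responses rated inappropriate
  } else {
    "unreliable"     # the 3 ratings are inconsistent
  }
}

aggregate_ratings(c("appropriate", "appropriate", "appropriate"))    # "appropriate"
aggregate_ratings(c("appropriate", "inappropriate", "appropriate"))  # "unreliable"
```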
2.3. Statistical analysis
All significance tests were 2-sided, and P < .05 was considered statistically significant. All statistical analyses were conducted using R statistical software, version 4.2.2 (R Foundation for Statistical Computing).
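The specific significance test is not named in the text. As a hedged illustration, comparing each pair of groups as counts of appropriate versus non-appropriate questions with Fisher's exact test in R (a common choice for samples of this size, and an assumption on our part) yields 2-sided P values matching those reported in the Results; the counts below are reconstructed from the percentages in Table 1.

```r
# Illustrative only: Fisher's exact tests on appropriate vs non-appropriate
# question counts reconstructed from the reported percentages (the test
# itself is assumed; the paper does not state which test was used).

# Patients vs experts, fracture prevention: 22/25 vs 20/25 appropriate
fisher.test(matrix(c(22, 3, 20, 5), nrow = 2, byrow = TRUE))  # P ~ .70

# Patients vs experts, medical science: 8/11 vs 9/11 appropriate
fisher.test(matrix(c(8, 3, 9, 2), nrow = 2, byrow = TRUE))    # P ~ 1.00

# Experts, fracture prevention vs medical science: 20/25 vs 9/11 appropriate
fisher.test(matrix(c(20, 5, 9, 2), nrow = 2, byrow = TRUE))   # P ~ 1.00

# Patients, fracture prevention vs medical science: 22/25 vs 8/11 appropriate
fisher.test(matrix(c(22, 3, 8, 3), nrow = 2, byrow = TRUE))   # P ~ .34
```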
3. Results
The appropriateness rate of ChatGPT responses to questions about osteoporotic fracture prevention was 88% as rated by the patients and 80% as rated by the experts (P = .70), suggesting that the patients were, if anything, slightly more satisfied with ChatGPT’s answers. For the medical science questions, the patients rated the appropriateness as 72.7% and the experts as 81.8% (P = 1.00), with the scientists giving slightly higher ratings; neither difference was statistically significant (Table 1).
Table 1.
Distribution of ratings of ChatGPT responses to the 25 osteoporotic fracture prevention questions and 11 medical science questions.
| Rating | Experts | Patients |
|---|---|---|
| Questions about osteoporotic fracture prevention | | |
| Appropriate | 80% | 88% |
| Inappropriate | 12% | 8% |
| Unreliable | 8% | 4% |
| Questions about medical science | | |
| Appropriate | 81.8% | 72.7% |
| Inappropriate | 18.2% | 9.1% |
| Unreliable | 0% | 18.2% |
The experts’ ratings of the appropriateness of ChatGPT responses to osteoporotic fracture prevention questions and to medical science questions were similar (P = 1.00). The patients rated ChatGPT responses to osteoporotic fracture prevention questions higher than its responses to medical science questions, but the difference was not statistically significant (P = .34).
For the prevention of osteoporotic fractures, the patients’ appropriateness rate was 88%, with 8% of responses rated inappropriate and 4% rated unreliable. For the set of responses rated unreliable, 2 of the 3 responses were judged inappropriate and 1 of the 3 unreliable. For instance, when asked whether individuals aged 65 or older with hip or vertebral fractures should exercise regularly (at least 3 times a week), ChatGPT strongly recommended regular exercise, but the patients believed that exercise would increase the risk of fractures and therefore considered the answer unreliable.

For the prevention of osteoporotic fractures, the experts’ appropriateness rate was 80%, with 12% of responses rated inappropriate and 8% rated unreliable. For the sets of responses rated unreliable, 2 of the 3 responses were judged inappropriate and 1 of the 3 unreliable. For instance, when asked about the possibility of increasing calcium and vitamin D doses at one’s own discretion in the early stage after an osteoporotic fracture, ChatGPT responded that doing so was not recommended. However, according to guidelines, the intake of calcium and vitamin D can be adjusted as needed, and one of the experts therefore considered the answer inappropriate.

For medical science questions, the patients’ appropriateness rate was 72.7%, with 9.1% of responses rated inappropriate and 18.2% rated unreliable. Of the sets rated unreliable, 1 set had 2 of 3 responses judged inappropriate and 1 of 3 unreliable, while the other had 1 of 3 judged inappropriate and 2 of 3 unreliable. For instance, for the question “Will we ever find a cure for the common cold?,” ChatGPT replied that a cure for the common cold has not been found. However, the patients believed that the common cold can be caused by different viruses and therefore might be curable, so they considered the answer unreliable.

For medical science questions, the experts’ appropriateness rate was 81.8%, with 18.2% of responses rated inappropriate. Of the sets rated inappropriate, 1 set had 2 of 3 responses judged inappropriate and the other had 1 of 3 judged inappropriate. For instance, for the question “What is the etiology of autism?,” one expert believed that the answer given by ChatGPT did not align with current research progress and considered it inappropriate.
4. Discussion
In this study, both the patients and the experts were found to be satisfied with ChatGPT’s answers to questions about osteoporotic fracture prevention and general medical science. Compared with the experts, the patients rated ChatGPT’s responses to the questions about osteoporotic fracture prevention slightly higher, but the difference was not statistically significant. For ChatGPT’s responses to general medical science questions, the patients and experts had similar levels of satisfaction. The experts themselves had similar levels of satisfaction with ChatGPT’s responses to both types of questions. The patients rated ChatGPT’s responses to questions about osteoporotic fracture prevention slightly higher than its responses to medical science questions, but again the difference was not statistically significant. This study found that ChatGPT could provide appropriate recommendations for osteoporotic fracture prevention and appropriate answers to general medical science questions. The results suggest that ChatGPT can enhance patient education by providing comprehensive and meticulous medical knowledge and can improve communication between patients and doctors regarding the prevention of osteoporotic fractures. In the study conducted by Sarraju et al, the accuracy of ChatGPT in providing cardiovascular disease prevention recommendations was also quantitatively assessed; the results indicated that ChatGPT could offer largely appropriate responses to simple cardiovascular disease prevention questions, as evaluated by preventive cardiology clinicians.[5] In our results, ChatGPT likewise provided largely appropriate responses for fracture prevention recommendations, which supports the further development of ChatGPT’s ability to answer simple medical questions. Our study is the first to compare the patients’ and experts’ satisfaction with ChatGPT’s responses and to compare the appropriateness of ChatGPT’s responses to specific disease prevention questions with that of its responses to general medical science questions.
Previous studies have shown that ChatGPT is able to provide appropriate answers to questions about cardiovascular disease and breast cancer prevention, suggesting that interactive AI can enhance patient education and improve communication between patients and doctors, thereby assisting clinical workflows.[5,8] Research also indicates that ChatGPT can provide appropriate responses to imaging-related clinical questions, which has the potential to improve clinical efficiency and ensure the responsible use of radiology services.[9] Some researchers have reported that ChatGPT performed similarly to human experts in answering genetics questions.[10] Beyond responses to specific medical questions, other studies have used alternative AI models to diagnose Alzheimer disease in its early stages and predict its severity.[11]
It is important to recognize the pros and cons of using ChatGPT for medical consultations. On the one hand, ChatGPT offers a number of advantages, such as convenience, high accessibility, quick responses, cost-effectiveness, and access to a vast body of medical information. On the other hand, it is necessary to keep its limitations in mind. For example, it may have limited ability to accurately diagnose medical conditions, and there is always the possibility that incorrect information will be provided. Moreover, because no human interaction is involved, patients miss the personal touch and emotional support of a face-to-face consultation. Therefore, while ChatGPT can be a helpful tool, it should never be viewed as a substitute for in-person medical care.
Our results, combined with previous findings, provide supporting evidence for the use of ChatGPT in patient consultations. Not only does this research help to validate the use of AI-powered tools such as ChatGPT in healthcare, but it also offers guidance for clinicians looking to provide high-quality care to their patients. By leveraging the insights and recommendations provided by ChatGPT, doctors may respond to patient inquiries more effectively and provide personalized medical advice tailored to each patient’s unique needs.
This study also has several limitations. Firstly, the questions we formulated regarding osteoporotic fracture prevention and medical science may not fully cover their respective domains, and evaluating the appropriateness of ChatGPT solely on the basis of these questions may be subject to bias. Secondly, the evaluation of ChatGPT’s responses relied mainly on the experts’ judgment, informed by guidelines and clinical experience, and on the patients’ judgment based on personal experience, both of which involve some degree of subjectivity. Thirdly, the AI-generated answers did not cite supporting evidence or indicate a level of evidence, which limits their value from an evidence-based medicine perspective. Finally, ChatGPT generates responses by learning language patterns from data, which can lead to misunderstandings or inaccuracies, especially when it is confronted with ambiguous questions; this has the potential to cause unnecessary concern or mislead patients.
5. Conclusion
ChatGPT provides comparably appropriate responses to fracture prevention questions and medical science questions, and both the inquirers and the consultants recognized its capabilities.
Acknowledgments
We would like to thank XHG, JWB, and YX for their participation in answering the above questions.
Author contributions
Conceptualization: Shuguang Gao, Miao He.
Investigation: Jiahao Meng, Ziyi Zhang, Hang Tang, Pan Liu.
Project administration: Miao He.
Resources: Jiahao Meng, Yifan Xiao.
Software: Yifan Xiao.
Supervision: Shuguang Gao, Miao He.
Writing – original draft: Jiahao Meng, Ziyi Zhang, Pan Liu.
Writing – review & editing: Shuguang Gao, Miao He.
Supplementary Material
Abbreviation:
- AI = artificial intelligence
Supplemental Digital Content is available for this article.
This work was supported by the National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University (2021KFJJ06) and the Natural Science Foundation of Hunan Province (2021JJ30040).
This study was approved by the Ethics Committee of the Xiangya Hospital, Central South University. Informed consent was obtained in writing from all individual participants included in the study. Participation was voluntary and anonymous. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the Declaration of Helsinki.
The authors have no conflicts of interest to disclose.
All data generated or analyzed during this study are included in this published article [and its supplementary information files].
How to cite this article: Meng J, Zhang Z, Tang H, Xiao Y, Liu P, Gao S, He M. Evaluation of ChatGPT in providing appropriate fracture prevention recommendations and medical science question responses: A quantitative research. Medicine 2024;103:11(e37458).
Contributor Information
Jiahao Meng, Email: 1773834816@qq.com.
Ziyi Zhang, Email: zzy_daisy@qq.com.
Hang Tang, Email: 1412972206@qq.com.
Yifan Xiao, Email: 568807916@qq.com.
Pan Liu, Email: 964252072@qq.com.
Shuguang Gao, Email: 251469675@qq.com.
References
- [1]. OpenAI. ChatGPT: optimizing language models for dialogue. Accessed December 11, 2022.
- [2]. The Lancet Digital Health. ChatGPT: friend or foe? Lancet Digit Health. 2023;5:e102.
- [3]. Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol. 2023.
- [4]. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023;5:e107–8.
- [5]. Sarraju A, Bruemmer D, Van Iterson E, et al. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA. 2023;329:842–4.
- [6]. Gregson CL, Armstrong DJ, Bowden J, et al. UK clinical guideline for the prevention and treatment of osteoporosis. Arch Osteoporos. 2022;17:58.
- [7]. Sanders S. 125 questions: exploration and discovery. Washington, DC: AAAS Custom Publishing Office; 2021.
- [8]. Haver HL, Ambinder EB, Bahl M, et al. Appropriateness of breast cancer prevention and screening recommendations provided by ChatGPT. Radiology. 2023;307:e230424.
- [9]. Rao A, Kim J, Kamineni M, et al. Evaluating ChatGPT as an adjunct for radiologic decision-making. 2023:2023.02.02.23285399.
- [10]. Duong D, Solomon BD. Analysis of large-language model versus human performance for genetics questions. Eur J Hum Genet. 2023.
- [11]. Agbavor F, Liang H. Artificial intelligence-enabled end-to-end detection and assessment of Alzheimer’s disease using voice. Brain Sci. 2022;13:28.