The Journal of Spinal Cord Medicine. 2024 Jun 11;48(5):852–857. doi: 10.1080/10790268.2024.2361551

Assessment of the reliability and usability of ChatGPT in response to spinal cord injury questions

Fatma Özcan, Merve Örücü Atar, Özlem Köroğlu, Bilge Yılmaz
PMCID: PMC12329821  PMID: 38860862

Abstract

Objective

The use of artificial intelligence chatbots to obtain information about patients’ diseases is increasing. This study aimed to determine the reliability and usability of ChatGPT for spinal cord injury-related questions.

Methods

Three raters simultaneously and independently evaluated ChatGPT's answers to 47 questions, derived from the three most frequently searched SCI-related keywords in Google Trends ('general information', 'complications' and 'treatment'), on a 7-point Likert scale for reliability and usability.

Results

Inter-rater Cronbach α scores indicated moderate to almost perfect agreement for reliability (α between 0.558 and 0.839) and fair to substantial agreement for usability (α between 0.373 and 0.772). The highest mean reliability score was for 'complications' (mean 5.38) and the lowest for the 'general information' section (mean 4.20). 'Treatment' had the highest mean usability score (mean 5.87), and the lowest mean usability score was recorded in the 'general information' section (mean 4.80).

Conclusion

The answers given by ChatGPT to questions related to spinal cord injury were reliable and useful. Nevertheless, it should be kept in mind that ChatGPT may provide incorrect or incomplete information, especially in the ‘general information’ section, which may mislead patients and their relatives.

Keywords: ChatGPT, Spinal cord injury, Reliability, Artificial intelligence

Introduction

In recent years, developments in digital technology have improved health services in many areas, including digital consultation, telemedicine, telehealth, remote treatment and mobile health applications (1). Artificial intelligence (AI), which encompasses features such as natural language processing, knowledge representation, automated reasoning, machine learning, computational learning and computational communication, was first defined in the 1950s by McCarthy et al. (2–4). AI-based chatbots such as Woebot, OneRemission, Senper, Babylon Health, Infermedica, Gyant and Cancer Chatbot are used in healthcare for many purposes, including diagnosis, screening, patient monitoring and the evaluation of treatment options.

ChatGPT (Chat Generative Pre-trained Transformer) is an artificial intelligence chatbot developed by OpenAI that generates text for various language-processing tasks such as natural language generation, translation, text summarization and dialogue systems (5). ChatGPT can also generate responses, answer questions, write creative stories and produce coherent texts by processing sequential data and analyzing the relationships between sentences (6).

Spinal cord injury (SCI) is a chronic condition that requires regular follow-up and treatment. The National Spinal Cord Injury Statistical Center estimates the prevalence of SCI at around 291,000 people in the United States (7). Regular clinician follow-up is important to manage complications and optimize functional status in this patient group, in which prevalence is high and multiple complications are common (8). However, several barriers, such as lack of financial resources, problems accessing medical resources, lack of social support and confusion caused by conflicting information from different healthcare providers, make it difficult for individuals to self-manage their chronic conditions (9). The Internet and social media platforms offer patients and caregivers opportunities to overcome these limitations and to improve self-management of chronic conditions (10).

Therefore, this study had two aims: (1) to assess the reliability of ChatGPT's responses to SCI-related questions and (2) to determine the usability of those responses.

Materials and methods

The top three most searched keywords for SCI were identified based on individual Google Trends (https://trends.google.com/) search results on 13 October 2023.
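For illustration, a lookup like the one just described could be reproduced programmatically. The sketch below uses the unofficial third-party pytrends library; the authors worked through the Google Trends web interface, so the library itself, the Health category code and the timeframe are assumptions made here for the example.

```python
# A minimal sketch of the Google Trends related-queries lookup (assumed
# tooling: the unofficial pytrends library, not used by the authors).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(
    kw_list=["spinal cord injury"],
    cat=45,           # assumed code for the Google Trends "Health" category
    timeframe="all",  # Trends data begin in 2004; the paper states 2000-present
    geo="",           # empty string = worldwide
)

# related_queries() returns {keyword: {"top": DataFrame, "rising": DataFrame}}
related = pytrends.related_queries()["spinal cord injury"]
print(related["top"].head(10))  # the most relevant related queries
```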

Search parameters were set to worldwide coverage, the time period from 2000 to the present, and the subcategory 'health', after which the 'most relevant' option was selected in the 'related questions' section. Keywords repeating similar content were excluded, and the most frequently searched terms were retained. A total of 47 questions were identified according to the three most frequently searched keywords related to SCI: 'general information', 'complications' and 'treatment'. These questions were developed jointly by all the authors under the guidance of a physiotherapy and rehabilitation specialist (B.Y.) with over 20 years' experience in the field of SCI. The 'general information' section consisted of five questions covering definition, epidemiology, classification, diagnosis and incomplete spinal cord syndromes. The 'complications' section consisted of 34 questions covering general information, respiratory problems, deep vein thrombosis, autonomic dysfunction, heterotopic ossification, osteoporosis, neurogenic bowel dysfunction, neurogenic bladder dysfunction, sexual dysfunction and infertility, pressure ulcers, spasticity and neuropathic pain; three questions were asked for each complication (general information, diagnosis and treatment).

The 'treatment' section consisted of eight questions covering functional electrical stimulation, body-weight-supported gait training, occupational therapy, virtual reality, wheelchair skills training, vocational rehabilitation, transcranial magnetic stimulation and transcranial direct current stimulation.
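The structure of this question inventory can be summarized in a short sketch. The topic labels below are taken from the text; the exact question wordings are not reproduced, and the counts confirm the stated totals (5 + 34 + 8 = 47, with 34 = 1 overall question + 11 complications × 3 aspects).

```python
# A minimal sketch of the question inventory described above.
GENERAL_INFORMATION = [
    "definition", "epidemiology", "classification",
    "diagnosis", "incomplete spinal cord syndromes",
]

COMPLICATION_TOPICS = [
    "respiratory problems", "deep vein thrombosis", "autonomic dysfunction",
    "heterotopic ossification", "osteoporosis", "neurogenic bowel dysfunction",
    "neurogenic bladder dysfunction", "sexual dysfunction and infertility",
    "pressure ulcers", "spasticity", "neuropathic pain",
]
ASPECTS = ["general information", "diagnosis", "treatment"]

# one overall general-information question plus three questions per complication
complications = ["complications - general information"] + [
    f"{topic} - {aspect}" for topic in COMPLICATION_TOPICS for aspect in ASPECTS
]

TREATMENT = [
    "functional electrical stimulation", "body-weight supported gait training",
    "occupational therapy", "virtual reality", "wheelchair skills training",
    "vocational rehabilitation", "transcranial magnetic stimulation",
    "transcranial direct current stimulation",
]

assert len(GENERAL_INFORMATION) + len(complications) + len(TREATMENT) == 47
```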

After a thorough clearing of all browser data, the prepared SCI questions were entered into the 'Conversation' section of the ChatGPT chatbot. All answers were evaluated simultaneously by three independent specialists in physical medicine and rehabilitation (F.Ö., M.Ö.A. and Ö.K.), each blinded to the others' ratings to avoid possible bias. Responses were obtained from the 14 March version of ChatGPT-4 (https://chat.openai.com/).

Each response was rated on a scale of 1 to 7 (where 1 is the lowest score and 7 is the highest) in two categories: reliability and usefulness (11).

Ethics committee approval and informed consent were not required for this study, as only ChatGPT conversations were evaluated and no animals or human participants were involved; related research has followed a similar pathway (11, 12).

Statistical analysis

Statistical analysis was performed using SPSS version 15.0 (IBM®, Chicago, USA). For descriptive statistics, numerical data were expressed as mean ± standard deviation. The Mann–Whitney U test and the Kruskal–Wallis test were used to compare numerical variables between groups. Inter-rater reliability was determined by calculating intraclass correlation coefficients with 95% confidence intervals in a two-way mixed-effects model. Strength of agreement between raters was categorized as slight (≤0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80) or almost perfect (0.81–1.00) (13). A P value of less than 0.05 was considered statistically significant.
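The analyses named above can be illustrated in a minimal sketch. The authors used SPSS; pandas, pingouin and scipy stand in here purely for illustration, and the `scores` table (one row per question, one column per rater) is a hypothetical placeholder, not the study data.

```python
# A minimal sketch, under an assumed data layout, of the analyses named above.
import pandas as pd
import pingouin as pg
from scipy.stats import kruskal, mannwhitneyu

scores = pd.DataFrame({  # hypothetical 1-7 Likert reliability ratings
    "rater1": [6, 5, 4, 7, 5],
    "rater2": [6, 5, 3, 6, 5],
    "rater3": [7, 6, 4, 7, 6],
})

# Inter-rater agreement: Cronbach's alpha across the rater columns
alpha, ci = pg.cronbach_alpha(data=scores)

# ICC in a two-way mixed-effects model requires long format
long = scores.reset_index().melt(id_vars="index", var_name="rater", value_name="score")
icc = pg.intraclass_corr(data=long, targets="index", raters="rater", ratings="score")
icc3 = icc.set_index("Type").loc["ICC3"]  # two-way mixed model, consistency

def landis_koch(a: float) -> str:
    """Strength-of-agreement labels used in the paper (reference 13)."""
    for cutoff, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                          (0.80, "substantial")]:
        if a <= cutoff:
            return label
    return "almost perfect"

# Between-group comparisons of the numerical scores
h_stat, p_kw = kruskal(scores["rater1"], scores["rater2"], scores["rater3"])
u_stat, p_mw = mannwhitneyu(scores["rater1"], scores["rater3"])

print(f"alpha={alpha:.3f} ({landis_koch(alpha)}), ICC3={icc3['ICC']:.3f}, "
      f"Kruskal-Wallis P={p_kw:.3f}, Mann-Whitney P={p_mw:.3f}")
```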

Results

Three independent, experienced raters evaluated ChatGPT's answers to 47 questions based on the three most frequently searched keywords in Google Trends, using a seven-point Likert scale. Table 1 shows the inter-rater agreement for reliability and usability. Inter-rater Cronbach α scores indicated moderate to almost perfect agreement for reliability and fair to substantial agreement for usability (α between 0.558 and 0.839, and α between 0.373 and 0.772, respectively).

Table 1.

Inter-rater agreement between the raters for reliability and usability.

Section | Cronbach α (95% CI) for reliability | Cronbach α (95% CI) for usability
General information | 0.525 (−1.400 to 0.947) | 0.783 (−0.098 to 0.976)
Complications | 0.756 (0.569 to 0.870) | 0.710 (0.488 to 0.845)
Treatment | 0.670 (−0.116 to 0.928) | 0.625 (−0.267 to 0.918)

Note: CI, confidence interval.

When the distribution of Likert scores was analyzed by question topic, scores ranged from 3 to 7. Across the answers to the 47 questions, the highest reliability scores (raters 1 and 2: 6 points; rater 3: 7 points) were for sexual dysfunction and infertility-diagnosis and pressure ulcers-diagnosis. The highest usability scores (all raters: 6 points) were for respiratory problems-general information, sexual dysfunction and infertility-diagnosis, sexual dysfunction and infertility-treatment, and pressure ulcers-diagnosis. The questions on heterotopic ossification-general information (raters 1 and 3: 4 points; rater 2: 3 points), osteoporosis-diagnosis (rater 1: 3 points; raters 2 and 3: 4 points) and pressure ulcers-treatment (rater 1: 3 points; raters 2 and 3: 4 points) received the lowest reliability scores. The lowest usability score (all raters: 4 points) was for deep vein thrombosis-treatment and heterotopic ossification-treatment.

The total scores for the three sections and each rater's ratings are presented in Table 2. No statistically significant difference was found between the three raters in reliability scores for the 'general information' and 'treatment' sections (P = 0.918 and P = 0.886, respectively). In the 'complications' section, a statistically significant difference between raters was found for both reliability and usability (P = 0.001 and P < 0.001, respectively). In pairwise comparisons for reliability, statistically significant differences were found between raters 1 and 3 and between raters 2 and 3 (P = 0.001 and P = 0.004, respectively). In pairwise comparisons for usability, statistically significant differences were likewise observed between raters 1 and 3 and between raters 2 and 3 (P ≤ 0.001).

Table 2.

Comparison of total reliability and usability ratings across the sections.

Rater / score | General information | Complications | Treatment | P value (a)
Rater 1, reliability | 4.20 ± 8.36 | 4.44 ± 9.27 | 4.87 ± 0.35 | 0.266
Rater 1, usability | 4.80 ± 0.44 | 4.20 ± 8.36 | 5.75 ± 0.46 | 0.015
Rater 2, reliability | 4.40 ± 5.47 | 4.61 ± 0.81 | 4.87 ± 1.35 | 0.534
Rater 2, usability | 4.80 ± 0.83 | 4.40 ± 5.47 | 5.50 ± 0.75 | 0.200
Rater 3, reliability | 4.40 ± 5.47 | 5.38 ± 1.07 | 5.12 ± 1.12 | 0.147
Rater 3, usability | 5.80 ± 0.44 | 4.40 ± 5.47 | 5.75 ± 0.46 | 0.904
Reliability P (b) | 0.918 | 0.001 | 0.886 |
Usability P (b) | 0.042 | <0.001 | 0.743 |

Note: Data are expressed as mean ± standard deviation. (a) Kruskal–Wallis test. (b) Mann–Whitney U test.

Considering the mean scores of all raters for each of the three sections, the highest mean reliability score was for 'complications' (mean 5.38) and the lowest for the 'general information' section (mean 4.20). 'Treatment' had the highest mean usability score (mean 5.87), and the lowest mean usability score was again recorded in the 'general information' section (mean 4.80).

When the sections were compared within each rater, there was no statistically significant difference in reliability scores (P = 0.266, 0.534 and 0.147 for raters 1–3, respectively) or in usability scores (P = 0.200 and 0.904 for raters 2 and 3), with the exception of rater 1's usability scores (P = 0.015) (Table 2). For rater 1, usability differed significantly between 'general information' and 'treatment' and between 'complications' and 'treatment' (P = 0.009 and P = 0.0011, respectively).

Discussion

The main implication of this study is that the ChatGPT conversational bot, with some restrictions, provides reliable and useful answers to spinal cord injury-related questions. The 'general information' section had the lowest mean reliability and usability scores.

When the responses to all 47 questions were analyzed, the highest reliability scores were given for sexual dysfunction and infertility-diagnosis and pressure ulcers-diagnosis. Figure 1 shows ChatGPT's response on the diagnosis of pressure ulcers in SCI. Although most of the information provided, such as the clinical assessment, staging, wound assessment and laboratory tests used to diagnose pressure ulcers, is verified by medical scientific sources (14–16), two raters awarded 6 points because of incomplete information in the clinical assessment of the patient (intrinsic and extrinsic risk factors for pressure ulcers, and a detailed musculoskeletal and neurological examination). An important detail at the end of the answer is that it directs the user to health professionals, emphasizing the importance of a multidisciplinary approach to correct diagnosis. The highest usability scores were for respiratory problems-general information, sexual dysfunction and infertility-diagnosis, sexual dysfunction and infertility-treatment, and pressure ulcers-diagnosis. In light of this information, it can be assumed that the answers given by ChatGPT to questions about the diagnosis of sexual dysfunction and infertility and of pressure ulcers are reliable and useful.

Figure 1. ChatGPT's response about the diagnosis of pressure ulcers in spinal cord injury.

The questions about heterotopic ossification-general information, osteoporosis-diagnosis and pressure ulcers-treatment were given the lowest reliability scores, and the lowest usability scores were for deep vein thrombosis-treatment and heterotopic ossification-treatment. General information about heterotopic ossification was thus among the responses with the lowest reliability scores, and the treatment of heterotopic ossification among those with the lowest usability scores. Therefore, when seeking information on heterotopic ossification, patients should bear in mind that the answers provided by ChatGPT may contain inaccurate and/or incomplete information.

In this study, the highest mean reliability scores were found for 'complications' and the highest mean usability scores for 'treatment', while the lowest scores in both categories were found for the 'general information' section. A common observation of the raters was that ChatGPT provides more reliable and useful answers when asked about a more specific topic, and more superficial information on general topics such as epidemiology, risk factors and classification. This may explain the pattern described above.

It is known that patients use online platforms such as Google, YouTube, Twitter, WhatsApp and Instagram to obtain information about the diagnosis, complications and treatment of their disease (17, 18), and the use of artificial intelligence chatbots such as ChatGPT is expected to increase in the near future. The present finding that ChatGPT is generally reliable and useful for patients seeking information about SCI supports this expectation. Although there are studies investigating the use of social media such as Instagram and Twitter in SCI, there are no studies in the literature investigating the reliability and usability of ChatGPT in this field. A study analyzing SCI posts by patients on Instagram and Twitter found that the most common themes were spreading positivity, the appearance of the wheelchair, and recovery or rehabilitation (19). The authors interpreted the widespread use of the themes of positivity and awareness as supporting the role of social media as a support mechanism for people living with SCI. A study investigating how family carers of people with SCI use social media to support the patient found that they most frequently used Facebook, Instagram and YouTube, for emotional support, social companionship and informational support, respectively (17). However, artificial intelligence chatbots such as ChatGPT were not included in that study. Therefore, further studies are needed to compare ChatGPT with other social media platforms and with other artificial intelligence chatbots. In the study by Temel et al., the quality of ChatGPT's responses to 23 questions related to SCI was assessed using the Ensuring Quality Information for Patients tool, and readability was assessed using the Flesch–Kincaid grade level and Flesch–Kincaid reading ease parameters (20). They found that the answers given by ChatGPT were of low quality and high linguistic complexity, requiring a level of education equivalent to about 14–15 years of formal schooling to understand. However, that study did not assess the reliability and usability of the responses. The present study is therefore the first to assess the reliability and usability of ChatGPT's responses in the SCI population.

The limitations of the current study are that the sources of ChatGPT's answers are unknown, so their reliability cannot be verified, and that ChatGPT may give different answers to the same questions at different times. It should therefore be noted that if a biased source underlies an answer, the answer itself may be biased. Other limitations are the small numbers of raters and keywords. A strength of the study is that the raters performed their evaluations simultaneously to avoid bias.

Conclusion

It was concluded that ChatGPT's responses to questions about SCI were reliable and useful. Although AI chatbots such as ChatGPT may facilitate the acquisition of information by patients with chronic conditions such as SCI, it should be noted that ChatGPT may provide incorrect or incomplete information, which could lead to significant deficiencies in disease management. Developers of AI chatbots should verify that health-related data come from reliable sources and should provide regular updates. When ChatGPT is used in clinical practice, it should therefore be regarded as a supportive tool that helps clinicians and patients access information, with its benefits and potential harms taken into account; it cannot replace the clinician.

Disclaimer statements

Declaration of Generative AI and AI assisted technologies in the writing process: None.

Ethics approval and consent to participate: The approval of an ethics committee and informed consent were not required for this study, as no animals/human participants were involved.

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Availability of data and materials: The data of this study are available from the corresponding author upon reasonable request.

References

1. Turakhia MP, Desai SA, Harrington RA. The outlook of digital health for cardiovascular medicine: challenges but also extraordinary opportunities. JAMA Cardiol. 2016;1(7):743–744. doi: 10.1001/jamacardio.2016.2661.
2. Gunawan J. Exploring the future of nursing: insights from the ChatGPT model. Belitung Nurs J. 2023;9(1):1–5. doi: 10.33546/bnj.2551.
3. McCarthy J, Minsky ML, Shannon CE. A proposal for the Dartmouth summer research project on artificial intelligence – August 31, 1955. AI Mag. 2006;27(4):12–14.
4. Mintz Y, Brodie R. Introduction to artificial intelligence in medicine. Minim Invasive Ther Allied Technol. 2019;28(2):73–81. doi: 10.1080/13645706.2019.1575882.
5. OpenAI. ChatGPT: optimizing language models for dialogue. 2023 Feb 2 [accessed 2023 Feb 26]. https://openai.com/blog/chatgpt/.
6. Khan RA, Jawaid M, Khan AR, Sajjad M. ChatGPT – reshaping medical education and clinical management. Pak J Med Sci. 2023;39(2):605–607. doi: 10.12669/pjms.39.2.7653.
7. National Spinal Cord Injury Statistical Center. Spinal cord injury: facts and figures at a glance. 2019 [accessed 2019 Dec 4]. https://www.nscisc.uab.edu/.
8. van Diemen T, Verberne DPJ, Koomen PSJ, Bongers-Janssen HMH, van Nes IJW. Interdisciplinary follow-up clinic for people with spinal cord injury: a retrospective study of a carousel model. Spinal Cord Ser Cases. 2021;7(1):86. doi: 10.1038/s41394-021-00451-0.
9. Liddy C, Blazkho V, Mill K. Challenges of self-management when living with multiple chronic conditions: systematic review of the qualitative literature. Can Fam Physician. 2014;60(12):1123–1133.
10. Arcury TA, Sandberg JC, Melius KP, Quandt SA, Leng X, Latulipe C, Miller DP, Smith DA, Bertoni AG. Older adult internet use and eHealth literacy. J Appl Gerontol. 2020;39(2):141–150. doi: 10.1177/0733464818807468.
11. Uz C, Umay E. 'Dr ChatGPT': is it a reliable and useful source for common rheumatic diseases? Int J Rheum Dis. 2023;26(7):1343–1349. doi: 10.1111/1756-185X.14749.
12. Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, Ayoub W, Yang JD, Liran O, Spiegel B, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023;29(3):721–732. doi: 10.3350/cmh.2023.0089.
13. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174. doi: 10.2307/2529310.
14. Kottner J, Cuddigan J, Carville K, Balzer K, Berlowitz D, Law S, Litchford M, Mitchell P, Moore Z, Pittman J, et al. Prevention and treatment of pressure ulcers/injuries: the protocol for the second update of the international clinical practice guideline 2019. J Tissue Viability. 2019;28(2):51–58. doi: 10.1016/j.jtv.2019.01.001.
15. Kottner J, Cuddigan J, Carville K, Balzer K, Berlowitz D, Law S, Litchford M, Mitchell P, Moore Z, Pittman J, et al. Pressure ulcer/injury classification today: an international perspective. J Tissue Viability. 2020;29(3):197–203. doi: 10.1016/j.jtv.2020.04.003.
16. Delmore B, Ayello EA, Smart H, Tariq G, Sibbald RG. Survey results from the Gulf region: NPUAP changes in pressure injury terminology and definitions. Adv Skin Wound Care. 2019;32(3):131–138. doi: 10.1097/01.ASW.0000553108.70752.f6.
17. Cathcart HF, Mohammadi S, Erlander B, Robillard JM. Evaluating the role of social media in providing support for family caregivers of individuals with spinal cord injury. Spinal Cord. 2023;61(8):460–465 (correction: Spinal Cord. 2023;61(12):690. doi: 10.1038/s41393-023-00936-9).
18. Hogan TP, Hill JN, Locatelli SM, Weaver FM, Thomas FP, Nazi KM, Goldstein B, Smith BM. Health information seeking and technology use among veterans with spinal cord injuries and disorders. PM R. 2016;8(2):123–130. doi: 10.1016/j.pmrj.2015.06.443.
19. Gajjar AA, Le AHD, Jacobs R, Mooney JH, Lavadi RS, Kumar RP, White MD, Elsayed GA, Agarwal N. Patient perception of spinal cord injury through social media: an analysis of 703 Instagram and 117 Twitter posts. J Craniovertebr Junction Spine. 2023;14(3):288–291. doi: 10.4103/jcvjs.jcvjs_87_23.
20. Temel MH, Erden Y, Bağcıer F. Information quality and readability: ChatGPT's responses to the most common questions about spinal cord injury. World Neurosurg. 2023;181:e1138–e1144. doi: 10.1016/j.wneu.2023.11.062.
