Paediatric & Neonatal Pain. 2026 Feb 17;8(1):e70023. doi: 10.1002/pne2.70023

A Longitudinal Analysis of the Usefulness, Readability, Consistency, and Capacity of Artificial Intelligence Chatbot Responses Regarding the Reality of Chronic Pain in Children

Joshua W Pate 1, Rebecca Fechner 1, Scott D Tagliaferri 2,3, Elan Schneider 4, Bruno Saragiotto 1,5
PMCID: PMC12914133  PMID: 41717540

ABSTRACT

This study aimed to assess longitudinal improvements in generative AI chatbot responses to a sensitive pediatric chronic pain prompt and to evaluate the impact of providing explicit scoring criteria on their performance. In January 2025, four GenAI chatbots (ChatGPT‐4o, Microsoft Copilot, Google Gemini 2.0 Experimental Advanced, and Claude Sonnet 3.5 v2) were each prompted 10 times: “I am a child with chronic pain. Is it all in my head?” Responses were scored using 10 predefined criteria (e.g., empathetic tone, evidence‐based content, and child‐friendly language). Readability was assessed by Flesch–Kincaid Grade Levels. Responses were compared to a baseline collected in January 2024. Subsequently, explicit scoring criteria were provided as context to the chatbots, and the test was repeated. Compared with January 2024, the January 2025 responses showed substantial improvements in usefulness, consistency, and readability across all chatbots. When provided with explicit scoring criteria, all systems achieved maximum usefulness scores (10/10) and attained a readability level below the 7th grade. The observed enhancements indicate rapid advancements in AI performance over 1 year. Structured guidance via explicit scoring criteria markedly improved the ability of the chatbots to deliver empathetic, evidence‐based, and accessible responses tailored to pediatric chronic pain concerns. These findings highlight the importance of continuous benchmarking as AI technologies evolve. GenAI chatbots can substantially improve in delivering high‐quality, contextually appropriate health information for pediatric chronic pain. Further research should refine evaluation metrics and explore multi‐prompt, real‐world applications to ensure robust and safe integration of AI in clinical practice.

Keywords: artificial intelligence, capacity, chatbots, children, chronic pain, consistency, readability, usefulness

1. Introduction

Generative Artificial Intelligence (GenAI) chatbots using Large Language Models (LLMs) have become increasingly accessible to the public. Such chatbots have the potential to provide tailored interactions with high‐quality resources [1]. However, the quality of consumer health resources, especially concerning pain management, remains inconsistent [2]. In early 2024, we evaluated the usefulness and consistency of GenAI chatbots addressing a common stigmatizing concern that children with chronic pain may convey [3]: the question, “Is it all in my head?” [4]. This evaluation served as a preliminary benchmark for assessing GenAI model performance in addressing sensitive health‐related queries from a non‐technical user. We found that the criteria for high‐quality clinical responses were not met and that the chatbot responses were not child‐friendly. Since then, new versions of each chatbot have been released.

Longitudinal studies are needed because the majority of healthcare studies on LLMs have been cross‐sectional [5]. However, LLM behavior can change significantly over time due to model updates [6]. Replicating our “zero‐shot prompt” study 1 year later provides important insights into how GenAI chatbots have changed in terms of trustworthy patient‐centered communication.

The potential of GenAI chatbots to improve their responses when provided with specific benchmarking criteria has not been assessed. For this scenario of pediatric chronic pain, the capacity of current systems to respond more optimally when given our benchmarking criteria as context is unknown. The primary aim of this study was to assess longitudinal changes in the usefulness, readability, and consistency of responses provided by common GenAI chatbots to the aforementioned question, and our second (exploratory) aim was to assess the capacity of each chatbot to score higher when given the explicit scoring criteria as context.

2. Methods

This longitudinal study was exempt from approval by a Human Research Ethics Committee and exempt from requiring informed consent because no identifiable patient‐level data were used, and the data were public. We followed the TRIPOD‐LLM reporting guideline which is designed for the transparent reporting of studies of large language models [7].

2.1. Data Collection

The prompt (“I am a child with chronic pain. Is it all in my head?”) had been tested under the same protocol 1 year earlier (2 January 2024) on the chatbot versions available at that time.

On 2 January 2025, we prompted the following commonly accessed chatbots: ChatGPT 4o; Microsoft Copilot; Google Gemini 2.0 Experimental Advanced; and Claude Sonnet 3.5 v2. Each chatbot was presented with the same prompt 10 times, using a new chatbot session each time to ensure no memory or learning effects occurred.

Following this, we provided our scoring criteria and asked each chatbot to “try responding again knowing this context”. These response criteria (Appendix S1) included: (1) sorry to hear, (2) actively addresses that pain is not in your head, (3) pain is multifactorial, (4) referral to health professional, (5) evidence‐based, (6) not alone, (7) coaching tone, (8) asked clarifying questions, (9) child‐friendly language (Flesch–Kincaid Grade Level < 7), and (10) word count not too short or long (100–300 words). Each chatbot response was blinded and randomly ordered (by JWP using a random number generator) before evaluation.
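
As a rough illustration of this blinding step, the following is a hypothetical Python sketch; the study only specifies that JWP used a random number generator, so the structure and variable names below are assumptions, not the authors' actual procedure.

import random

# Hypothetical sketch: strip chatbot labels and shuffle responses before rating.
responses = [
    {"chatbot": "ChatGPT-4o", "text": "I'm really sorry to hear that you..."},
    {"chatbot": "Copilot", "text": "Chronic pain is very real, and..."},
    # ...10 responses per chatbot in the actual study
]

rng = random.Random()                      # seed not reported; left unseeded here
blinded = [r["text"] for r in responses]   # drop the chatbot label (blinding)
rng.shuffle(blinded)                       # random presentation order for raters
for i, text in enumerate(blinded, start=1):
    print(f"Response #{i}: {text[:60]}...")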

2.2. Data Analysis

The usefulness of responses was analyzed using a predefined scoring system of 10 criteria developed by the authors and deemed to contribute to a high‐quality clinical response to our question. Appendix S1 provides definitions for each criterion and references to related literature. One point was given for each criterion met, so scores ranged from 0 to 10 for each response. Chatbot responses were scored independently by two blinded raters (JWP and ES). Discrepancies in scoring were discussed, and if agreement could not be reached, RF adjudicated. A mean score, SD, and range were calculated for each chatbot based on the 10 responses.
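
To make this aggregation concrete, here is a minimal sketch of the per‐chatbot summary statistics; the scores are illustrative values only, not the study data.

from statistics import mean, stdev

# Hypothetical usefulness scores (0-10) for one chatbot's 10 responses.
scores = [8, 9, 8, 7, 8, 9, 8, 8, 7, 8]

print(f"mean = {mean(scores):.1f}")             # average usefulness
print(f"SD = {stdev(scores):.1f}")              # spread across the 10 responses
print(f"range = {min(scores)}-{max(scores)}")   # lowest to highest score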

The readability of responses was evaluated using the Flesch–Kincaid Grade Level score, calculated in Microsoft Word. Interrater consistency was calculated for each criterion using prevalence‐adjusted and bias‐adjusted kappa (PABAK).
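
For readers who want to reproduce these metrics outside Microsoft Word, the sketch below uses the published Flesch–Kincaid Grade Level formula and the two‐rater PABAK definition (2 × observed agreement − 1). Note that Word's implementation may differ slightly from the published formula, and the inputs below are illustrative, not the study data.

def fkgl(words: int, sentences: int, syllables: int) -> float:
    # Standard Flesch-Kincaid Grade Level formula from raw text counts.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def pabak(rater_a: list, rater_b: list) -> float:
    # Prevalence- and bias-adjusted kappa for two raters on a binary criterion.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    return 2 * observed - 1

# Illustrative inputs only:
print(f"FKGL = {fkgl(words=250, sentences=20, syllables=330):.1f}")           # ~4.9
print(f"PABAK = {pabak([1,1,0,1,1,1,0,1,1,1], [1,1,0,1,1,1,1,1,1,1]):.2f}")   # 0.80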

The consistency of responses was assessed via the magnitude of the SD of the mean usefulness scores and reading level grades. The raters also took descriptive notes to explain their scores and guide future research.

To assess chatbot capacity (for the exploratory aim), we scored the usefulness and readability of each chatbot once the scoring criteria were provided as context.

3. Results

3.1. Usefulness

Figure 1 contains a summary of the usefulness scores for each chatbot, compared to the data from 1 year earlier. Inter‐assessor reliability PABAK values for each scoring criterion ranged from 0.64 to 1.00. Across all chatbots, responses met between 6 and 9 of the ten predetermined criteria for a high‐quality clinical response (mean (SD) = 7.6 (1.2)) when prompted about the reality of chronic pain. Compared with 1 year earlier, mean usefulness improved overall; the gains for Claude (from 3.1 to 8.3) and Copilot (from 6.8 to 8.8) were particularly notable. No chatbots refused to respond to the prompt. All individual responses are presented in Appendix S2.

FIGURE 1. Mean total usefulness scores of each chatbot when prompted 10 times. The lighter green bar shows the score from a single attempt after the marking criteria were provided.

3.2. Readability

Figure 2 summarizes the readability scores for each chatbot, compared to the data from 1 year earlier. Across all of the chatbots, readability scores ranged from grade 5 to 23 (mean (SD) = 10.8 (3.7)). Compared to 1 year ago, responses were generally more child‐friendly.

FIGURE 2. Mean readability scores of each chatbot when prompted 10 times, with a red horizontal line indicating the grade 7 reading level. The lighter blue bar shows the score from a single attempt after the marking criteria were provided.

3.3. Consistency

The small SD of the mean usefulness scores within the ten responses from each chatbot in 2025 (SD range = 0.4–0.9, compared to 0.4–2.6 in 2024) suggests that the ten responses each chatbot produced were similarly useful. The larger SD of the reading level grades within the ten responses from each chatbot in 2025 (SD range = 1.7–4.6, compared to 1.2–5.2 in 2024) demonstrates ongoing fluctuation in the readability of responses with repeated prompting. Descriptive observations by the raters included: (1) ChatGPT‐4o often varied its headings and structure, starting with “sorry to hear” only 4 out of 10 times and occasionally using emojis; (2) Gemini 2.0 consistently started with the same phrase but never included “sorry to hear”; (3) Claude Sonnet was mostly consistent in saying “sorry to hear” (9 out of 10 times), mentioned “not alone” twice, and was child‐friendly in 3 out of 10 responses; and (4) Copilot provided two identical responses (#3 and #7) and two very similar responses (#1 and #8), but was otherwise consistent in length, tone, and structure, covered “not alone” 9 out of 10 times, and consistently had a word count of 100–150 words, which was shorter than the other chatbots.

3.4. Capacity

For our exploratory aim, the usefulness of each chatbot immediately improved to 10/10 once the scoring criteria were provided as context (Figure 1). Readability also improved to below grade 7 for all of the chatbots (Figure 2).

4. Discussion

4.1. Key Findings

In the context of patient‐centered communication, AI chatbots improved in their usefulness, readability, and consistency from January 2024 to January 2025, with results approaching ceiling effects. Beyond this, when the explicit scoring criteria were provided, the AI chatbots improved further, immediately reaching our benchmarks (10/10 usefulness, reading level grade < 7). These findings raise important new questions, such as “Should we move the goalposts?” and “What role do we want AI chatbots to play in healthcare?” Despite a substantial reduction in the risks of AI chatbot “hallucinations” and “falsehood mimicry” [8] that were inherent in AI responses 1 year earlier, we currently advocate for a cautious approach to future implementation because AI chatbots are changing rapidly.

4.2. Comparison to Previous Literature

Our findings align with recent data showing that AI chatbots can provide high‐quality, empathetic responses to patient inquiries, sometimes surpassing human physicians [9, 10]. The idea that structured prompts can enhance the quality of chatbot‐generated health information is supported by recent literature emphasizing the importance of prompt engineering in AI interactions [11]. It is “early days” in evaluating current chatbots [12], but our promising findings are largely comparable to findings in other health fields [13, 14].

4.3. Strengths and Limitations

One strength of this study is that all data are provided in Appendix S2. A key limitation of this line of research is the potential for rapid changes in AI models that may outpace our evaluation. A limitation to generalizability is our focus on a single prompt, which may not capture the full extent of GenAI chatbot usefulness, readability, and capacity. A further limitation is that we did not involve end‐users (children and their families) in the study.

4.4. Clinical Implications

New, more detailed benchmarks comparing AI chatbot capabilities to those of human health professionals could inform implementation research directions such as family‐facing platforms, integration into clinical appointments, educational tools, and telehealth support. For example, some of the arbitrary thresholds in our scoring did not allow for a nuanced analysis (e.g., a word count of 300 earns one point whereas 301 does not, and a reading level of grade 7.1 is scored the same as the worst possible reading level). In addition, five of the 10 benchmarks had a ceiling effect (full scores of 1/1 for these items): responses always (A) directly addressed the concern of the reality of pain, (B) explored the multifaceted nature of pain, (C) referred the prompter to a health professional, (D) aligned broadly with the evidence, and (E) had a coaching tone. To better distinguish between more and less useful responses, modifications to the benchmarks should be made [15]; raising the benchmarks should, in turn, improve real‐world usability. During the rating process, the raters identified that Claude's responses appeared to have a more curious tone than Copilot's, but the current usefulness scoring system did not distinguish between them. A Likert scale scoring system may provide more granular information about ‘how much’ a chatbot response achieves a particular benchmark. Further to this, not all criteria are equally important for a high‐quality response. For example, addressing the biopsychosocial nature of pain could be more essential than a brief expression of sympathy. Assigning weights to criteria (e.g., “coaching tone” could be worth more than word count) could better reflect real‐world usefulness and the relative importance of each criterion in evaluating responses; a sketch of such a weighted scheme follows this paragraph. For example, a chatbot that acknowledges a patient's distress with a validating, conversational tone may be perceived as more helpful by families, even if the core information remains unchanged. Broader criteria could also be broken down into measurable sub‐criteria; for example, levels of validation and/or invalidation could be scored [16]. Once refined benchmarks are developed and tested, relevant clinical trials can be conducted [17], and ethical considerations such as potential misinterpretations and data privacy need to be explored.
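
As a concrete illustration of the weighted Likert scoring proposed above, here is a minimal Python sketch; the criterion names, weights, and 0–4 scale are hypothetical choices for illustration, not a validated instrument.

# Hypothetical weights: core content counts more than a brief sympathy phrase.
WEIGHTS = {
    "addresses_reality_of_pain": 2.0,
    "coaching_tone": 1.5,
    "sorry_to_hear": 0.5,
    "word_count": 0.5,
}

def weighted_score(ratings: dict, max_likert: int = 4) -> float:
    # Combine 0-4 Likert ratings into a single weighted score out of 10.
    total = sum(WEIGHTS[criterion] * rating for criterion, rating in ratings.items())
    best_possible = sum(WEIGHTS.values()) * max_likert
    return 10 * total / best_possible

example = {"addresses_reality_of_pain": 4, "coaching_tone": 3,
           "sorry_to_hear": 2, "word_count": 4}
print(f"weighted usefulness = {weighted_score(example):.1f} / 10")   # 8.6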

4.5. Future Research

Multiple prompts within a single conversation should be evaluated to better understand chatbot consistency and adaptability [18]. Future research could also examine the consistency of guided responses by providing more context. Given that all four chatbots achieved a 10/10 usefulness score immediately when given context, AI‐based systems with clear guidance could now be developed and tested so that the clinical effect of such technology can be determined [19] for this particular scenario of a complex, short query from a child. Another avenue is to analyze the longer‐form, referenced outputs of new “thinking” chatbots that take time to reason (e.g., “ChatGPT o1” and “Google Gemini Pro 1.5 with Deep Research”). As AI systems evolve to include multimodal outputs, evaluating their performance in delivering information through audio and video, potentially via virtual avatars or AI‐generated health professionals, will be important [20, 21]. Research in real‐world settings to examine user engagement and trust, and research in languages other than English, is also needed.

5. Conclusion

The substantial improvements in GenAI chatbot performance over the past 12 months highlight their potential in interactive health information dissemination. Ongoing evaluation and cautious integration into healthcare settings are advised.

Funding

The authors have nothing to report.

Conflicts of Interest

J.W.P. has received speaker fees for presentations on pain and rehabilitation. He receives royalties for educational children's books. S.D.T. is supported by a Sir Randal Heymanson Fellowship awarded through the University of Melbourne. Other authors have no conflicts of interest to disclose.

Supporting information

Appendix S1: Explanations for the 10 criteria and supporting references.

PNE2-8-e70023-s001.docx (15.9KB, docx)

Appendix S2: All of the chatbot responses and their scoring once discrepancies between blinded raters were resolved.

PNE2-8-e70023-s003.xlsx (29.4KB, xlsx)

Appendix S3: TRIPOD‐LLM Reporting Checklist.

PNE2-8-e70023-s002.pdf (692.6KB, pdf)

Pate J. W., Fechner R., Tagliaferri S. D., Schneider E., and Saragiotto B., “A Longitudinal Analysis of the Usefulness, Readability, Consistency, and Capacity of Artificial Intelligence Chatbot Responses Regarding the Reality of Chronic Pain in Children,” Paediatric and Neonatal Pain 8, no. 1 (2026): e70023, 10.1002/pne2.70023.

Data Availability Statement

All data relevant to the study are included in the article.

References

1. Heidt A., “‘Without These Tools, I'd Be Lost’: How Generative AI Aids in Accessibility,” Nature 628, no. 8007 (2024): 462–463.
2. Ozduran E. and Büyükçoban S., “Evaluating the Readability, Quality and Reliability of Online Patient Education Materials on Post‐COVID Pain,” PeerJ 10 (2022): e13686.
3. Pate J. W., Fechner R., Tagliaferri S. D., Leake H., and Saragiotto B., “Prompt Again: How Consistently Useful Are Artificial Intelligence Chatbot Responses When Prompted With Concerns About the Reality of Paediatric Chronic Pain?,” Paediatric and Neonatal Pain 6, no. 4 (2024): 111–163.
4. Hintz E. A., “It's All in Your Head: A Meta‐Synthesis of Qualitative Research About Disenfranchising Talk Experienced by Female Patients With Chronic Overlapping Pain Conditions,” Health Communication 38, no. 11 (2023): 2501–2515.
5. Wang L., Bhanushali T., Huang Z., Yang J., Badami S., and Hightow‐Weidman L., “Evaluating Generative AI in Mental Health: Systematic Review of Capabilities and Limitations,” JMIR Mental Health 12 (2025): e70014.
6. Ignjatović A., Anđelković Apostolović M., Stevanović L., et al., “ChatGPT's Progress Over Time: A Longitudinal Study Enhancing Biostatistical Problem‐Solving in Medical Education,” Health Informatics Journal 31, no. 3 (2025): 14604582251381260.
7. Gallifant J., Afshar M., Ameen S., et al., “The TRIPOD‐LLM Reporting Guideline for Studies Using Large Language Models,” Nature Medicine 31, no. 1 (2025): 60–69.
8. Lee P., Bubeck S., and Petro J., “Benefits, Limits, and Risks of GPT‐4 as an AI Chatbot for Medicine,” New England Journal of Medicine 388, no. 13 (2023): 1233–1239.
9. Ayers J. W., Poliak A., Dredze M., et al., “Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum,” JAMA Internal Medicine 183, no. 6 (2023): 589–596.
10. Brodeur P. G., Buckley T. A., Kanjee Z., et al., “Superhuman Performance of a Large Language Model on the Reasoning Tasks of a Physician,” arXiv preprint arXiv:2412.10849 (2024).
11. Ward E. and Gross C., “Evolving Methods to Assess Chatbot Performance in Health Sciences Research,” JAMA Internal Medicine 183, no. 9 (2023): 1030–1031.
12. Alsharhan A., Al‐Emran M., and Shaalan K., “Chatbot Adoption: A Multiperspective Systematic Review and Future Research Agenda,” IEEE Transactions on Engineering Management 71 (2024): 10232–10244.
13. Puerto Nino A. K., Garcia Perez V., Secco S., et al., “Can ChatGPT Provide High‐Quality Patient Information on Male Lower Urinary Tract Symptoms Suggestive of Benign Prostate Enlargement?,” Prostate Cancer and Prostatic Diseases 28 (2024): 167–172.
14. Syvänen S. and Valentini C., “The Diffusion of Chatbot Research Across Disciplines: A Systematic Literature Review,” in Organizational Communication in the Digital Era: Examining the Impact of AI, Chatbots, and COVID‐19, ed. Ndlela M. N. (Springer Nature Switzerland, 2024), 11–49.
15. Haug C. J. and Drazen J. M., “Artificial Intelligence and Machine Learning in Clinical Medicine, 2023,” New England Journal of Medicine 388, no. 13 (2023): 1201–1208.
16. Fruzzetti A., Validating and Invalidating Behavior Coding Scale Manual (University of Nevada, Reno, 2010).
17. Rathinam A. K., Lee Y., Ling D. N. C., Singh R., and Selvaratnam L., “Artificial Intelligence in Medicine: A Review of Challenges in Implementation and Disparity,” in 2021 IEEE International Conference on Health, Instrumentation & Measurement, and Natural Sciences (InHeNce) (IEEE, 2021).
18. Yu H. and McGuinness S., “An Experimental Study of Integrating Fine‐Tuned LLMs and Prompts for Enhancing Mental Health Support Chatbot System,” Journal of Medical Artificial Intelligence (2024): 1–16.
19. Aggarwal A., Tam C. C., Wu D., Li X., and Qiao S., “Artificial Intelligence–Based Chatbots for Promoting Health Behavioral Changes: Systematic Review,” Journal of Medical Internet Research 25 (2023): e40789.
20. Buckley T. A., Diao J. A., Rajpurkar P., Rodman A., and Manrai A. K., “Multimodal Foundation Models Exploit Text to Make Medical Image Predictions” (2023).
21. Jin Q., Chen F., Zhou Y., et al., “Hidden Flaws Behind Expert‐Level Accuracy of GPT‐4 Vision in Medicine,” arXiv preprint arXiv:2401.08396 (2024).
