The Journal of Allergy and Clinical Immunology: Global
Letter. 2024 Dec 21;4(1):100391. doi: 10.1016/j.jacig.2024.100391

Reply

Simon Høj a, Howraman Meteran b,c,d
PMCID: PMC11759544  PMID: 39867745

To the Editor:

We appreciate the constructive feedback provided by Kleebayoon and Wiwanitkit1 on our recent article, “Evaluating the scientific reliability of ChatGPT as a source of information on asthma.”2 Their insights raise important points regarding the framework for evaluating AI-driven models in health care, particularly the need for a robust and multifaceted approach.

First, we acknowledge their concerns regarding potential biases in question selection. Although thematic diversity is essential to a full evaluation of any information source, the aim of our study was not to assess ChatGPT across an exhaustive set of topics but rather to examine its reliability in answering typical patient queries about asthma. These questions were carefully selected on the basis of their frequency and relevance in clinical settings, ensuring that our assessment focused on the information that ChatGPT is most likely to provide to users. Expanding the thematic scope is an interesting avenue for future research, but for a practical assessment of patient-pertinent information, we believe that our design is well justified.

Regarding the evaluation methodology, our accuracy rating scale was intentionally chosen for simplicity and ease of interpretation. Although we recognize that such a scale may not capture the full complexity of medical information, it remains a widely used metric in AI evaluation studies.3, 4, 5, 6 Future work could indeed benefit from integrating more detailed qualitative data to capture nuances in response quality. We also share the interest that Kleebayoon and Wiwanitkit1 have expressed in ensuring consistency across raters. We used a panel of 5 professionals with expertise in asthma care who independently evaluated responses, achieving moderate agreement as measured by the Fleiss kappa (κ = 0.42; P < .001). However, further standardization could enhance interrater reliability.
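For readers interested in how such agreement statistics are typically derived, the following is a minimal sketch in Python (not our analysis code), assuming hypothetical ratings from 5 raters and using the statsmodels implementation of the Fleiss kappa:

    # Minimal sketch: chance-corrected agreement among 5 independent raters.
    # The ratings below are hypothetical and illustrative only.
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Rows = evaluated ChatGPT responses; columns = 5 raters;
    # values = accuracy ratings on an ordinal scale (here 1-5).
    ratings = np.array([
        [4, 4, 5, 4, 3],
        [5, 5, 5, 4, 5],
        [3, 2, 3, 3, 4],
        [4, 5, 4, 4, 4],
        [2, 3, 2, 3, 2],
    ])

    # Convert rater-level ratings into an item-by-category count table,
    # then compute the Fleiss kappa.
    table, _ = aggregate_raters(ratings)
    print(f"Fleiss kappa = {fleiss_kappa(table, method='fleiss'):.2f}")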

In response to the suggestion that follow-up questions be analyzed, we emphasize that our study’s primary objective was to assess ChatGPT’s performance on first-response accuracy, as initial responses are the most critical in health care information settings. Introducing follow-up interactions, although insightful, would shift the focus from ChatGPT’s ability to provide reliable primary answers toward a more complex analysis of how AI responses progress over an exchange. Such an approach was beyond our study’s scope, which prioritized assessing initial accuracy and the potential for patient comprehension. Nonetheless, future studies may indeed benefit from evaluating response sequences to examine depth and coherence over extended interactions.

Additionally, we support the recommendation of Kleebayoon and Wiwanitkit1 to explore improvements in model training, such as creating a feedback loop for continuous refinement based on expert input. This approach aligns with evolving AI applications in health care, in which real-time learning and updates could strengthen the reliability of AI models as information sources. Expanding ChatGPT’s language accessibility for broader global use, as suggested, also reflects an important direction for AI development.

The potential role of AI in health care communication is promising yet complex. We appreciate the emphasis that Kleebayoon and Wiwanitkit1 place on exploring AI’s broader role in the doctor-patient relationship and its capacity to facilitate informed decision making. Incorporating studies of patient comprehension and longitudinal assessments, as suggested, would undoubtedly enhance the rigor of future AI evaluations in health care.

We are grateful for the opportunity to respond to these valuable points raised by Kleebayoon and Wiwanitkit.1

Disclosure statement

Disclosure of potential conflict of interest: H. Meteran reports receiving honoraria for lectures or advisory board meetings from GSK, Teva, Novartis, Sanofi-Aventis, Airsonett AB, and ALK-Abelló Nordic A/S within the past 5 years, unrelated to this study. H. Meteran has received research grants from ALK-Abelló A/S outside this study.

References

1. Kleebayoon A., Wiwanitkit V. ChatGPT as a source of information on asthma. J Allergy Clin Immunol Glob. 2025;4. doi: 10.1016/j.jacig.2024.100390.
2. Høj S., Thomsen S.F., Ulrik C.S., Meteran H., Sigsgaard T., Meteran H. Evaluating the scientific reliability of ChatGPT as a source of information on asthma. J Allergy Clin Immunol Glob. 2024;3. doi: 10.1016/j.jacig.2024.100330.
3. Potapenko I., Boberg-Ans L.C., Stormly Hansen M., Klefter O.N., van Dijk E.H.C., Subhi Y. Artificial intelligence-based chatbot patient information on common retinal diseases using ChatGPT. Acta Ophthalmol. 2023;101:829–831. doi: 10.1111/aos.15661.
4. Goodman R.S., Patrinely J.R., Stone C.A., Zimmerman E., Donald R.R., Chang S.S., et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open. 2023;6. doi: 10.1001/jamanetworkopen.2023.36483.
5. Høj S., Thomsen S.F., Meteran H., Sigsgaard T., Meteran H. Artificial intelligence and allergic rhinitis: does ChatGPT increase or impair the knowledge? J Public Health (Oxf). 2023;46:123–126. doi: 10.1093/pubmed/fdad219.
6. Mohammadi S.S., Khatri A., Jain T., Thng Z.X., Yoo W.S., Yavari N., et al. Evaluation of the appropriateness and readability of ChatGPT-4 responses to patient queries on uveitis. Ophthalmol Sci. 2024;5. doi: 10.1016/j.xops.2024.100594.

