Abstract
This qualitative study assesses the understandability, actionability, and procedure-specific content of postoperative instructions generated by ChatGPT, Google Search, and Stanford University.
ChatGPT (generative pretrained transformer), an artificial intelligence–powered language model chatbot, has been described as an innovative resource for many industries, including health care.1 Lower health literacy and limited understanding of postoperative instructions have been associated with worse outcomes.2,3 While ChatGPT cannot currently supplant a human clinician, it can serve as a medical knowledge source. This qualitative study assessed the value of ChatGPT in augmenting patient knowledge and generating postoperative instructions for use in populations with low educational or health literacy levels.
Methods
We analyzed postoperative patient instructions for 8 common pediatric otolaryngologic procedures: tympanostomy tube placement, tonsillectomy and adenoidectomy, inferior turbinate reduction, tympanoplasty, cochlear implant, neck mass resection, microdirect laryngoscopy and bronchoscopy, and tongue-tie release. The Stanford University Institutional Review Board deemed this study exempt from review and waived the informed consent requirement given the study design. We followed the Standards for Reporting Qualitative Research (SRQR) reporting guideline.
Postoperative instructions were obtained from ChatGPT, Google Search, and Stanford University (hereafter, institution). The following prompt was entered into ChatGPT: "Please provide postoperative instructions for the family of a child who just underwent a [procedure]. Provide them at a 5th grade reading level." Similarly, the following query was entered into Google Search: "My child just underwent [procedure]. What do I need to know and watch out for?" The first nonsponsored Google Search results were used for analysis. Results were extracted and blinded; to enable adequate blinding, we standardized all fonts and removed audiovisuals (eg, pictures). Two of us (N.F.A., Y.-J.L.) scored the instructions.
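As an illustration of how the standardized wording above maps onto each procedure, the following R sketch generates the full prompt and query strings; the object names (`procedures`, `chatgpt_prompts`, `google_queries`) are illustrative and not part of the study protocol.

```r
# Illustrative sketch: build the standardized ChatGPT prompt and Google Search
# query for each of the 8 procedures before manual entry into each tool.
procedures <- c(
  "tympanostomy tube placement",
  "tonsillectomy and adenoidectomy",
  "inferior turbinate reduction",
  "tympanoplasty",
  "cochlear implant",
  "neck mass resection",
  "microdirect laryngoscopy and bronchoscopy",
  "tongue-tie release"
)

chatgpt_prompts <- sprintf(
  "Please provide postoperative instructions for the family of a child who just underwent a %s. Provide them at a 5th grade reading level.",
  procedures
)

google_queries <- sprintf(
  "My child just underwent %s. What do I need to know and watch out for?",
  procedures
)

writeLines(chatgpt_prompts[1])
```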
The primary outcome was the Patient Education Materials Assessment Tool–printable (PEMAT-P)4 score, which assessed the understandability and actionability of instructions for patients of different backgrounds and health literacy levels. As a secondary outcome, instructions were scored on whether they addressed procedure-specific items. We generated, a priori, a list of 4 items specific to each procedure that were deemed important for the instructions to mention; these items are listed in the Table 1 footnote.
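For readers unfamiliar with PEMAT-P arithmetic, a minimal sketch is shown below. It assumes the standard PEMAT convention of rating each item as agree (1), disagree (0), or not applicable (NA) and reporting the percentage of applicable items rated agree; the `pemat_score` helper is illustrative, not the study's scoring code.

```r
# Minimal sketch of PEMAT-P percentage scoring (assumed convention: 1 = agree,
# 0 = disagree, NA = not applicable; score = % of applicable items rated agree).
pemat_score <- function(item_ratings) {
  applicable <- item_ratings[!is.na(item_ratings)]
  100 * sum(applicable) / length(applicable)
}

# Hypothetical example: 9 of 11 applicable items rated agree -> about 82%.
pemat_score(c(1, 1, 1, 0, 1, 1, 1, NA, 1, 0, 1, 1))
```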
Table 1. Understandability, Actionability, and Procedure-Specific Scores for Each Procedure.
| Procedure and instructions source | PEMAT-P understandability score, % | PEMAT-P actionability score, % | Procedure-specific items score, %a |
|---|---|---|---|
| Tympanostomy tube placement | |||
| Institutionb | 91 | 80 | 100 |
| Google Search | 82 | 100 | 75 |
| ChatGPT | 82 | 80 | 100 |
| Tonsillectomy and adenoidectomy | |||
| Institutionb | 91 | 80 | 100 |
| Google Search | 82 | 100 | 100 |
| ChatGPT | 82 | 80 | 100 |
| Inferior turbinate reduction | |||
| Institutionb | 91 | 100 | 100 |
| Google Search | 82 | 80 | 75 |
| ChatGPT | 73 | 80 | 100 |
| Tympanoplasty | |||
| Institutionb | 91 | 100 | 100 |
| Google Search | 82 | 100 | 100 |
| ChatGPT | 82 | 80 | 100 |
| Cochlear implant | |||
| Institutionb | 91 | 100 | 100 |
| Google Search | 82 | 40 | 0 |
| ChatGPT | 82 | 20 | 100 |
| Neck mass resection | |||
| Institutionb | 91 | 80 | 75 |
| Google Search | 82 | 80 | 100 |
| ChatGPT | 82 | 80 | 75 |
| Microdirect laryngoscopy and bronchoscopy | |||
| Institutionb | 91 | 80 | 100 |
| Google Search | 82 | 80 | 75 |
| ChatGPT | 82 | 80 | 100 |
| Tongue-tie release | |||
| Institutionb | 91 | 100 | 100 |
| Google Search | 73 | 80 | 50 |
| ChatGPT | 82 | 80 | 100 |
Abbreviation: PEMAT-P, Patient Education Materials Assessment Tool–printable.
a Reviewers analyzed whether each instruction discussed 4 items specific to each procedure. Tympanostomy tube placement: (1) follow-up/tube check, (2) otorrhea, (3) ear drops, (4) when to call a clinician. Tonsillectomy and adenoidectomy: (1) pain management, (2) what to do if there is bleeding, (3) oral hydration, (4) when to call a clinician. Inferior turbinate reduction: (1) what to do if there is bleeding, (2) nasal sprays, (3) pain management, (4) when to call a clinician. Microdirect laryngoscopy and bronchoscopy: (1) pain management, (2) difficulty breathing, (3) difficulty swallowing, (4) when to call a clinician. Neck mass resection: (1) pain management, (2) difficulty breathing/swallowing, (3) swelling, (4) when to call a clinician. Tympanoplasty: (1) dry ear precautions, (2) ear drops, (3) pain management, (4) when to call a clinician. Cochlear implant: (1) what to do if there is fever, (2) what to do if there is swelling, (3) pain management, (4) wound care. Tongue-tie release: (1) pain, (2) difficulty eating, (3) bleeding, (4) when to call a clinician.
b Standardized postoperative instructions from Stanford University School of Medicine.
Scores were compared using 1-way analysis of variance and Kruskal-Wallis tests, with η² (90% CI) reported as the effect size.5 Analysis was performed on February 6, 2023, using R, version 4 (R Core Team).
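A minimal sketch of this comparison is shown below, using the actionability scores transcribed from Table 1 as example input. The object names are illustrative, and the sketch is not the study's analysis code, so the resulting effect size may not match the reported value exactly.

```r
# Sketch of the group comparison (assumed layout: one row per procedure and source).
actionability <- data.frame(
  source = rep(c("Institution", "Google Search", "ChatGPT"), each = 8),
  score  = c(80, 80, 100, 100, 100, 80, 80, 100,  # Institution (Table 1)
             100, 100, 80, 100, 40, 80, 80, 80,   # Google Search
             80, 80, 80, 80, 20, 80, 80, 80)      # ChatGPT
)

fit <- aov(score ~ source, data = actionability)   # 1-way analysis of variance
summary(fit)
kruskal.test(score ~ source, data = actionability) # nonparametric comparison

# Eta squared from the ANOVA table: SS_between / SS_total.
ss <- summary(fit)[[1]][["Sum Sq"]]
ss[1] / sum(ss)
```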
Results
Overall, understandability scores ranged from 73% to 91%; actionability scores, 20% to 100%; and procedure-specific items, 0% to 100% (Table 1). ChatGPT-generated instructions were scored from 73% to 82% for understandability, 20% to 80% for actionability, and 75% to 100% for procedure-specific items.
Institution-generated instructions consistently had the highest scores (Table 2). Understandability scores were higher for institution instructions (91%) than for ChatGPT (81%) and Google Search (81%) instructions (η², 0.86; 90% CI, 0.67-1.00). Actionability scores were lowest for ChatGPT (73%), intermediate for Google Search (83%), and highest for institution (92%) instructions (η², 0.22; 90% CI, 0.04-0.55). For procedure-specific items, ChatGPT (97%) and institution (97%) instructions had the highest scores and Google Search instructions had the lowest (72%) (η², 0.23; 90% CI, 0-0.64).
Table 2. Comparison of ChatGPT, Google Search, and Institution Instructions.
| Score | ChatGPT, % | Google Search, % | Institutiona, % | η² (90% CI) |
|---|---|---|---|---|
| PEMAT-P total | 78 | 81 | 91 | 0.52 (0.16-0.68) |
| PEMAT-P understandability | 81 | 81 | 91 | 0.86 (0.67-1.00) |
| PEMAT-P actionability | 73 | 83 | 92 | 0.22 (0.04-0.55) |
| Procedure-specific items | 97 | 72 | 97 | 0.23 (0-0.64) |
Abbreviation: PEMAT-P, Patient Education Materials Assessment Tool–printable.
a Standardized postoperative instructions from Stanford University School of Medicine.
Discussion
These findings suggest that ChatGPT can provide instructions that are helpful for patients at a fifth-grade reading level or with varying health literacy levels. However, ChatGPT-generated instructions scored lower than institution instructions in understandability and actionability and lower than Google Search instructions in actionability, although they matched institution instructions and exceeded Google Search instructions in procedure-specific content. Despite these findings, ChatGPT may be beneficial for patients and clinicians, especially when alternative resources are limited.
Online search engines are common sources of medical information for the public: 7% of Google searches are health-related.6 However, ChatGPT has advantages over search engines: it is free, can be customized to different literacy levels, and provides succinct information. ChatGPT provides direct answers that are often well written, detailed, and in if-then format, giving patients immediate information while they wait to reach a clinician.
Study limitations were that only a few procedures and resources were analyzed and that the analysis was performed only in English. ChatGPT limitations included a lack of citations; the inability of users to confirm the accuracy of the information or explore topics further; and a knowledge base ending in 2021, which excludes the latest data, events, and practices.
References
- 1. Roose K. How ChatGPT kicked off an A.I. arms race. New York Times. February 3, 2023. Accessed February 5, 2023. https://www.nytimes.com/2023/02/03/technology/chatgpt-openai-artificial-intelligence.html
- 2. Theiss LM, Wood T, McLeod MC, et al. The association of health literacy and postoperative complications after colorectal surgery: a cohort study. Am J Surg. 2022;223(6):1047-1052. doi:10.1016/j.amjsurg.2021.10.024
- 3. De Oliveira GS Jr, McCarthy RJ, Wolf MS, Holl J. The impact of health literacy in the care of surgical patients: a qualitative systematic review. BMC Surg. 2015;15:86. doi:10.1186/s12893-015-0073-6
- 4. Shoemaker SJ, Wolf MS, Brach C. The Patient Education Materials Assessment Tool (PEMAT) and User's Guide. AHRQ Publication No. 14-0002-EF. Agency for Healthcare Research and Quality; 2013.
- 5. Lakens D. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol. 2013;4:863. doi:10.3389/fpsyg.2013.00863
- 6. Drees J. Google receives more than 1 billion health questions every day. Becker's Health IT. March 11, 2019. Accessed February 2023. https://www.beckershospitalreview.com/healthcare-information-technology/google-receives-more-than-1-billion-health-questions-every-day.html
