Author manuscript; available in PMC: 2025 Aug 25.
Published in final edited form as: Laryngoscope. 2023 Dec 21;134(6):2757–2761. doi: 10.1002/lary.31243

Exploring the Role of Artificial Intelligence Chatbots in Preoperative Counseling for Head and Neck Cancer Surgery

Jason C Lee 1, Chelsea S Hamill 1, Yelizaveta Shnayder 1, Erin Buczek 1, Kiran Kakarala 1, Andrés M Bur 1
PMCID: PMC12374230  NIHMSID: NIHMS2102963  PMID: 38126511

Abstract

Objective:

To evaluate the potential use of artificial intelligence (AI) chatbots, such as ChatGPT, in preoperative counseling for patients undergoing head and neck cancer surgery.

Study Design:

Cross-Sectional Survey Study.

Setting:

Single institution tertiary care center.

Methods:

ChatGPT was used to generate presurgical educational information, including indications, risks, and recovery time, for five common head and neck surgeries. Chatbot-generated information was compared with information gathered from a simple browser search (the first publicly available website, excluding scholarly articles). Accuracy, readability, thoroughness, and number of errors were compared in a blinded fashion by five experienced head and neck surgeons. Each surgeon then chose a preference between the two information sources for each surgery.

Results:

With the exception of total word count, ChatGPT-generated pre-surgical information had similar readability, content of knowledge, accuracy, thoroughness, and number of medical errors when compared with publicly available websites. Additionally, ChatGPT was preferred 48% of the time by experienced head and neck surgeons.

Conclusion:

Head and neck surgeons rated ChatGPT-generated and readily available online educational materials similarly. Further refinement in AI technology may soon open more avenues for patient counseling. Future investigations into the medical safety of AI counseling and exploring patients’ perspectives would be of strong interest.

Level of evidence:

N/A.

Keywords: artificial intelligence, chatbot, ChatGPT, head and neck surgery, patient counseling

INTRODUCTION

The treatment of head and neck cancer (HNC) has significant long-term physical, functional, and psychological effects on patients. Studies have shown that appropriate and timely patient education and counseling can improve patient satisfaction and overall quality of life in cancer survivors.1 Unfortunately, up to 93% of cancer patients report a lack of information before and after cancer treatment,2 which is correlated with post-treatment depression, anxiety, and reduced quality of life in HNC patients.3 Furthermore, a longitudinal study of HNC patients found that as many as four in five patients experienced decision regret 3 to 6 months after their cancer treatment.4 Despite its importance, preoperative counseling remains a significant challenge in the HNC population due to time constraints during encounters and the complexity of treatment.

Artificial intelligence (AI) has the potential to overcome critical barriers to preoperative counseling in HNC patients. AI is increasingly employed in medicine, ranging from diagnostics and treatment planning to drug development, thanks to its unparalleled capacity for pattern recognition using large volumes of data. In otolaryngology, for instance, AI and machine learning have been used for screening, diagnosis, and treatment decisions in rhinology,5 otology,6 laryngology,7 and head and neck oncology.8 We have also previously demonstrated the utility of AI for predicting head and neck melanoma patients with a low risk of nodal metastasis9 and identifying patients with oral cavity squamous cell carcinoma at risk for occult nodal disease.10

Recently, OpenAI released its latest conversational AI chatbot, ChatGPT.11 The software was developed using deep learning and natural language processing.12 The platform was trained on a large amount of information from textbooks, websites, and scientific literature, totaling terabytes of data. The purpose of the platform is to develop a publicly available AI capable of understanding and generating human-like responses based on user prompts. Importantly, chatbots are freely available to the general public, and their interfaces are designed to be more user-friendly. In this study, we evaluated the potential and usefulness of AI chatbots, specifically ChatGPT, in pre-operative patient counseling for HNC procedures. We hypothesize that AI chatbots can provide timely and accurate information on surgical indications, risks, and recovery for various head and neck procedures.

METHODS

The following study consisted of surveying head and neck surgeons at our institution, and it was exempted from the Institutional Review Board at the University of Kansas Medical Center.

ChatGPT Bot

ChatGPT (https://chat.openai.com/chat) was used to generate presurgical patient educational information for five common head and neck surgeries: glossectomy, parotidectomy, neck dissection, total laryngectomy, and free tissue transfer for head and neck reconstruction. Surgical information, including indications, risks, and recovery time, was queried separately. The initial query prompt was “explain the [surgical indications/risks and complications/recovery time] for [procedure]”. In addition, prompt modifiers such as “explain… to a medical professional/to a 6th grader” were added to examine the ability of ChatGPT to generate differential responses based on user input. Responses were recorded and used to construct individual excerpts for each procedure listed above for analysis and review (see supplemental material S1 for an example). For comparison, the most widely used search engine, Google, was used to search directly for patient educational information for each procedure. Key terms such as “total laryngectomy”, “parotidectomy”, “neck dissection”, “glossectomy”, and “free tissue transfer for the head and neck” were used. Surgical educational information from the first publicly available website, excluding any peer-reviewed scholarly articles, was extracted and formatted similarly to its corresponding ChatGPT counterpart; this source was labeled “Web”. Three of the five websites were from R1 research institutions (e.g., Cleveland Clinic), while the other two were from other tertiary centers that also routinely perform these procedures. Together, four distinct groups were created: (1) Web, (2) ChatGPT-Native, (3) ChatGPT-Professional, and (4) ChatGPT-6th grade. To ensure uniformity, the lead author (J.C.L.) independently extracted all information, which also kept the reviewing head and neck surgeons blinded to the source and prevented bias in downstream analysis.
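The prompt construction described above can be sketched as a simple template. The procedure, topic, and audience strings come from the study; the `build_prompt` helper itself is a hypothetical illustration, not code the authors report using:

```python
# Sketch of the query templates described in the methods.
# build_prompt is a hypothetical helper, not part of the study.
PROCEDURES = [
    "glossectomy",
    "parotidectomy",
    "neck dissection",
    "total laryngectomy",
    "free tissue transfer for head and neck reconstruction",
]
TOPICS = ["surgical indications", "risks and complications", "recovery time"]
AUDIENCES = {
    "Native": "",                           # no modifier
    "Professional": " to a medical professional",
    "6th grade": " to a 6th grader",
}

def build_prompt(topic: str, procedure: str, audience: str = "Native") -> str:
    """Assemble a query of the form used in the study."""
    return f"explain the {topic} for {procedure}{AUDIENCES[audience]}"
```

For example, `build_prompt("risks and complications", "total laryngectomy", "6th grade")` yields “explain the risks and complications for total laryngectomy to a 6th grader”, matching the prompt-modifier pattern quoted in the text.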

Readability Scoring

To assess readability, the Flesch Reading Ease Score13 and the Gunning-Fog Index,14 two widely used readability scales, were used to calculate the reading difficulty of each generated excerpt. The Flesch Reading Ease Score is a 100-point scale on which higher scores indicate easier comprehension; a score of 60 to 70 corresponds roughly to an 8th- to 9th-grade reading comprehension level.15 The Gunning-Fog Index corresponds to US academic grade reading comprehension level.16 It is calculated with the formula 0.4 × [(words/sentences) + 100 × (complex words/total words)], where complex words are defined as words containing three or more syllables.16 Unlike the Flesch Reading Ease Score, lower scores on the Gunning-Fog Index indicate easier comprehension. Word count was also recorded from the raw, unblinded output of each source.
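The Gunning-Fog formula above can be sketched directly in code. This is a minimal illustration, not the tool used in the study; in particular, the vowel-group syllable heuristic is an assumption, and production readability calculators use more careful syllable and sentence rules:

```python
import re

def gunning_fog(text: str) -> float:
    """Gunning-Fog index per the formula in the text:
    0.4 * [(words/sentences) + 100 * (complex words/total words)].
    Complex words (3+ syllables) are found with a naive vowel-group
    heuristic -- an assumption for illustration only."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)

    def syllables(word: str) -> int:
        # Count runs of vowels as syllables; at least one per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

A short text with no complex words, such as “The cat sat. The dog ran.” (6 words, 2 sentences), scores 0.4 × (3 + 0) = 1.2, i.e., very easy reading.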

Assessment by Head and Neck Surgeon

Sample excerpts were delabeled and blinded for independent review by five head and neck surgeons. Only ChatGPT-6th grade excerpts were used for direct side-by-side comparison with “Web” excerpts during the review process. Author J.C.L. independently extracted each excerpt without altering its content. Excerpts were standardized to bullet-point format to help with blinding; for instance, all ChatGPT-generated output included bullet points while some web sources did not. Additionally, filler phrases in ChatGPT-generated output, such as “it is important to discuss risks and benefits with your health care professional”, were removed without changing the content of the excerpt. All bullet points were also reformatted as complete sentences. Each respondent was then asked to grade each excerpt on a scale of 1 to 10 for accuracy and thoroughness, with 1 being not very accurate or thorough and 10 being very accurate or thorough. Each respondent was also asked to record the number of medical errors. Finally, respondents were asked to choose their preferred excerpt for each procedure.

Intraclass Correlation Coefficient

To determine inter-rater reliability for more than two raters, an intraclass correlation coefficient was calculated as described by Koo and Li.17 A score of less than 0.5 indicates poor reliability, a score between 0.5 and 0.75 indicates moderate reliability, a score between 0.75 and 0.9 indicates good reliability, and a score greater than 0.9 indicates excellent reliability.
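For illustration, the calculation can be sketched as follows. The paper does not state which ICC form was computed, so this sketch assumes the two-way random-effects, single-rater form ICC(2,1) from the Koo and Li framework; the interpretation buckets follow the thresholds quoted above:

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random-effects, single-rater agreement for an
    n-subjects x k-raters matrix. The specific ICC form is an assumption
    for illustration; the paper does not specify which form was used."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # raters
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def interpret_icc(icc: float) -> str:
    """Reliability buckets of Koo and Li, as quoted in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"
```

Under this scheme, the study's coefficient of 0.85 falls in the “good” band, consistent with the Results section.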

Statistical Analysis

Analysis was performed in GraphPad Prism using one-way ANOVA, multiple t-tests, the Wilcoxon-Mann-Whitney U test, or the McNemar test, where appropriate. Statistical significance was defined a priori as a p-value less than 0.05.

RESULTS

Four distinct types of educational information were generated for each of the five common head and neck surgeries: (1) Web, (2) ChatGPT-Native, (3) ChatGPT-Professional, and (4) ChatGPT-6th grade. Total word count, Flesch Reading Ease Score, and Gunning-Fog Index were then compared using one-way ANOVA. The total word count of unedited excerpts from publicly available websites was significantly lower than that of excerpts generated by ChatGPT, regardless of prompt (Fig. 1A, p < 0.01). “Web” and “ChatGPT-6th grade” had very similar Flesch Reading Ease Scores (Fig. 1B). Furthermore, both scores were significantly higher than those of “ChatGPT-Native” and “ChatGPT-Professional,” suggesting that “Web” and “ChatGPT-6th grade” are easier to read (Fig. 1B, p < 0.05). This was confirmed by a separate readability scale, the Gunning-Fog Index (Fig. 1C, p < 0.01).

Fig. 1.

Fig. 1.

Comparison of ChatGPT-generated texts to freely available websites. (A) ChatGPT-generated texts are significantly wordier. (B and C) Addition of prompt modifiers, such as “explain… using 6th grade reading comprehension,” resulted in improved readability. (D) ChatGPT-generated texts have similar content of knowledge regardless of prompt modifiers. ns, not significant with p > 0.05; * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001.

Next, we were interested in whether input modifiers would affect the response from ChatGPT; for instance, would “explain… to a medical professional” contain additional critical information compared with “explain… at a 6th-grade level”? We specifically investigated the risks-and-complications output for each group to determine whether different prompts affected the level of detail, reasoning that procedure-specific risks and complications would offer the most direct and in-depth differentiation of content of knowledge. To achieve this, a list of procedure-specific risks was initially extracted from the National Institutes of Health StatPearls for each procedure and then reviewed and approved by a senior head and neck surgeon (supplemental material 2). We then determined the percentage of procedure-specific risks discussed by each output. On average, web resources covered 54.3% of the procedure-specific risks, while ChatGPT-Native, ChatGPT-Professional, and ChatGPT-6th grade covered 42.9%, 40.0%, and 40.0%, respectively (Fig. 1D, p = 0.23), suggesting that prompt modifiers did not affect the content of knowledge discussed by the chatbot. Importantly, “ChatGPT-6th grade” and “Web” showed similar content of knowledge as well as readability, prompting us to select them for further analysis in the following section.
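The coverage metric described above was computed by manual review; a naive automated version could look like the following sketch. The substring-matching approach and the example risk list are illustrative assumptions, not the study's method:

```python
def risk_coverage(excerpt: str, risks: list) -> float:
    """Fraction of procedure-specific risks mentioned in an excerpt.
    Simple case-insensitive substring matching -- the study used manual
    review, so this is only an illustrative automation; it would miss
    paraphrased risks (e.g. 'gustatory sweating' for 'Frey syndrome')."""
    text = excerpt.lower()
    hits = sum(1 for risk in risks if risk.lower() in text)
    return hits / len(risks)

# Hypothetical example using well-known parotidectomy risks:
parotidectomy_risks = ["facial nerve injury", "sialocele", "Frey syndrome"]
```

For example, an excerpt mentioning two of those three risks would score 2/3 ≈ 66.7%, analogous to the per-source percentages reported in Fig. 1D.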

The comprehensive “Web” and “ChatGPT-6th grade” excerpts, which included indications, risks, and recovery, were then selected for independent review by five head and neck surgeons. The review was blinded so that each surgeon was unaware of whether surgical information was taken from publicly available websites or generated by ChatGPT; for this purpose, the “Web” and “ChatGPT-6th grade” excerpts were edited as outlined in the methods section. Four parameters were assessed: accuracy, thoroughness, number of medical errors, and surgeon preference. We found that surgical information generated by ChatGPT had medical accuracy similar to that of publicly available websites (mean of 7.7 and 8.1, respectively; p = 0.42; Fig. 2A) and similar thoroughness (mean of 7.5 and 7.3, respectively; p = 0.76; Fig. 2B). Additionally, both groups had limited numbers of medical errors (mean of 1.2 and 1.0, respectively; p = 0.30; Fig. 2C). To determine inter-rater reliability among the individual head and neck surgeons, an intraclass correlation coefficient was calculated from each surgeon's composite scoring. The coefficient was 0.85, indicating good inter-rater reliability. Each surgeon was then asked to select a personal preference between the two outputs; overall, there was a nearly even split between “Web” (52%) and “ChatGPT-6th grade” (48%). Finally, three of the five “Web” excerpts were extracted from R1 research institution websites; there was no difference in surgeon preference for these specific excerpts over ChatGPT-generated responses by the McNemar test (p = 0.68, odds ratio 0.5).

Fig. 2.

Fig. 2.

Assessment of medical accuracy, thoroughness, and number of medical errors by head and neck surgeons. ChatGPT-generated texts scored similarly to publicly available websites in terms of accuracy (A), thoroughness (B), and number of medical errors (C). ns, not significant with p > 0.05; * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001.

DISCUSSION

Many HNC patients report inadequate pre-operative counseling about treatment side effects and long-term impact on quality of life.3

Here we demonstrated the feasibility of using AI chatbots, specifically ChatGPT, to provide pre-surgical information for common head and neck procedures. Our results showed that ChatGPT could provide generally accurate and thorough surgical information without significant medical errors. In addition, ChatGPT-generated responses were rated similarly to available online resources by experienced head and neck surgeons. Importantly, we showed that ChatGPT prompts could be tailored to explain relatively complex medical knowledge in simple terms. Together, these findings suggest that the use of AI chatbots in pre-operative counseling may help address critical barriers in HNC patient education and counseling.

We found that general operative risks such as pain, bleeding, infection, and damage to nearby structures were well discussed by chatbots. We further measured depth of knowledge by investigating procedure-specific risks and complications and found that both ChatGPT responses and online resources presented surgical information with similar content of knowledge.

Many prior studies have demonstrated a positive correlation between multimedia use in pre-operative patient counseling and overall patient satisfaction and learning.18,19 Admittedly, an AI chatbot may fall short when compared with an intentionally designed, high-quality educational video. However, AI chatbots have two key advantages. First and foremost is their dynamic and interactive nature: patients can freely direct questions according to their own needs. Second, the flexibility of a chatbot is superior to the linear nature of a video. To apply AI chatbots in a clinical setting, clinicians could consider integrating private, dedicated computer stations for patients to use after the clinic visit or at home. Voice-integrated software extensions such as “VoiceGPT” could be used to obtain voice input and output from the chatbot, facilitating a truly conversation-like interaction with AI. Alternatively, mobile applications with chatbot voice integration, such as ChatGPT Voice, could be used by more technically adept HNC patients. It should also be noted that AI chatbots are becoming more advanced, and picture and video editing are now possible using chatbots. It is foreseeable that a “multimedia” chatbot that incorporates images and/or videos while responding to a patient prompt could serve as a powerful tool for patient education. Nevertheless, the impact of pre-operative education using AI chatbots on patient quality of life and satisfaction remains to be investigated.

Our study is limited by surveying only head and neck surgeons from a single institution and by lacking assessment of patients' perspectives. Moreover, in practice, employing an AI chatbot in the clinical setting may prove difficult. For instance, AI chatbots sometimes display “artificial hallucination,” in which the AI appears to confidently provide erroneous information.20 Such misinformation could be dangerous or, at worst, life-threatening. The legality of using AI in medicine is just beginning to be scrutinized:21 should an adverse event arise, who is ultimately responsible for the misrepresentation? Another shortcoming of our methodology is that only a single web resource was used for evaluation per prompt, and it is conceivable that some web resources are much more detailed than others; we chose the first publicly available website to best mimic accessibility from a patient's point of view. Another potential drawback of using AI is content bias and the quality of input data.22,23 Indeed, references cited by ChatGPT are often made up and untraceable when queried.20 This is because the language platform was designed to predict the most likely text output without truly understanding the semantics or truthfulness of the information.24

At present, chatbots should not be recommended as sources of educational information for patients undergoing head and neck surgery. Importantly, chatbots should not be viewed as a replacement for quality face-to-face time with surgeons. Instead, AI chatbot may serve as a supplemental tool that would augment patients’ understanding of the nature of the surgery. Nevertheless, AI chatbot technology is in its infancy. With the increasing popularity of AI chatbots, many technology companies are racing to develop the next generation of “smarter” chatbot platform. With further advancement and refinement in AI technology, we may soon witness a new era of hybrid medical care where physicians work synergistically with AI in both the clinical and surgical settings to enhance health care delivery for HNC patients and beyond.

CONCLUSION

Presurgical information generated by chatbots was comparable to Google searches in terms of accuracy, thoroughness, and errors when assessed by experienced head and neck surgeons. Furthermore, input modifiers produced significant changes on the readability scale without omitting content of knowledge. Future studies should investigate patients' familiarity with and ease of access to chatbots, and how supplementing pre-operative education with chatbots affects quality of life in HNC patients.

Supplementary Material

Supplemental 1
Supplemental 2

Additional supporting information may be found in the online version of this article.

Footnotes

The authors have no funding, financial relationships, or conflicts of interest to disclose.

BIBLIOGRAPHY

1. Husson O, Mols F, van de Poll-Franse LV. The relation between information provision and health-related quality of life, anxiety and depression among cancer survivors: a systematic review. Ann Oncol. 2011;22(4):761–772. doi: 10.1093/annonc/mdq413.
2. Harrison JD, Young JM, Price MA, Butow PN, Solomon MJ. What are the unmet supportive care needs of people with cancer? A systematic review. Support Care Cancer. 2009;17(8):1117–1128. doi: 10.1007/s00520-009-0615-5.
3. Llewellyn CD, McGurk M, Weinman J. How satisfied are head and neck cancer (HNC) patients with the information they receive pre-treatment? Results from the satisfaction with cancer information profile (SCIP). Oral Oncol. 2006;42(7):726–734. doi: 10.1016/j.oraloncology.2005.11.013.
4. Nallani R, Smith JB, Penn JP, et al. Decision regret 3 and 6 months after treatment for head and neck cancer: observational study of associations with clinicodemographics, anxiety, and quality of life. Head Neck. 2022;44(1):59–70. doi: 10.1002/hed.26911.
5. Amanian A, Heffernan A, Ishii M, Creighton FX, Thamboo A. The evolution and application of artificial intelligence in rhinology: a state of the art review. Otolaryngol Head Neck Surg. 2022;169(1):21–30. doi: 10.1177/01945998221110076.
6. You E, Lin V, Mijovic T, Eskander A, Crowson MG. Artificial intelligence applications in otology: a state of the art review. Otolaryngol Head Neck Surg. 2020;163(6):1123–1133. doi: 10.1177/0194599820931804.
7. Hu HC, Chang SY, Wang CH, et al. Deep learning application for vocal fold disease prediction through voice recognition: preliminary development study. J Med Internet Res. 2021;23(6):e25247. doi: 10.2196/25247.
8. Mahmood H, Shaban M, Rajpoot N, Khurram SA. Artificial intelligence-based methods in head and neck cancer diagnosis: an overview. Br J Cancer. 2021;124(12):1934–1940. doi: 10.1038/s41416-021-01386-x.
9. Oliver JR, Karadaghy OA, Fassas SN, Arambula Z, Bur AM. Machine learning directed sentinel lymph node biopsy in cutaneous head and neck melanoma. Head Neck. 2022;44(4):975–988. doi: 10.1002/hed.26993.
10. Farrokhian N, Holcomb AJ, Dimon E, et al. Development and validation of machine learning models for predicting occult nodal metastasis in early-stage oral cavity squamous cell carcinoma. JAMA Netw Open. 2022;5(4):e227226. doi: 10.1001/jamanetworkopen.2022.7226.
11. OpenAI. GPT-4 Technical Report. arXiv. 2023;abs/2303.08774.
12. Radford A, Narasimhan K. Improving Language Understanding by Generative Pre-Training. 2018.
13. Kincaid P, Fishburne RP, Rogers RL, Chissom BS. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. 1975.
14. Gunning R. The Technique of Clear Writing. McGraw-Hill; 1968.
15. Flesch R. A new readability yardstick. J Appl Psychol. 1948;32(3):221–233. doi: 10.1037/h0057532.
16. Świeczkowski D, Kułacz S. The use of the Gunning fog index to evaluate the readability of Polish and English drug leaflets in the context of health literacy challenges in medical linguistics: an exploratory study. Cardiol J. 2021;28(4):627–631. doi: 10.5603/CJ.a2020.0142.
17. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–163. doi: 10.1016/j.jcm.2016.02.012.
18. Wisely CE, Robbins CB, Stinnett S, Kim T, Vann RR, Gupta PK. Impact of preoperative video education for cataract surgery on patient learning outcomes. Clin Ophthalmol. 2020;14:1365–1371. doi: 10.2147/OPTH.S248080.
19. Turkdogan S, Roy CF, Chartier G, et al. Effect of perioperative patient education via animated videos in patients undergoing head and neck surgery: a randomized clinical trial. JAMA Otolaryngol Head Neck Surg. 2022;148(2):173–179. doi: 10.1001/jamaoto.2021.3765.
20. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15(2):e35179. doi: 10.7759/cureus.35179.
21. Naik N, Hameed BMZ, Shetty DK, et al. Legal and ethical consideration in artificial intelligence in healthcare: who takes responsibility? Front Surg. 2022;9:862322. doi: 10.3389/fsurg.2022.862322.
22. Dahmen J, Kayaalp ME, Ollivier M, et al. Artificial intelligence bot ChatGPT in medical research: the potential game changer as a double-edged sword. Knee Surg Sports Traumatol Arthrosc. 2023;31:1187–1189. doi: 10.1007/s00167-023-07355-6.
23. Graf A, Bernardi RE. ChatGPT in research: balancing ethics, transparency and advancement. Neuroscience. 2023;515:71–73. doi: 10.1016/j.neuroscience.2023.02.008.
24. Floridi L, Chiriatti M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 2020;30(4):681–694. doi: 10.1007/s11023-020-09548-1.
