Journal of Clinical Sleep Medicine (JCSM), Official Publication of the American Academy of Sleep Medicine. 2023 Dec 1;19(12):1989–1995. doi: 10.5664/jcsm.10728

Evaluating ChatGPT responses on obstructive sleep apnea for patient education

Daniel J Campbell 1, Leonard E Estephan 1, Eric V Mastrolonardo 1, Dev R Amin 1, Colin T Huntley 1, Maurits S Boon 1
PMCID: PMC10692937  PMID: 37485676

Abstract

Study Objectives:

We evaluated the quality of ChatGPT responses to questions on obstructive sleep apnea for patient education and assessed how prompting the chatbot influences correctness, estimated grade level, and references of answers.

Methods:

ChatGPT was queried 4 times with 24 identical questions. Queries differed by initial prompting: no prompting, patient-friendly prompting, physician-level prompting, and prompting for statistics/references. Answers were scored on a hierarchical scale: incorrect, partially correct, correct, correct with either statistic or referenced citation (“correct+”), or correct with both a statistic and citation (“perfect”). Flesch–Kincaid grade level and citation publication years were recorded for answers. Proportions of responses at incremental score thresholds were compared by prompt type using chi-squared analysis. The relationship between prompt type and grade level was assessed using analysis of variance.

Results:

Across all prompts (n = 96 questions), 69 answers (71.9%) were at least correct. Proportions of responses that were at least partially correct (P = .387) or correct (P = .453) did not differ by prompt; responses that were at least correct+ (P < .001) or perfect (P < .001) did. Statistics/references prompting provided 74/77 (96.1%) references. Responses from patient-friendly prompting had a lower mean grade level (12.45 ± 2.32) than no prompting (14.15 ± 1.59), physician-level prompting (14.27 ± 2.09), and statistics/references prompting (15.00 ± 2.26) (P < .0001).

Conclusions:

ChatGPT overall provides appropriate answers to most questions on obstructive sleep apnea regardless of prompting. While prompting decreases response grade level, all responses remained above accepted recommendations for presenting medical information to patients. Given ChatGPT’s rapid implementation, sleep experts may seek to further scrutinize its medical literacy and utility for patients.

Citation:

Campbell DJ, Estephan LE, Mastrolonardo EV, Amin DR, Huntley CT, Boon MS. Evaluating ChatGPT responses on obstructive sleep apnea for patient education. J Clin Sleep Med. 2023;19(12):1989–1995.

Keywords: obstructive sleep apnea, artificial intelligence, patient education, sleep surgery


BRIEF SUMMARY

Current Knowledge/Study Rationale: Obstructive sleep apnea is a highly prevalent sleep disorder linked to significant cardiovascular and neuropsychiatric complications, and education is a key component in identifying patients at risk and increasing treatment adherence. ChatGPT, an artificial intelligence chatbot, is the fastest-growing consumer application in history, and given that patients are increasingly using online resources for medical information, an evaluation of the chatbot’s responses to questions regarding obstructive sleep apnea is essential.

Study Impact: ChatGPT overall provides appropriate answers to most questions on obstructive sleep apnea patient education; however, initial chatbot prompting appears to influence the grade level of its answers and its ability to provide referenced citations. Given ChatGPT’s rapid implementation, sleep experts may seek to further investigate its medical literacy and utility to patients.

INTRODUCTION

Obstructive sleep apnea (OSA) is a common sleep disorder affecting approximately 15–30% of adult males and 5–15% of adult females in the United States.1 Proper management of OSA is crucial because it is linked to serious cardiovascular and neuropsychiatric complications, such as hypertension, coronary heart disease, stroke, and depression.2,3 Patient education is a key component in identifying patients at risk for OSA and increasing adherence to prescribed management.4 While patient education was historically delivered primarily by the physician, patients are increasingly turning to online resources for self-education. According to the National Cancer Institute’s Health Information National Trends Survey (HINTS), 84.1% of the US adult population used the Internet to look for health or medical information in 2022.5

ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI that can respond to complex queries in an interactive, conversation-based format. Released to the public in November 2022, it is the fastest-growing consumer application in history, having surpassed 100 million users by January 2023.6 ChatGPT 3.5 (GPT-3.5) is the chatbot model currently free and available to the public.7 Limited recent literature found that GPT-3.5 provided largely appropriate answers to common questions about cardiovascular disease prevention.8 Further, select health systems are already trialing the integration of ChatGPT into their electronic medical records to create draft responses when communicating asynchronously with patients.9

Given the increasing use of online resources by patients and the rapid implementation of ChatGPT worldwide, this study seeks to evaluate the quality of GPT-3.5 responses to common questions about OSA for patient education. The specific objectives of this investigation are (1) to evaluate the quality and correctness of ChatGPT responses to questions regarding OSA patient education and (2) to determine whether initial chatbot prompting influences the correctness, grade level, or included references or statistics of its answers. This investigation into the interplay between ChatGPT and patient education marks the first of its kind in otolaryngologic and sleep medicine literature. It is particularly clinically relevant, because previous research has found highly accessed online patient education materials on YouTube, the second-most-visited website globally, to be markedly unreliable and inconsistent.10–12 As patients and health care systems incrementally move toward AI integration, it is crucial to critically analyze the answers that highly used AI models will provide to patients.

METHODS

ChatGPT was queried 4 times with a set of 24 identical questions pertaining to OSA patient education. Each of the four 24-question queries generated a unique question-and-answer thread on GPT-3.5 (March 23, 2023, version), which is free and available to the public. Each thread differed by how ChatGPT was prompted prior to asking the first question: no prompting, patient-friendly prompting, physician-level prompting, and prompting for statistics and references. The specific prompts used are in Table 1. Questions were divided into the following 5 categories: epidemiology (1–3), diagnosis (4–9), complications (10–12), nonsurgical management (13–18), and surgical management (19–24). Questions were posed in an order intended to simulate a logical flow of conversation, because ChatGPT’s answers build upon previous question/answer queries.

Table 1.

ChatGPT prompt names and prompt provided.

Prompt Number Prompt Name ChatGPT Prompt Provided
1 No prompting No prompting.
2 Patient-friendly prompting I am a patient attempting to learn more about obstructive sleep apnea. I am going to ask you 24 questions pertaining to obstructive sleep apnea. Please use language that would be appropriate for my understanding, but do not compromise on the accuracy of your responses. Be as specific as possible in your answers.
3 Physician-level prompting I am a board-certified physician attempting to learn the most up to date information on obstructive sleep apnea. I am going to ask you 24 questions pertaining to obstructive sleep apnea. Please use language that would be appropriate for my expert-level understanding of medical concepts. Be as specific as possible in your answers.
4 Statistics/references prompting I am going to ask you 24 questions pertaining to obstructive sleep apnea. For each answer you provide, make sure that you include statistics, numbers, or calculations that are relevant. Your answers should come from published medical literature, which you should cite within your answers.

“Prompts” were placed immediately before the first question posed to ChatGPT. Each prompt was only entered once.
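The study queried the free ChatGPT web interface directly, entering each prompt in Table 1 once at the start of its thread and then posing the 24 questions in sequence. As a minimal sketch of how a single prompted thread could be replicated programmatically (assumptions: the OpenAI Python SDK and the gpt-3.5-turbo API model are used as stand-ins for the web interface, and the question wording shown is hypothetical rather than the study's actual question set):

```python
# Minimal sketch of one prompted question-and-answer thread.
# Assumptions: OpenAI Python SDK (v1.x) with OPENAI_API_KEY set in the environment,
# and gpt-3.5-turbo as a stand-in for the public ChatGPT web interface.
from openai import OpenAI

client = OpenAI()

PATIENT_FRIENDLY_PROMPT = (
    "I am a patient attempting to learn more about obstructive sleep apnea. "
    "I am going to ask you 24 questions pertaining to obstructive sleep apnea. "
    "Please use language that would be appropriate for my understanding, but do not "
    "compromise on the accuracy of your responses. Be as specific as possible in your answers."
)

# Hypothetical question wording; the study's 24 questions are not reproduced here.
questions = [
    "What is obstructive sleep apnea?",
    "How common is obstructive sleep apnea?",
    # ... remaining questions ...
]

# The prompt is entered once at the start of the thread; every question and answer is
# kept in the running message history so later answers can build on earlier ones.
messages = [{"role": "user", "content": PATIENT_FRIENDLY_PROMPT}]
for question in questions:
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"Q: {question}\nA: {answer}\n")
```

Carrying the full message history in each request mirrors the study design, in which answers could build upon previous question/answer pairs within the same thread.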

Answers were reviewed and graded for medical accuracy and clinical appropriateness using the following 5-part hierarchical scale: incorrect, partially correct, correct, correct with either a statistic or citation referenced (“correct+”), or correct with both a statistic and citation referenced (“perfect”). Partially correct answers were further subcategorized by the reason for the grade: too vague or incomplete. Responses were assessed by 2 otolaryngology resident physicians under the guidance of 2 experienced academic sleep surgeons. Existing peer-reviewed literature served as the basis for determining answer correctness. In the event of a grading discrepancy, a third author adjudicated the grade. For each answer, the numbers of words, sentences, and syllables were collected to compute a Flesch–Kincaid (FK) grade level. The FK grade level corresponds to the estimated US grade level at which the information is presented (eg, 10 is high school level and 14 is collegiate level).13 The total number and publication years of citations referenced in each ChatGPT answer were recorded.
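The FK grade level is computed from those three counts using the published Flesch–Kincaid grade-level formula. A minimal sketch of that calculation follows; the example counts are hypothetical rather than taken from a study answer.

```python
def fk_grade_level(words: int, sentences: int, syllables: int) -> float:
    """Published Flesch-Kincaid grade-level formula:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a single answer: 232 words, 12 sentences, 380 syllables.
print(round(fk_grade_level(words=232, sentences=12, syllables=380), 2))  # -> 11.28
```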

Analyses of graded responses were performed to assess whether initial ChatGPT prompting was related to scoring outcome. Using 2-sided chi-squared tests, proportions of query responses either satisfying or not satisfying the following thresholds were compared by prompt type: (1) at least partially correct, (2) at least correct, (3) at least correct+, and (4) perfect. Tests were performed with alpha at 0.05, and a post hoc Bonferroni correction was applied across all possible pairwise comparisons. One-way matched analysis of variance tests were performed to assess the relationship of prompt type and question category to word count and FK grade level. Statistical analyses were conducted using IBM SPSS Version 28.0.1.0 (IBM, Armonk, New York), and a P value <.05 indicated statistical significance.
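As a minimal sketch of these comparisons outside SPSS (assuming Python with SciPy; the one-way ANOVA shown treats groups as independent and therefore only approximates the matched design used in the study), the chi-squared example below uses the "at least correct" counts reported in Table 2 and reproduces the corresponding P value:

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway

# 2 x 4 contingency table for the "at least correct" threshold (counts from Table 2):
# rows = meets threshold (yes/no), columns = the 4 prompt types.
at_least_correct = np.array([
    [16, 15, 19, 19],  # yes
    [8, 9, 5, 5],      # no
])
chi2, p, dof, expected = chi2_contingency(at_least_correct)
print(f"chi-squared = {chi2:.2f}, df = {dof}, P = {p:.3f}")  # P is approximately .45

# Bonferroni post hoc testing would repeat this on each 2 x 2 pair of prompt columns,
# dividing the alpha level by the number of pairwise comparisons (6).

# Unmatched approximation of the ANOVA of FK grade level by prompt type, where each
# list would hold the 24 per-answer grade levels for one prompt:
# f_stat, p_anova = f_oneway(fk_none, fk_patient, fk_physician, fk_stats)
```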

RESULTS

A representation of summative grading is presented in Figure 1. Across all prompt types (n = 96 questions), scoring frequencies were 1 (1.0%) incorrect, 26 (27.1%) partially correct, 29 (30.2%) correct, 27 (28.1%) correct+, and 13 (13.5%) perfect. Among the 26 partially correct answers, 25 (96%) were “incomplete” and 1 (4%) was “too vague.” Sixty-nine (71.9%) answers were at least correct (ie, correct, correct+, or perfect).

Figure 1. Colormap representation of graded ChatGPT responses by prompt type. CPAP = continuous positive airway pressure, OSA = obstructive sleep apnea, UPPP = uvulopalatopharyngoplasty.

On chi-squared analysis, proportions of responses that were at least partially correct (P = .387) or at least correct (P = .453) did not differ significantly by prompt type (Table 2). Proportions of responses that were at least correct+ (P < .001) or perfect (P < .001) did differ significantly by prompt type. Within the chi-squared test using the correct+ threshold (Table 2, row 2), grading proportions differed significantly specifically for patient-friendly prompting and statistics/references prompting on Bonferroni post hoc analysis (P < .05).

Table 2.

Chi-squared tests by prompt type and varying thresholds of response grade.

Column order: No Prompting | Patient-Friendly Prompting | Physician-Level Prompting | Statistics and References Prompting. Values are frequency (%).
Correct with both statistic and reference, P < .0001
 Yes: 1 (4.2) | 0 (0) | 0 (0) | 12 (50.0)
 No: 23 (95.8) | 24 (100) | 24 (100) | 12 (50.0)
At least correct with either statistic or reference, P < .0001
 Yes: 7 (29.2) | 4 (16.7)* | 10 (41.7) | 19 (79.2)*
 No: 17 (70.8) | 20 (83.3) | 14 (58.3) | 5 (20.8)
At least correct, P = .453
 Yes: 16 (66.7) | 15 (62.5) | 19 (79.2) | 19 (79.2)
 No: 8 (33.3) | 9 (37.5) | 5 (20.8) | 5 (20.8)
At least partially correct, P = .387
 Yes: 23 (95.8) | 24 (100) | 24 (100) | 24 (100)
 No: 1 (4.2) | 0 (0) | 0 (0) | 0 (0)
Frequencies are presented with the percentage of the total number (n = 24) of questions for a given prompt type. Asterisk (*) denotes column proportions that differ significantly from each other at the .05 level on Bonferroni post hoc analysis.

Mean FK grade level by prompt type was 14.15 ± 1.59 (no prompt), 12.45 ± 2.32 (patient-friendly prompting), 14.27 ± 2.09 (physician-level prompting), and 15.00 ± 2.26 (statistics/references prompting). There was a significant association between prompt type and FK grade level (P < .0001). Specifically, patient-friendly prompting led to a significantly lower grade level compared to all other prompts (Figure 2). Mean word count per answer by prompt type was 232.0 ± 28.3 (no prompt), 195.7 ± 63.77 (patient-friendly prompting), 188.7 ± 57.24 (physician-level prompting), and 175.6 ± 61.0 (statistics/references prompting). There was also a significant association between prompt type and word count (P < .0001) on analysis of variance. Specifically, no prompting had a significantly larger word count than all other prompts (Figure 2).

Figure 2. Analysis of variance assessing prompt type by grade level and word count. **P ≤ .01, ***P ≤ .001, ****P ≤ .0001.

Mean FK grade level by question category was 11.25 ± 1.48 (epidemiology), 13.89 ± 2.23 (diagnosis), 14.74 ± 1.75 (complications), 14.65 ± 2.37 (nonsurgical management), and 14.35 ± 1.75 (surgical management). There was a significant association between question category and FK grade level (P < .001). Specifically, epidemiology questions had a significantly lower grade level compared to all other question categories (Figure 3). Mean word count per answer by question category was 113.5 ± 54.1 (epidemiology), 196.5 ± 43.1 (diagnosis), 201.3 ± 33.6 (complications), 247.3 ± 46.7 (nonsurgical management), and 190.83 ± 54.1 (surgical management). There was also a significant association between question category and word count (P < .001). Specifically, epidemiology questions had a significantly lower word count than all other question categories; additionally, nonsurgical management questions had a significantly larger word count than that of diagnosis- and surgical management–related questions (Figure 3).

Figure 3. Analysis of variance assessing question category by grade level and word count. **P ≤ .01, ***P ≤ .001, ****P ≤ .0001. Mgmt = management.

Across all prompts (n = 96 questions), ChatGPT provided 77 total references for its answers: 72 (93.5%) from published medical literature, 4 (5.2%) from academic medical organization websites (ie, the American Academy of Sleep Medicine), and 1 (1.3%) from a medical device website (Inspire Medical Systems). Statistics/references prompting provided 74 (96.1%) of the total references, followed by no prompting (2 references, 2.6%), physician-level prompting (1 reference, 1.3%), and patient-friendly prompting (0 references). Of the 72 published medical literature references, the median year of publication was 2013 (range: 1991–2021) (Figure 4).

Figure 4. Frequency distribution of references from ChatGPT by publication year. A total of 72 references from published medical literature are included.

References provided specifically in support of incorrect and partially correct responses were further analyzed. Across all prompts (n = 96 questions), the single incorrect answer did not provide a reference (Figure 1). Of the 26 partially correct answers, 5 provided references, all of which were generated under statistics/references prompting. Published medical literature made up the majority (15/16, 93.8%) of the references across these 5 answers; the remaining reference (1/16, 6.3%) was from an academic medical organization website.

DISCUSSION

This investigation studied the quality of AI-generated chatbot responses to questions pertaining to OSA patient education. GPT-3.5 overall provided appropriate answers to most posed questions, and 71.9% of all answers were at least correct. Moreover, on chi-squared analysis the proportion of answers that were at least correct did not differ significantly by initial prompt type (P = .453). Thus, patients querying ChatGPT are likely to obtain at least a clinically appropriate answer regardless of how, or whether, the chatbot is initially prompted. This is reassuring, because many first-time ChatGPT users may not be aware that the AI can respond differently based on prior prompting.

However, whether the chatbot supplies statistics or references does appear to be influenced by initial prompting. When the chatbot was specifically asked to provide relevant statistics and references for each answer (Table 1), a significantly higher proportion of its answers contained at least 1 statistic or reference (P < .001). Furthermore, 96.1% (74/77) of all references came specifically from statistics/references-prompted answers. Although adding a reference to an answer does not inherently increase its correctness, it does allow patients (and physicians) to verify the source and age of the information that the chatbot provided.

The majority (93.5%) of references ChatGPT provided were from published medical literature with an indexable digital object identifier. Although the publication year range spanned 30 years (1991–2021), over half of the studies were published in the last decade, indicating that the chatbot may have a preference for newer literature. Currently, GPT-3.5 is trained on data only up to September 2021; thus, it cannot cite or make inferences on newer literature.7 However, GPT-4, the most recent iteration of ChatGPT, which is available through a limited-use paid subscription, can access real-time Internet data with the use of “plug-ins.”7 Given the rapid user implementation of ChatGPT globally, this newer feature may provide an additional means for patients to access up-to-date medical literature.

The American Medical Association and the National Institutes of Health recommend that patient educational materials be written at the sixth- and eighth-grade reading level, respectively.14 Measured as FK grade level, the estimated US grade level of ChatGPT responses was well above the American Medical Association/National Institutes of Health recommendations. Although prompting for patient-friendly answers did result in a significantly lower grade level (P < .0001), this prompt’s mean FK grade level was still 12.45 ± 2.32, approaching a collegiate reading level. Moreover, additional analysis revealed that the grade level of epidemiology questions (11.25 ± 1.48) was significantly lower than that of other question categories. Although interesting, this finding is unlikely to be clinically significant, as epidemiologic information in general is less likely to use technical terminology and medical jargon than other question categories. Further investigations may seek to feed ChatGPT even more specific prompts, such as “provide your answer at the sixth-grade level,” to assess for improvements in readability.

Patients overall have demonstrated a clear interest in accessing medical information online over the past decade, with 84.1% of the US adult population endorsing such use in 2022.5 Currently at an estimated 1.16 billion users, ChatGPT is the fastest-growing online application in history, and it is very likely that a proportion of such patients will begin to use ChatGPT.15 YouTube is the second-most-visited website globally, and National Institutes of Health data suggest that 59% of Internet users watched health-related YouTube videos in 2022.5 Although the comparison is based on different grading schemas, GPT-3.5 appears to at least qualitatively outperform YouTube videos on OSA patient education and, when prompted, is more likely to provide references.11,16

When unsure about information, large language models (eg, ChatGPT) may confidently generate factually incorrect information, termed “AI hallucination.” While OpenAI suggests that GPT-4 reduces hallucinations by up to 40% relative to previous GPT-3.5 models, patients are currently more likely to use GPT-3.5 because its access is free and unrestricted.17 Although our analysis does suggest that GPT-3.5 provides overall adequate responses to basic OSA questions, the AI technology should be further scrutinized by medical professionals to ensure patients are indeed consuming accurate information on all related medical topics.

This investigation is not without limitations. Our initial set of 24 questions was created with the intention of concisely covering the breadth of OSA education; however, patients theoretically can ask an infinite number of questions, which may generate answers of varying correctness not otherwise reported in this investigation. Moreover, ChatGPT’s responses sequentially build off previous queries. As such, the order in which questions are posed may affect the quality of the chatbot’s answers. We attempted to ask questions that simulate a logical flow of conversation; however, we understand that patients may ask questions in any order, potentially generating varying responses. Furthermore, we did not ask the chatbot multiple-step questions, which theoretically could generate more precise answers due to ChatGPT’s ability to build off previous queries. Future investigations may seek to focus on the ideal way to ask questions to generate the most precise answers. Additionally, our qualitative scoring schema for ChatGPT’s answers is not validated, and future investigations may seek to implement more extensive, validated grading criteria.

DISCLOSURE STATEMENT

All authors have seen and approved the manuscript. Work for this study was performed at the Department of Otolaryngology – Head and Neck Surgery, Thomas Jefferson University Hospitals. The authors report no conflicts of interest.

ABBREVIATIONS

AI: artificial intelligence
FK: Flesch–Kincaid
GPT-3.5: ChatGPT 3.5
OSA: obstructive sleep apnea

REFERENCES

