Abstract
Objective
To evaluate the clinical applications and limitations of chat generative pretrained transformer (ChatGPT) in otolaryngology.
Study Design
Cross‐sectional survey.
Setting
Tertiary academic center.
Methods
ChatGPT 4.0 was queried for diagnoses and management plans for 20 physician‐written clinical vignettes in otolaryngology. Attending physicians were then asked to rate the difficulty of the clinical vignettes and agreement with the differential diagnoses and management plans of ChatGPT responses on a 5‐point Likert scale. Summary statistics were calculated. Univariate ordinal regression was then performed between vignette difficulty and quality of the diagnoses and management plans.
Results
Eleven attending physicians completed the survey (61% response rate). Overall, vignettes were rated as very easy to neutral difficulty (range of median score: 1.00‐4.00; overall median 2.00). There was a high agreement with the differential diagnosis provided by ChatGPT (range of median score: 3.00‐5.00; overall median: 5.00). There was also high agreement with treatment plans (range of median score: 3.00‐5.00; overall median: 5.00). There was no association between vignette difficulty and agreement with differential diagnosis or treatment. Lower diagnosis scores had greater odds of having lower treatment scores.
Conclusion
Generative artificial intelligence models like ChatGPT are being rapidly adopted in medicine. Performance with curated, easy‐to‐moderate difficulty otolaryngology scenarios indicates high agreement with physicians for diagnosis and management. However, a decreased quality in diagnosis is associated with decreased quality in management. Further research is necessary on ChatGPT's ability to handle unstructured clinical information.
Keywords: artificial intelligence, ChatGPT, large language model, otolaryngology
Artificial intelligence (AI) has rapidly made inroads across many industries, and health care is no exception, with applications ranging from diagnosis to drug development and treatment planning. Large language models (LLMs), colloquially known as chatbots, use natural language processing to generate human‐like conversation and have been widely described in popular media. 1 In medicine, it is crucial that we understand this technology as a tool used by clinicians as well as by the public. One of the most advanced publicly available chatbots, chat generative pretrained transformer (ChatGPT), was trained mainly on publicly available data up to September 2021 and has been shown to generate accurate, human‐like responses to a wide range of questions within seconds. 2 Within medicine, recent studies have shown that ChatGPT can pass the United States Medical Licensing Exam. 3 However, it is unclear how well ChatGPT can provide a diagnosis and treatment plan when presented with an open‐ended clinical situation. Similarly, little is known about how its responses compare with human evaluation.
Early evidence presents a mixed picture of the role of AI in diagnostics. One study found that AI models had sufficient accuracy in diagnosing and providing treatment plans and suggested the possibility of using AI as a tool in health care. 4 Another study, however, compared the diagnostic accuracy of an AI model with that of internal medicine physicians in diagnosing and treating common chief complaints and found that the AI model had significantly lower diagnostic and treatment accuracy than human physicians. 5
With public interest high, ChatGPT may become an increasingly utilized tool for both patients and clinicians. Its efficacy in diagnosis and treatment for specialty care, such as otolaryngology, has not been well evaluated. In this study, we assess the diagnostic accuracy and treatment plans of ChatGPT against the judgment of attending physicians using a series of otolaryngology clinical vignettes. The results of this study highlight a relatively high accuracy in both diagnosis and treatment for common otolaryngology pathology but suggest that treatment accuracy decreases when a poor list of differential diagnoses is generated.
Methods
Institutional Review Board (IRB) Determination
This study was exempt from review by the Loma Linda University Health IRB as it does not constitute human subject research.
ChatGPT Interface and Survey Generation
Twenty clinical vignettes were conceptualized and prepared by the authors of this paper. These represented clinical presentations across multiple specialties within otolaryngology with the intent of having varying degrees of difficulty. While these were uniquely created for this study primarily using clinical experience, materials such as otolaryngology textbooks and question banks were also utilized. The clinical presentations were designed to be similar to the type that would be encountered during in‐service tests and board examinations. These vignettes were then fed into ChatGPT 4.0 from April 26, 2023 to April 27, 2023 in 2 stages. First, we provided the prompt in the following format and asked for differential diagnoses:
For the following scenario, give me 5 differential diagnoses in order of likelihood using primary scientific literature only. List references you used for each diagnosis. [Prompt]
After ChatGPT provided an answer, we next asked it to provide a treatment plan:
What is your treatment plan for your most likely diagnosis? Please include any consults, laboratory or radiographic studies, and treatments (medical and/or surgical) in your response
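The two-stage querying described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: it only assembles the two quoted prompts as strings and calls no model, and the names `DIAGNOSIS_TEMPLATE`, `TREATMENT_PROMPT`, and `build_queries` are hypothetical.

```python
# Illustrative sketch: assemble the two prompts quoted in the text.
# No model is called; all names here are hypothetical.

DIAGNOSIS_TEMPLATE = (
    "For the following scenario, give me 5 differential diagnoses in order of "
    "likelihood using primary scientific literature only. List references you "
    "used for each diagnosis. {vignette}"
)

TREATMENT_PROMPT = (
    "What is your treatment plan for your most likely diagnosis? Please include "
    "any consults, laboratory or radiographic studies, and treatments (medical "
    "and/or surgical) in your response"
)


def build_queries(vignette: str) -> list[str]:
    """Return the stage 1 and stage 2 messages in the order they are sent."""
    stage_one = DIAGNOSIS_TEMPLATE.format(vignette=vignette.strip())
    return [stage_one, TREATMENT_PROMPT]


# Placeholder vignette text; the study's vignettes are in the supplement.
for stage, message in enumerate(build_queries("<vignette text>"), start=1):
    print(f"Stage {stage}: {message}")
```

Keeping the treatment question as a fixed second message mirrors the design in the text, where the management plan is requested for whatever diagnosis the model ranked first.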
Example queries and outputs are provided in Supplemental Figure S1, available online. We collated the ChatGPT responses and created a survey asking participants to rate the difficulty of the prompt and report how much they agreed with ChatGPT outputs on a 5‐point Likert scale. The scale for difficulty is as follows: very easy (1), somewhat easy (2), neutral (3), somewhat difficult (4), and very difficult (5). The scale for agreement is as follows: strongly disagree (1), somewhat disagree (2), neutral (3), somewhat agree (4), and strongly agree (5). We distributed the survey to all attending physicians at our tertiary academic center via Research Electronic Data Capture, a secure web‐based application for surveys and databases. Full ChatGPT outputs and survey are also provided in Supplemental Figure S1, available online. ChatGPT 3.5 was utilized to begin writing the introduction of this paper. The original output for the introduction is provided in Supplemental Figure S2, available online.
Statistical Analysis
Data were analyzed using IBM SPSS Statistics Version 27. Summary statistics are reported for vignette difficulty, diagnostic accuracy, and treatment plan for each item and overall: median, interquartile range, and range. Univariate ordinal regression was then performed examining the relationship between vignette difficulty and diagnostic and treatment accuracy. Odds ratios with a 95% confidence interval are reported. Statistical significance was determined at p < .05.
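As a rough illustration of the summary statistics named above, the sketch below computes the median, interquartile range, and range for one vignette's Likert ratings. The ratings are invented for illustration and are not study data, and `summarize` is a hypothetical helper. Quartiles here use Python's inclusive interpolation method, which is an assumption; SPSS may use a different percentile definition, so IQR values can differ slightly for small samples.

```python
from statistics import median, quantiles


def summarize(ratings):
    """Median, interquartile range, and (min, max) for 1-5 Likert ratings."""
    # Quartiles by inclusive linear interpolation (an assumed method).
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return {
        "median": median(ratings),
        "iqr": q3 - q1,
        "range": (min(ratings), max(ratings)),
    }


# Hypothetical ratings from 11 raters for a single vignette (not study data).
example = [1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3]
print(summarize(example))
```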
Results
Eleven attending physicians provided complete responses to our survey, which represents a response rate of 61%. The specialty representation included 5 head and neck surgeons, 1 facial plastics surgeon, 1 laryngologist, 1 neuro‐otologist, 1 rhinologist, and 2 pediatric otolaryngologists (data not shown). Table 1 lists the 20 clinical vignettes with the median difficulty rating, interquartile range (IQR), and full range for each. The lowest median score was 1.00 for prompts #3, #9, #15, and #17. The highest median score was 4.00 for prompt #5. The spread varied by vignette, with IQR ranging from 0 to 2.00. Overall, the median difficulty was 2.00 with an IQR of 1.00. This suggests that attending physicians typically viewed the vignettes as easy, with modest variation.
Table 1.
Prompt | Median (IQR) | Range |
---|---|---|
1 | 2.00 (1.00) | 1.00‐3.00 |
2 | 2.00 (1.00) | 1.00‐3.00 |
3 | 1.00 (1.00) | 1.00‐3.00 |
4 | 2.00 (1.00) | 1.00‐4.00 |
5 | 4.00 (1.00) | 2.00‐5.00 |
6 | 2.00 (2.00) | 2.00‐5.00 |
7 | 2.00 (1.00) | 1.00‐4.00 |
8 | 2.00 (1.00) | 1.00‐4.00 |
9 | 1.00 (0) | 1.00‐2.00 |
10 | 2.00 (1.00) | 1.00‐4.00 |
11 | 2.00 (1.00) | 1.00‐4.00 |
12 | 2.00 (2.00) | 1.00‐5.00 |
13 | 2.00 (1.00) | 1.00‐4.00 |
14 | 2.00 (0) | 1.00‐4.00 |
15 | 1.00 (1.00) | 1.00‐3.00 |
16 | 3.00 (2.00) | 1.00‐4.00 |
17 | 1.00 (1.00) | 1.00‐4.00 |
18 | 2.00 (2.00) | 1.00‐4.00 |
19 | 2.00 (1.00) | 1.00‐4.00 |
20 | 2.00 (2.00) | 1.00‐4.00 |
Overall | 2.00 (1.00) | 1.00‐5.00 |
Abbreviation: IQR, interquartile range.
Table 2 displays the 5 differential diagnoses provided by ChatGPT for each clinical vignette and the attending physicians' agreement with the provided differential diagnoses. There was high agreement across all vignettes, with most achieving a median score of 5.00. The lowest median score was 3.00 for prompt #19. The IQR for each prompt was small, usually 1.00 or less. Prompt #19 had the highest spread with an IQR of 2.00. Overall, there was again high agreement across all the vignettes with a median of 5.00 and IQR of 1.00.
Table 2.
Prompt | Differential diagnoses | Median (IQR) | Range |
---|---|---|---|
1 | 1. Orbital blowout fracture 2. Orbital contusion 3. Superior rectus/levator palpebrae superioris muscle injury 4. Cranial nerve palsy (specifically, oculomotor nerve palsy) 5. Periorbital hematoma | 5.00 (1.00) | 4.00‐5.00 |
2 | 1. Acute bacterial sinusitis 2. Acute viral rhinosinusitis 3. Allergic rhinitis 4. Fungal sinusitis 5. Dental infection with sinus involvement | 5.00 (1.00) | 4.00‐5.00 |
3 | 1. Squamous cell carcinoma (SCC) 2. Verrucous carcinoma: A rare, well‐differentiated variant of SCC 3. Oral pyogenic granuloma 4. Major aphthous ulcer (Sutton's disease) 5. Oral lichen planus | 5.00 (0) | 4.00‐5.00 |
4 | 1. Spontaneous cerebrospinal fluid leak 2. Idiopathic intracranial hypertension (pseudotumor cerebri) 3. Chronic tension‐type headache 4. Migraine without aura 5. Sinus headache (noninfectious) | 5.00 (1.00) | 2.00‐5.00 |
5 | 1. Granulomatosis with polyangiitis (formerly known as Wegener's granulomatosis) 2. Extranodal natural killer/T‐cell lymphoma, nasal type 3. Invasive fungal sinusitis (eg, mucormycosis or aspergillosis) 4. Tuberculosis infection with sinonasal involvement 5. Syphilis with sinonasal manifestations | 4.00 (3.00) | 1.00‐5.00 |
6 | 1. Vestibular schwannoma (acoustic neuroma) 2. Ramsay Hunt syndrome (herpes zoster oticus) 3. Ménière's disease 4. Temporal bone fracture 5. Autoimmune inner ear disease | 5.00 (1.00) | 4.00‐5.00 |
7 | 1. Chronic otitis media with cholesteatoma 2. Chronic serous otitis media 3. Adhesive otitis media 4. Tympanosclerosis 5. Congenital cholesteatoma | 5.00 (0) | 5.00‐5.00 |
8 | 1. Pheochromocytoma with a cervical paraganglioma 2. Cervical lymphadenopathy due to metastatic malignancy 3. Schwannoma of the cervical sympathetic chain 4. Branchial cleft cyst 5. Tuberculous lymphadenitis | 5.00 (1.00) | 4.00‐5.00 |
9 | 1. Auricular hematoma 2. Perichondritis 3. Traumatic ear laceration 4. Acute otitis externa 5. Temporomandibular joint dysfunction | 5.00 (0) | 5.00‐5.00 |
10 | 1. Sudden sensorineural hearing loss 2. Acoustic trauma 3. Barotrauma 4. Eustachian tube dysfunction 5. Meniere's disease (early stage) | 5.00 (0) | 5.00‐5.00 |
11 | 1. Subglottic stenosis 2. Tracheomalacia 3. Vocal cord dysfunction 4. Bronchiolitis obliterans 5. Reactive airway dysfunction syndrome | 4.00 (1.00) | 3.00‐5.00 |
12 | 1. Sjögren's syndrome 2. Sarcoidosis 3. Mikulicz's disease (IgG4‐related disease) 4. Recurrent parotitis 5. Lymphoma | 4.00 (1.00) | 1.00‐5.00 |
13 | 1. Goiter associated with Hashimoto's thyroiditis 2. Thyroid nodule(s) 3. Thyroid cyst 4. Thyroid cancer 5. Thyroid lymphoma | 5.00 (0) | 4.00‐5.00 |
14 | 1. Orbital cellulitis 2. Preseptal cellulitis (periorbital cellulitis) 3. Frontal sinusitis with subperiosteal abscess 4. Ethmoid sinusitis with orbital extension 5. Allergic rhinitis with a secondary bacterial infection | 4.00 (1.00) | 3.00‐5.00 |
15 | 1. Peritonsillar abscess 2. Infectious mononucleosis (Epstein‐Barr virus infection) 3. Bacterial tonsillitis 4. Lemierre's syndrome 5. Viral pharyngitis | 5.00 (0) | 4.00‐5.00 |
16 | 1. Vocal fold nodules 2. Muscle tension dysphonia 3. Chronic laryngitis 4. Reinke's edema (polypoid corditis) 5. Vocal fold paralysis or paresis | 4.00 (1.00) | 2.00‐5.00 |
17 | 1. Recurrent laryngeal nerve injury or palsy 2. Vocal cord paralysis 3. Aspiration pneumonia 4. Laryngeal edema 5. Laryngopharyngeal reflux | 5.00 (1.00) | 2.00‐5.00 |
18 | 1. Congenital muscular torticollis 2. Sternocleidomastoid tumor (fibromatosis colli) 3. Lymphatic malformation (cystic hygroma) 4. Branchial cleft cyst 5. Hemangioma | 5.00 (1.00) | 4.00‐5.00 |
19 | 1. Acute suppurative thyroiditis 2. Subacute (De Quervain's) thyroiditis 3. Lymphadenitis 4. Thyroglossal duct cyst infection 5. Abscess | 3.00 (2.00) | 1.00‐5.00 |
20 | 1. Acute otitis media with complications 2. Mastoiditis 3. Gradenigo's syndrome 4. Cholesteatoma with complications 5. Temporal bone osteomyelitis | 4.00 (1.00) | 2.00‐5.00 |
Overall | | 5.00 (1.00) | 1.00‐5.00 |
Abbreviations: ChatGPT, chat generative pretrained transformer; IgG4, immunoglobulin G4; IQR, interquartile range.
Table 3 reports abbreviated management plans for the most likely differential diagnosis provided by ChatGPT, not the treatment plan best suited for the vignette. Full ChatGPT outputs are provided as supplemental information (Supplemental Figure S1, available online). Again, there was high agreement with the treatment plan for what ChatGPT considered the most likely diagnosis, with the vast majority of prompts achieving a median score of at least 4.00. The lowest median score was 3.00 for prompt #19. The highest median score was 5.00. The lowest IQR was 0 for prompts #6, #7, and #9. The highest IQR was 4.00 for prompt #5. Prompts #12 and #19 had an IQR of 3.00. Overall, the median agreement with treatment was 5.00 with an IQR of 1.00.
Table 3.
Prompt | Treatment | Median (IQR) | Range |
---|---|---|---|
1 | Based on the given scenario, the most likely diagnosis is an orbital blowout fracture. Here's a suggested treatment plan: Consults: ophthalmology, OMFS. Radiographic studies: CT scan of orbits and facial bones. Medical treatment: pain control, tetanus prophylaxis if immunization is not up to date, antibiotics may be prescribed. Surgical treatment: depending on the extent of the fracture, muscle entrapment, and any functional or cosmetic concerns, surgical repair may be necessary within 1‐2 weeks of injury. Surgery typically involves reconstruction of the orbital floor or wall with implants or grafts, as well as the release of any entrapped muscles. Follow‐up: regular follow‐up appointments with ophthalmology and maxillofacial surgery | 4.00 (2.00) | 1.00‐5.00 |
2 | The most likely diagnosis, in this case, is acute bacterial sinusitis. Here's a suggested treatment plan: Consults: ENT if symptoms persist despite medical therapy. Radiographic studies: CT sinus if there is no significant improvement within 48‐72 hours of starting antibiotics. Medical treatment: antibiotics (amoxicillin‐clavulanate for 5‐10 days in adults; doxycycline or a respiratory fluoroquinolone if the patient has a penicillin allergy). Symptomatic relief: pain relief, oral or topical nasal decongestant no longer than 3 days to avoid rebound congestion, nasal saline irrigation. Follow‐up: educate the patient on the importance of completing the full course of antibiotics, follow up in 10‐14 days | 5.00 (1.00) | 4.00‐5.00 |
3 | Given the patient's history and presentation, the most likely diagnosis is squamous cell carcinoma (SCC) of the tongue. Here's a suggested treatment plan: Consults: ENT and OMFS. Radiographic studies: contrast‐enhanced CT or MRI of the head and neck; CXR or CT chest may be considered to rule out distant metastasis. Other studies: biopsy of the lesion should be performed and, along with imaging, can be used to stage the tumor for treatment. Treatment: may include a combination of surgery, radiation therapy, and chemotherapy | 5.00 (1.00) | 4.00‐5.00 |
4 | The most likely diagnosis, in this case, is a spontaneous cerebrospinal fluid leak. Here's a suggested treatment plan: Consults: ENT and/or NSGY if CSF leak does not resolve with conservative measures. Radiographic studies: high‐resolution CT of sinuses, MRI with intrathecal gadolinium. Other studies: LP to measure opening pressure and obtain CSF for analysis. Medical treatment: bed rest with head elevation, analgesics, and prophylactic antibiotics to minimize the risk of meningitis if necessary. Surgical treatment: endoscopic surgical repair if the leak does not resolve with conservative measures, lumbar drain or shunt if the leak is not identifiable or if multiple leaks are present. Follow‐up: weight loss/obesity management, monitor for recurrence of symptoms; if they persist or worsen, consider repeat imaging and re‐evaluating the treatment plan | 5.00 (2.00) | 2.00‐5.00 |
5 | The most likely diagnosis, in this case, is granulomatosis with polyangiitis. Here's a suggested treatment plan: Consults: rheumatology, ENT. Radiographic studies: CXR, CT chest. Labs: CBC, CMP, ESR, CRP, ANCA, UA. Biopsy: nasal or sinus biopsy. Medical treatment: induction high‐dose corticosteroids with a DMARD; maintenance with less toxic DMARD. Follow‐up: rheumatology, monitor laboratory markers, assess for end organ complications | 4.00 (4.00) | 1.00‐5.00 |
6 | Based on the patient's history and presentation, the most likely diagnosis is a vestibular schwannoma (acoustic neuroma). Here's a suggested treatment plan: Consults: neuro‐otologist, NSGY, audiologist. Radiographic studies: MRI head with contrast. Treatment options: observation (small tumors with mild symptoms), stereotactic radiosurgery (serviceable hearing and tumor < 3 cm), surgical resection (large tumors with significant symptoms or when hearing preservation is not possible). Follow‐up: facial nerve rehabilitation, neuro‐otology, NSGY | 5.00 (0) | 1.00‐5.00 |
7 | The most likely diagnosis is chronic otitis media with cholesteatoma. Here is a treatment plan for this condition: Consults: ENT, audiologist. Radiographic studies: high‐resolution CT of the temporal bones. Medical treatment: treat concurrent infections with antibiotics and/or corticosteroid ear drops. Surgical treatment: excision, which may include canal‐wall‐up or canal‐wall‐down mastoidectomy with tympanoplasty. Follow‐up: monitor for postoperative complications, regular follow‐up with otoscopic exams and audiograms, manage eustachian tube dysfunction and allergic rhinitis | 5.00 (0) | 1.00‐5.00 |
8 | The most likely diagnosis, in this case, is pheochromocytoma with a cervical paraganglioma. Here's a treatment plan for this patient: Consults: endocrinology, ENT. Radiographic studies: CT or MRI of the neck with contrast, abdominal CT or MRI. Labs: plasma or 24‐hour urine metanephrines and catecholamines, CBC, renal function tests, and electrolytes. Medical treatment: alpha‐blockers and subsequently beta‐blockers. Surgical treatment: laparoscopic or open adrenalectomy and neck dissection. Follow‐up: monitor patient's blood pressure and catecholamine levels, adjust antihypertensive medications, imaging and laboratory testing for recurrence | 4.00 (1.00) | 2.00‐5.00 |
9 | For the most likely diagnosis of auricular hematoma, the treatment plan would include the following steps: Consults: ENT. Treatment: cold compress, aspiration or incision and drainage, pressure dressing or bolster after drainage, antibiotic prophylaxis, pain management. Follow‐up: ENT within 1 week to monitor healing and removal of pressure dressing, avoid contact sports or other potentially traumatic activities until healing, and protective headgear in the future for prevention | 5.00 (0) | 4.00‐5.00 |
10 | The most likely diagnosis, in this case, is sudden sensorineural hearing loss. Here's a treatment plan for this patient: Consults: ENT. Radiographic studies: MRI brain and IAC with contrast. Labs: CBC, renal function tests, electrolytes, autoimmune labs, and infectious etiologies (eg, Lyme disease, syphilis) if indicated. Treatment: oral corticosteroids preferably within 72 hours of onset with a 10‐14 day taper, intratympanic corticosteroid injections if oral treatment is contraindicated or ineffective, supportive care including hearing aids or assistive listening devices as necessary. Follow‐up: follow‐up audiograms and adjust treatments as necessary | 4.00 (1.00) | 3.00‐5.00 |
11 | The most likely diagnosis in this case is subglottic stenosis. The treatment plan for subglottic stenosis includes the following: Consults: ENT, pulmonology. Radiographic studies: CT neck and chest. Other studies: flexible laryngoscopy, pulmonary function tests. Medical treatment: inhaled corticosteroids and bronchodilators. Surgical options: endoscopic dilation, endoscopic lysis of adhesions, laser surgery, open surgical repair. Follow‐up: speech therapy, pulmonary rehabilitation, regular follow‐up with ENT and pulmonology to monitor for recurrence | 5.00 (1.00) | 3.00‐5.00 |
12 | The most likely diagnosis, in this case, is Sjögren's syndrome. Here's a treatment plan for this patient: Consults: rheumatology, ophthalmology. Radiographic studies: CXR or CT chest, parotid gland ultrasound. Labs: CBC, renal function tests, electrolytes, liver function tests, ANA, anti‐Ro/SSA, anti‐La/SSB, ESR, CRP, immunoglobulins. Other studies: Schirmer's test, minor salivary gland biopsy if clinically indicated. Treatment: artificial tears and lubricating eye ointments, saliva substitutes, sugar‐free gum or lozenges, hydroxychloroquine, NSAIDs, pilocarpine, or cevimeline. Follow‐up: close monitoring for potential complications, regular dental care, and good oral hygiene | 4.00 (3.00) | 1.00‐5.00 |
13 | The most likely diagnosis, in this case, is a goiter associated with Hashimoto's thyroiditis. The treatment plan for goiter in the context of Hashimoto's thyroiditis includes the following: Consults: endocrinology. Radiographic studies: thyroid ultrasound. Labs: thyroid function tests. Other studies: fine‐needle aspiration biopsy. Medical treatment: levothyroxine for hypothyroidism and optionally for goiter size reduction. Surgical treatment: thyroidectomy (partial or total) if there are compressive symptoms, malignancy is confirmed or strongly suspected, or goiter is cosmetically bothersome. Follow‐up: endocrinology, periodic thyroid ultrasound, and thyroid function tests | 5.00 (1.00) | 2.00‐5.00 |
14 | The most likely diagnosis, in this case, is orbital cellulitis. Here's a treatment plan for this patient: Consults: ENT, ophthalmology. Radiographic studies: CT of orbit and sinuses with contrast. Labs: CBC, blood cultures, culture from sinus or conjunctival swab if possible. Medical treatment: hospital admission, broad‐spectrum intravenous antibiotics to cover Staphylococcus aureus and Streptococcus species, adjust antibiotics when culture results are available, analgesics and antipyretics, nasal decongestants. Surgical treatment: consider drainage of abscess or endoscopic sinus surgery in cases of severe infection, abscess, or lack of clinical improvement with conservative management. Follow‐up: monitor for complications including vision loss, cavernous sinus thrombosis, and intracranial spread of infection | 5.00 (1.00) | 1.00‐5.00 |
15 | The most likely diagnosis, in this case, is a peritonsillar abscess. Here is a suggested treatment plan: Consults: ENT. Radiographic studies: CT neck with contrast. Labs: CBC, blood cultures. Medical treatment: intravenous hydration, intravenous broad‐spectrum antibiotics to cover group A Streptococcus and anaerobes, analgesics and antipyretics, adjust antibiotics based on culture and clinical response. Surgical treatment: needle aspiration, incision and drainage, or tonsillectomy. Follow‐up: monitor patient's clinical response to treatment, follow up with ENT for recovery, and discuss elective tonsillectomy if recurrent tonsillitis is an issue | 5.00 (1.00) | 2.00‐5.00 |
16 | The most likely diagnosis, in this case, is vocal fold nodules. Here's a treatment plan for this patient: Consults: ENT, speech‐language pathologist. Other studies: laryngoscopy, videostroboscopy. Medical treatment: voice therapy, vocal hygiene, voice rest, vocal warm‐ups and cool‐downs. Surgical treatment: microlaryngoscopy with excision or laser ablation if conservative management fails to yield improvement or in severe cases. Follow‐up: monitor for improvement with voice therapy and consider re‐evaluating diagnosis and exploring alternative treatments if there is no significant improvement in a few months | 4.00 (1.00) | 1.00‐5.00 |
17 | Based on the most likely diagnosis of recurrent laryngeal nerve injury or palsy following coronary artery bypass graft surgery, I can suggest a general approach for management: Consults: ENT. Radiographic studies: CT or MRI of neck and chest. Other studies: laryngoscopy. Medical treatment: voice and swallow therapy. Surgical treatment: injection laryngoplasty, medialization thyroplasty, or arytenoid adduction. Follow‐up: aspiration precautions, pneumonia prevention, follow‐up with ENT and speech‐language pathology | 5.00 (1.00) | 4.00‐5.00 |
18 | The most likely diagnosis, in this case, is congenital muscular torticollis. Here's a treatment plan for this patient: Consults: pediatrician, physical therapist. Radiographic studies: neck ultrasound. Medical treatment: physical therapy, parent education for strengthening and stretching exercises, head positioning techniques. Surgical treatment: surgical release of the sternocleidomastoid muscle in severe cases or cases where physical therapy fails to yield improvement. Follow‐up: pediatrician, physical therapist | 5.00 (1.00) | 2.00‐5.00 |
19 | Based on the most likely diagnosis of acute suppurative thyroiditis, I can suggest a general approach for management: Consults: endocrinology, ENT. Radiographic studies: neck ultrasound, CT neck with contrast. Labs: CBC, thyroid function tests. Other studies: fine‐needle aspiration. Medical treatment: intravenous broad‐spectrum antibiotics until culture results are available, after which adjust the antibiotics for a 2‐3 week course, pain and fever management. Surgical treatment: drainage of abscess or in rare cases thyroidectomy (partial or total). Follow‐up: endocrinology, ENT | 3.00 (3.00) | 1.00‐5.00 |
20 | The most likely diagnosis, in this case, is acute otitis media with complications. Here's a treatment plan for this patient: Consults: ENT, audiology. Imaging studies: CT of the temporal bone. Other studies: CBC, CRP, ESR. Medical treatment: broad‐spectrum antibiotics, analgesics, antipyretics. Surgical treatment: myringotomy with or without tympanostomy tube placement or a mastoidectomy if complications such as mastoiditis or cholesteatoma are identified. Follow‐up: close monitoring for improvement, complications, and need for surgery | 4.00 (2.00) | 2.00‐5.00 |
Overall | | 5.00 (1.00) | 1.00‐5.00 |
Abbreviations: ANA, antinuclear antibody; ANCA, antineutrophil cytoplasmic antibody; CBC, complete blood count; ChatGPT, chat generative pretrained transformer; CMP, comprehensive metabolic panel; CRP, C‐reactive protein; CSF, cerebrospinal fluid; CT, computed tomography; CXR, chest X‐ray; DMARD, disease‐modifying antirheumatic drug; ENT, otolaryngology; ESR, erythrocyte sedimentation rate; IAC, internal auditory canal; IQR, interquartile range; LP, lumbar puncture; MRI, magnetic resonance imaging; NSAIDs, nonsteroidal anti‐inflammatory drugs; NSGY, neurosurgery; OMFS, oral maxillofacial surgery; SCC, squamous cell carcinoma; UA, urinalysis.
To elucidate the potential limitations of ChatGPT in diagnosis and treatment, we performed univariate ordinal regression between prompt difficulty, diagnosis score, and treatment score. Strongly agree (Likert score of 5) was used as the reference level. Odds ratios with 95% confidence intervals and p values are presented in Table 4. Prompt difficulty was not a significant predictor of the diagnostic score: the odds of having a higher diagnostic score did not vary with increasing Likert score for prompt difficulty. Similarly, there was no association between prompt difficulty and treatment score. The diagnostic score, however, was significantly related to the treatment score. A lower diagnostic score was associated with lower odds of a higher treatment score; more simply put, lower diagnostic scores tended to accompany lower treatment scores. Odds ratios for diagnostic scores of 1, 2, 3, and 4 were 0.381, 0.330, 0.289, and 0.301, respectively. Diagnostic scores of 2 and 4 reached statistical significance (p = .050 and p < .001, respectively), while a score of 3 trended toward significance (p = .052). A diagnostic score of 1 did not reach significance (p = .307), likely because of the small number of ratings at this level, as evidenced by its wide confidence interval. Overall, this suggests that if there is no strong agreement with the diagnosis, agreement with treatment will also tend to be lower. Again, we emphasize that the treatment score is based on the top differential diagnosis selected by ChatGPT and not necessarily the diagnosis most otolaryngologists would select.
Table 4.
Model A: dependent, diagnosis score; independent, prompt difficulty. Model B: dependent, treatment score; independent, prompt difficulty. Model C: dependent, treatment score; independent, diagnosis score.
Likert score | OR (A) | 95% CI (A) | p Value (A) | OR (B) | 95% CI (B) | p Value (B) | OR (C) | 95% CI (C) | p Value (C) |
---|---|---|---|---|---|---|---|---|---|
1 | 2.01 | (0.275, 14.30) | .495 | 0.482 | (0.046, 5.00) | .542 | 0.381 | (0.060, 2.44) | .307 |
2 | 0.853 | (0.121, 5.99) | .874 | 0.273 | (0.027, 2.80) | .274 | 0.330 | (0.110, 1.00) | .050 |
3 | 0.865 | (0.106, 7.03) | .892 | 0.868 | (0.071, 10.70) | .912 | 0.289 | (0.082, 1.01) | .052 |
4 | 0.660 | (0.087, 5.00) | .162 | 0.232 | (0.021, 2.53) | .231 | 0.301 | (0.170, 0.533) | <.001 |
5 | 1 | ‐ | ‐ | 1 | ‐ | ‐ | 1 | ‐ | ‐ |
Abbreviations: CI, confidence interval; OR, odds ratio.
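As a rough consistency check on Table 4, an odds ratio and its 95% confidence interval can be converted back to a log-odds coefficient, a standard error, and a two-sided Wald p value. The sketch below assumes the intervals are symmetric Wald intervals on the log-odds scale, which the source does not state; `wald_from_ci` is a hypothetical helper and not part of the authors' SPSS analysis.

```python
import math


def wald_from_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Back-calculate (beta, se, z, p) from an OR and its 95% CI.

    Assumes a Wald interval: exp(beta - z_crit * se) to exp(beta + z_crit * se).
    """
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = beta / se
    # Two-sided p value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, z, p


# Treatment score regressed on diagnosis score, values taken from Table 4.
for score, (or_, lo, hi) in {2: (0.330, 0.110, 1.00), 4: (0.301, 0.170, 0.533)}.items():
    beta, se, z, p = wald_from_ci(or_, lo, hi)
    print(f"diagnosis score {score}: beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.4f}")
```

Under that assumption, the row for a diagnostic score of 4 (OR 0.301, 95% CI 0.170 to 0.533) gives a p value well below .001, and the row for a score of 2 (OR 0.330, 95% CI 0.110 to 1.00) gives a p value close to .05, in line with the values reported in the table.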
Discussion
As a result of increasing data availability and accessibility to high‐performance AI technologies, AI is rapidly integrating across many industries, including health care. One of the most promising applications of AI in health care is diagnostics. The incorporation of publicly available and culturally ubiquitous AI systems, such as ChatGPT, will be inevitable in medicine. Whether the user is a layperson or a clinician, it is imperative that physicians, particularly those in specialized care, understand its potential and limitations.
Several authors have begun to examine how ChatGPT might become involved in medicine. Some emphasize its role in supporting clinical tasks by helping create medical reports, patient‐specific forms and handouts, medical publishing, administration, and provider education. 6 In fact, we used ChatGPT to assist in writing the introduction. While it required significant human editing, ChatGPT provided a strong framework to start with (Supplemental Figure S2, available online).
To date, no study has evaluated ChatGPT's performance in clinical otolaryngology, but several studies have found it to perform well in adjacent fields. Despite not being trained on a specific data set, ChatGPT performed at the level of a first‐year resident in plastic surgery on the in‐service training exam. 7 , 8 In neurosurgery, ChatGPT performed worse than the average user on Self‐Assessment Neurosurgery questions but better than residents in some topics. 9 Clearly, there is already some rudimentary capacity in providing specialty care. This is consistent with the findings of our study which demonstrated a high rating from physicians for easy to moderately difficult clinical vignettes.
As evidenced in our study, when presented with clinical vignettes of well‐known clinical issues using medical jargon and curated relevant history, physical exam, and radiographic and laboratory findings, ChatGPT provides a very accurate differential diagnosis and reasonable treatment plans. This is likely due to the similarity of the vignettes provided in this study to the type of writing found in textbooks, scientific papers, and other data sources the AI model was trained on. This also explains why we failed to identify an association between prompt difficulty and diagnosis and treatment scores. ChatGPT performed better than the authors of this study expected for the difficulty of the prompts provided. If more difficult prompts were provided, we suggest that variations in diagnosis and treatment scores would become more apparent. Also, the apparent discordance between human‐assigned difficulty and the performance of the AI system is not surprising when we understand that the relative strengths and weaknesses between human intelligence and AI are different. Humans will tend to assign more difficulty to rare and esoteric conditions and treatments because the amount of data we can store is limited and we will prioritize more common and useful information. Computers do not have these limitations and can have access to more information than any person can have in their mind and obscurity is a trivial barrier for them. Nuanced diagnosis and treatment of more common clinical problems presented by patients in nonmedical jargon will likely be perceived as an easy problem by human evaluators but be difficult for AI systems.
Furthermore, it remains unclear how well ChatGPT would perform if provided with real‐world, open‐ended, free‐form histories, complete exams, and uncurated data, which often contain irrelevant, extraneous, and contradictory information. Unfiltered clinical information may prove too challenging for current LLMs to narrow into meaningful diagnoses, particularly in specialty fields. Several studies have demonstrated that its performance on a variety of clinical vignettes is impressive but inferior to that of human physicians. 5 , 10 Future studies may compare how ChatGPT performs with open‐ended inputs provided by otolaryngology patients (eg, “Why do I have nasal obstruction?”). Finally, some studies note that ChatGPT may generate different responses when the same prompt is entered multiple times, which undermines its clinical reliability. 11
ChatGPT's role from the patient's perspective has been explored in certain settings. ChatGPT is able to answer a variety of patient questions about colonoscopy in a digestible and generally satisfactory manner. 12 It is also able to provide high‐quality answers to patient questions in pre‐ and postoperative care in oral and maxillofacial surgery, though the author suggests it be used in conjunction with surgeon experience. 13 Within otolaryngology, ChatGPT provides patients with procedure‐specific instructions equivalent to institutional standards. 14 However, ChatGPT's instructions were equal to those from Google in understandability but inferior in actionability. 14 Patients are likely to find ChatGPT to be a useful and accessible resource. Although the quality of its answers for patients is surprisingly high, it cannot be used as a direct substitute for a physician's counseling. As alluded to above and by other authors, it may serve as a useful adjunct or starting point for patient handouts. Patients using ChatGPT should be counseled on these limitations. Furthermore, its use in guiding general practice providers in specialty care has not been assessed but remains an interesting concept.
Some limitations of ChatGPT may be inherent to its underlying data modeling. For instance, it lacks knowledge of events occurring after September 2021 and does not learn from its own experience. 2 Additionally, as its creators note, ChatGPT may produce artificial hallucinations (ie, “produce content that is nonsensical or untruthful in relation to certain sources”), which can be difficult to discern as its output becomes increasingly believable. 2 Within biomedical research, this phenomenon is apparent when ChatGPT returns fake references when asked to cite its sources. 15 How this affects its capacity for medical decision‐making is not readily apparent.
Finally, from an ethics perspective, it should be reiterated that applications of AI systems like ChatGPT in medicine remain niche, and their generalizability is still in its infancy. Moreover, unknown bias in ChatGPT can have significant and unintended consequences for patient outcomes. AI systems inherit the characteristics of the training sets on which they are based, which can in turn perpetuate disparities and biases in race, sex, and culture. 16 , 17 , 18 Thus, it is imperative both to train on large amounts of high‐quality data that accurately and equitably represent a diverse patient population and to understand intrinsic bias when applying ChatGPT. Physicians who consider using ChatGPT in clinical practice must understand this aspect. The burgeoning field of bioethics will also have to contemplate the risks and benefits of the use of AI in health care. 19
Conclusion
Overall, ChatGPT is a promising technology for both patients and physicians. Our study demonstrated that within otolaryngology, given highly curated vignettes of easy to moderate difficulty, it provided differential diagnoses and treatment plans with which physicians highly agreed. However, both physicians and patients need to be aware of the challenges and limitations of this LLM before implementing it in clinical practice. Physicians, medical societies, and patients, among other important stakeholders, should be involved in the development and application of these nascent technologies in medicine.
Author Contributions
Roy W. Qu, study design, data acquisition, statistical analysis, data interpretation, and manuscript preparation; Uneeb Qureshi, data acquisition, study design, data interpretation, and manuscript preparation; Garrett Petersen, data acquisition, study design, data interpretation, and manuscript preparation; Steve C. Lee, study design, data interpretation, and manuscript preparation. All authors approve the manuscript and agree to be accountable for all aspects of the work presented herein.
Disclosures
Competing interests
The authors have no conflicts of interest to declare.
Funding source
None.
Supporting information
References
- 1. Deng J, Lin Y. The benefits and challenges of ChatGPT: an overview. Front Comput Intell Syst. 2023;2(2):81‐83. doi:10.54097/fcis.v2i2.4465
- 2. OpenAI. GPT‐4 technical report. OpenAI; 2023.
- 3. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. doi:10.2196/45312
- 4. Feng Y, Wang Y, Zeng C, Mao H. Artificial intelligence and machine learning in chronic airway diseases: focus on asthma and chronic obstructive pulmonary disease. Int J Med Sci. 2021;18(13):2871‐2889. doi:10.7150/ijms.58191
- 5. Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic accuracy of differential‐diagnosis lists generated by generative pretrained transformer 3 Chatbot for clinical vignettes with common chief complaints: a pilot study. Int J Environ Res Public Health. 2023;20(4):3378. doi:10.3390/ijerph20043378
- 6. Biswas S. ChatGPT and the future of medical writing. Radiology. 2023;307(2):e223312. doi:10.1148/radiol.223312
- 7. Humar P, Asaad M, Bengur FB, Nguyen V. TEMPORARILY REMOVED: ChatGPT is equivalent to first‐year plastic surgery residents: evaluation of ChatGPT on the plastic surgery in‐service examination. Aesthet Surg J. 2023. doi:10.1093/asj/sjad130
- 8. Gupta R, Herzog I, Park JB, et al. TEMPORARILY REMOVED: performance of ChatGPT on the plastic surgery inservice training examination. Aesthet Surg J. 2023. doi:10.1093/asj/sjad128
- 9. Hopkins BS, Nguyen VN, Dallas J, et al. ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board‐style questions. J Neurosurg. 2023. doi:10.3171/2023.2.JNS23419
- 10. Rao A, Pang M, Kim J, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv. 2023. doi:10.1101/2023.02.21.23285886
- 11. Hopkins AM, Logan JM, Kichenadasse G, Sorich MJ. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm‐shift. JNCI Cancer Spectrum. 2023;7(2):pkad010. doi:10.1093/jncics/pkad010
- 12. Lee TC, Staller K, Botoman V, Pathipati MP, Varma S, Kuo B. ChatGPT answers common patient questions about colonoscopy. Gastroenterology. 2023;165:509‐511. doi:10.1053/j.gastro.2023.04.033
- 13. Balel Y. Can ChatGPT be used in oral and maxillofacial surgery? J Stomatol Oral Maxillofac Surg. 2023:101471. doi:10.1016/j.jormas.2023.101471
- 14. Ayoub NF, Lee YJ, Grimm D, Balakrishnan K. Comparison between ChatGPT and Google search as sources of postoperative patient instructions. JAMA Otolaryngol Head Neck Surg. 2023;149:556. doi:10.1001/jamaoto.2023.0704
- 15. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15(2):e35179. doi:10.7759/cureus.35179
- 16. Li H, Moon JT, Purkayastha S, Celi LA, Trivedi H, Gichoya JW. Ethics of large language models in medicine and medical research. Lancet Digital Health. 2023;5:e333‐e335. doi:10.1016/S2589-7500(23)00083-3
- 17. Lucy L, Bamman D. Gender and representation bias in GPT‐3 generated stories. Association for Computational Linguistics; 2021.
- 18. Abid A, Farooqi M, Zou J. Large language models associate Muslims with violence. Nat Mach Intell. 2021;3(6):461‐463. doi:10.1038/s42256-021-00359-2
- 19. Amann J, Blasimme A, Vayena E, Frey D, Madai VI. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inform Decis Mak. 2020;20(1):310. doi:10.1186/s12911-020-01332-6