Abstract
Purpose:
To assess the accuracy and completeness of information generated by three large language models (LLMs) about antibody drug conjugate (ADC)-associated ocular toxicities.
Methods:
Twenty-two questions about ADCs, tisotumab vedotin (TV), and mirvetuximab soravtansine (MIRV) were developed and input into ChatGPT 4.0, Bard, and LLaMA. Answers were rated by four ocular toxicity experts using standardized 6-point Likert scales for accuracy and completeness. ANOVA tests were conducted to compare the three sub-groups, followed by pairwise t-tests. Interrater variability was assessed with Fleiss kappa tests.
Results:
The mean accuracy score was 4.62 (SD 0.89) for ChatGPT, 4.77 (SD 0.90) for Bard, and 4.41 (SD 1.09) for LLaMA. Both ChatGPT (p=0.03) and Bard (p=0.003) scored significantly better for accuracy when compared to LLaMA. The mean completeness score was 4.43 (SD 0.91) for ChatGPT, 4.57 (SD 0.93) for Bard, and 4.42 (SD 0.99) for LLaMA. There were no significant differences in completeness scores between groups. Fleiss kappa assessment for interrater variability was good (0.74) for accuracy and fair (0.31) for completeness.
Conclusions:
All three LLMs had relatively high accuracy and completeness ratings, showing that LLMs can provide sufficient answers for niche topics in ophthalmology. Our results indicate that ChatGPT and Bard may be slightly better than LLaMA at providing accurate answers. As further research and treatment plans are developed for ADC-associated ocular toxicities, these LLMs should be re-assessed to determine whether their answers remain complete, accurate, and consistent with current medical knowledge.
Keywords: antibody drug conjugates, artificial intelligence, large language models, ocular adverse effects, tisotumab vedotin, mirvetuximab soravtansine
Introduction
Large language models (LLMs) are artificial intelligence (AI) chatbots that generate human-like answers to questions input by users. These models utilize machine learning and are trained on vast bodies of text to interpret user-input questions and provide real-time answers. ChatGPT (OpenAI, San Francisco, California, USA), Bard (Google, Mountain View, California, USA), and LLaMA Version 2 (Meta, Menlo Park, California, USA) are among the most widely known and used LLMs. LLMs have already begun to contribute to healthcare; in ophthalmology specifically, ChatGPT has been assessed for accuracy in providing patient-facing medical information and in assisting primary care providers with eye care [1, 2]. The newest version of ChatGPT, Version 4, has shown marked improvement over its predecessor, Version 3.5, in providing accurate diagnoses when benchmarked against corneal specialists [3], although it still struggles with uveitis and ocular inflammation questions [4]. With the rapid rise in the use of AI chatbots in many medical contexts, the validity and thoroughness of their answers need to be assessed.
Antibody drug conjugates (ADCs) are a novel class of oncologic medications that use a humanized monoclonal antibody attached to a cytotoxic payload to selectively target cell surface molecules upregulated in malignancies. Tisotumab vedotin (Tivdak®, Seagen Inc., Seattle, USA and Genmab, Copenhagen, Denmark) and mirvetuximab soravtansine (Elahere®, ImmunoGen Inc., Waltham, Massachusetts) are two US Food and Drug Administration (FDA) approved medications used for the treatment of cervical cancer and ovarian or fallopian tube tumors, respectively. These two novel drugs were chosen for this study because they are FDA approved ADCs with similar treatment indications (gynecological malignancies) and are associated with pronounced ocular surface adverse effects. Ocular surface adverse effects occur in 53% of patients treated with tisotumab [5] and 41–53% of patients treated with mirvetuximab [6, 7]. Tisotumab ocular toxicities commonly include conjunctivitis, dry eye, keratitis, and blepharitis [5]. Mirvetuximab is associated with blurred vision, dry eye, and keratitis [8]. These ocular adverse events may lead to dose modification or delay, necessitate close coordination of care between ophthalmologists and oncologists, and require regular ophthalmic examinations [9].
While adverse effects are common with ADC treatment, few ophthalmologists regularly manage drug-induced ocular adverse events or study their outcomes as a research niche. Furthermore, the number of antineoplastic agents is increasing overall, with a plethora of ocular side effect profiles posing a diagnostic challenge for many ophthalmologists. LLMs could help enhance patients' and general ophthalmologists' knowledge about these medications in real time, but it is important to understand whether accurate and complete information can be provided through these models. Therefore, we aim to assess how three different LLMs, ChatGPT Version 4, Bard, and LLaMA Version 2, provide information about tisotumab (TV)- and mirvetuximab (MIRV)-associated corneal toxicities and the management of these ocular events.
Methods
Study Design.
A list of questions (Table 1) assessing the structure and mechanism of action of ADCs, Tisotumab (TV)-specific ocular toxicity and adverse effect management, and Mirvetuximab (MIRV)-specific ocular toxicity and adverse effect management was developed by the researchers and reviewed by an ocular inflammation and immunology expert for validity. Six general ADC questions, eight TV-specific questions, and eight MIRV-specific questions were subsequently input into ChatGPT Version 4.0, Bard, and LLaMA Version 2 in November 2023. Each question was input 3 times in the same chat session by one researcher per chatbot (RM, HX, NK) to ensure reproducibility of answers across the three platforms. Because repeated answers were similar, the first answer from each chat session was used for this study. Questions and answers were then compiled into Microsoft surveys and sent to 4 ocular toxicity experts across the country, including three ocular oncologists and a uveitis specialist with a collective 25 years of post-fellowship clinical experience.
Table 1:
ADC Questions Asked LLMs
| General ADC Questions |
|---|
| 1. What is an antibody drug conjugate? |
| 2. What are the different parts of an antibody drug conjugate? |
| 3. What is a cytotoxic payload? |
| 4. What is the mechanism of action of an antibody drug conjugate? Be specific. |
| 5. What are the benefits of using an antibody drug conjugate compared to traditional chemotherapy? |
| 6. What are the FDA approved antibody drug conjugates used for the treatment of gynecological malignancies? |
| Tisotumab Vedotin and Ocular Toxicity: |
| 7. What is the mechanism of action of tisotumab vedotin? |
| 8. How common are ocular adverse effects in tisotumab vedotin? |
| 9. What are the most common ocular adverse effects from tisotumab vedotin? Be specific. |
| 10. What is the proposed mechanism of the ocular toxicity of tisotumab vedotin? |
| 11. How should tisotumab vedotin ocular adverse effects be managed? |
| 12. Is there a specific mitigation plan for preventing ocular adverse effects from tisotumab vedotin? |
| 13. What are some warning signs patients should be aware of for tisotumab vedotin ocular toxicity? |
| 14. When should tisotumab vedotin be discontinued due to ocular toxicity? |
| Mirvetuximab Soravtansine and Ocular Toxicity: |
| 15. What is the mechanism of action of mirvetuximab soravtansine? |
| 16. How common are ocular adverse effects in mirvetuximab soravtansine? |
| 17. What are the most common ocular adverse effects from mirvetuximab soravtansine? |
| 18. What is the proposed mechanism of the ocular toxicity of mirvetuximab soravtansine? |
| 19. How should mirvetuximab soravtansine ocular adverse effects be managed? |
| 20. Is there a specific mitigation plan for preventing ocular adverse effects from mirvetuximab soravtansine? |
| 21. What are some warning signs patients should be aware of for mirvetuximab soravtansine ocular toxicity? Be specific. |
| 22. When should mirvetuximab soravtansine be discontinued due to ocular toxicity? |
Antibody drug conjugate (ADC) questions input into ChatGPT, Bard, and LLaMA. Six general ADC, eight Tisotumab vedotin (TV), and eight Mirvetuximab soravtansine (MIRV) questions were asked.
Likert Scales.
Answers were rated on accuracy and completeness using 6-point, standardized Likert scales. Accuracy was rated as 1–completely incorrect; 2–more incorrect than correct; 3–approximately equal correct and incorrect; 4–more correct than incorrect; 5–nearly all correct; 6–correct. Completeness was rated as 1–very incomplete, addresses less than half of the question, but significant parts are missing or incomplete; 2–somewhat incomplete, addresses around half of the question, but significant parts are missing or incomplete; 3–slightly incomplete, addresses more than half of the question, but parts are missing or incomplete; 4–adequate, addresses all aspects of the question, but provides the minimum amount of information required to be considered complete; 5–comprehensive, addresses all aspects of the question with detail; 6–comprehensive, addresses all aspects of the question and provides additional information or context beyond what was expected.
Statistics.
Means, standard deviations (SD), and Fleiss kappa values for interrater variability were calculated for accuracy and completeness ratings (scored 1–6) in each of the three question categories. ANOVA tests were conducted between the groups, followed by post-hoc pairwise t-tests to compare ratings between LLMs. This study was considered exempt by the Johns Hopkins Institutional Review Board.
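The analysis pipeline described above can be sketched in Python. This is a minimal illustration with made-up ratings; the study's actual rating data are not reproduced here, so all numbers below are hypothetical.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(42)

# Hypothetical per-question accuracy ratings (1-6 Likert) for 22 questions,
# one array per LLM; these stand in for the averaged expert ratings.
chatgpt = rng.integers(3, 7, size=22).astype(float)
bard = rng.integers(3, 7, size=22).astype(float)
llama = rng.integers(2, 7, size=22).astype(float)

# One-way ANOVA across the three LLMs, then post-hoc pairwise t-tests
# (paired, since the same 22 questions were rated for each model).
_, p_anova = stats.f_oneway(chatgpt, bard, llama)
_, p_gpt_llama = stats.ttest_rel(chatgpt, llama)
_, p_bard_llama = stats.ttest_rel(bard, llama)

# Fleiss kappa for interrater agreement: rows = 22 answers, columns = 4
# raters, entries = the 1-6 rating each rater assigned to that answer.
ratings = rng.integers(1, 7, size=(22, 4))
table, _ = aggregate_raters(ratings)  # per-answer counts in each category
kappa = fleiss_kappa(table, method="fleiss")

print(f"ANOVA p={p_anova:.3f}, kappa={kappa:.2f}")
```

With random ratings the kappa hovers near zero; the study's reported values (e.g. 0.74 for accuracy) reflect genuine agreement among the four raters.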
Results
Twenty-two ChatGPT, Bard, and LLaMA answers regarding ADC ocular toxicities were assessed, with overall mean accuracy and completeness scores for ChatGPT (Accuracy: 4.63, SD 0.90; Completeness: 4.43, SD 0.91), Bard (A: 4.77, SD 0.90; C: 4.57, SD 0.93), and LLaMA (A: 4.41, SD 1.09; C: 4.42, SD 0.99) (Table 2). No answers received the lowest accuracy or completeness score. Fifteen percent of answers received the highest accuracy score (15% ChatGPT, 17% Bard, 14% LLaMA), and 14% received the highest completeness score (10% ChatGPT, 17% Bard, 14% LLaMA). The three LLMs significantly differed in accuracy (p=0.05) but not completeness (p=0.52) (Table 3). ChatGPT was rated significantly more accurate than LLaMA (p=0.03), as was Bard (p=0.003). Interrater kappa assessment was good (0.74) for accuracy and fair (0.31) for completeness.
Table 2:
Questions Ratings on Accuracy and Completeness between 3 AI Chatbots
| | | Accuracy | | | Completeness | | |
|---|---|---|---|---|---|---|---|
| | | ChatGPT | Bard | LLaMA | ChatGPT | Bard | LLaMA |
| All Questions, n=22 | Mean | 4.63 | 4.77 | 4.41 | 4.43 | 4.57 | 4.42 |
| | SD | 0.90 | 0.90 | 1.09 | 0.91 | 0.93 | 0.99 |
| General ADC Questions, n=6 | Mean | 4.96 | 4.79 | 4.58 | 4.75 | 4.29 | 4.50 |
| | SD | 0.81 | 0.98 | 1.32 | 0.90 | 0.95 | 1.35 |
| TV Questions, n=8 | Mean | 4.63 | 4.81 | 4.19 | 4.41 | 4.72 | 4.31 |
| | SD | 1.04 | 0.86 | 1.15 | 1.07 | 0.92 | 0.90 |
| MIRV Questions, n=8 | Mean | 4.38 | 4.72 | 4.50 | 4.22 | 4.63 | 4.47 |
| | SD | 0.75 | 0.92 | 0.84 | 0.71 | 0.91 | 0.76 |
Mean and standard deviation (SD) of ratings for antibody drug conjugate (ADC) question categories, including all questions, general ADC questions, Tisotumab vedotin (TV) questions, and Mirvetuximab soravtansine (MIRV) questions. Accuracy was evaluated with a 6-point scale (1–completely incorrect; 2–more incorrect than correct; 3–approximately equal correct and incorrect; 4–more correct than incorrect; 5–nearly all correct; 6–correct), and completeness was rated with a 6-point scale (1–very incomplete, addresses less than half of the question, but significant parts are missing or incomplete; 2–somewhat incomplete, addresses around half of the question, but significant parts are missing or incomplete; 3–slightly incomplete, addresses more than half of the question, but parts are missing or incomplete; 4–adequate, addresses all aspects of the question, but provides the minimum amount of information required to be considered complete; 5–comprehensive, addresses all aspects of the question with detail; 6–comprehensive, addresses all aspects of the question and provides additional information or context beyond what was expected).
Table 3:
ANOVA Test P Values Between Comparison Groups
ANOVA Test P Values for Accuracy and Completeness Between LLMs in each Question Subcategory

| | All Questions | General ADC Questions | TV Questions | MIRV Questions |
|---|---|---|---|---|
| Accuracy | **0.05** | 0.47 | **0.05** | 0.26 |
| Completeness | 0.52 | 0.34 | 0.22 | 0.13 |

ANOVA Test P Values for Accuracy and Completeness Between Question Subcategories for each LLM

| | ChatGPT | Bard | LLaMA |
|---|---|---|---|
| Accuracy | **0.05** | 0.91 | 0.35 |
| Completeness | 0.10 | 0.10 | 0.74 |
P values for ANOVA tests between comparison groups. P values <0.05 were considered statistically significant and are bolded in the table. Abbreviations: ADC: antibody drug conjugate; LLM: large language model; TV: Tisotumab vedotin; MIRV: Mirvetuximab soravtansine.
The six general questions about ADC drug design and mechanism of action were rated with mean scores for ChatGPT (A: 4.96, SD 0.81; C: 4.75, SD 0.90), Bard (A: 4.79, SD 0.98; C: 4.29, SD 0.95), and LLaMA (A: 4.58, SD 1.32; C: 4.50, SD 1.35). There were no significant differences in accuracy (p=0.47) or completeness (p=0.34) between the 3 LLMs on general ADC questions. Interrater variability was moderate (0.63) for accuracy and fair (0.34) for completeness.
Eight TV questions were rated on accuracy and completeness: ChatGPT (A: 4.63, SD 1.04; C: 4.41, SD 1.07), Bard (A: 4.81, SD 0.86; C: 4.72, SD 0.92), and LLaMA (A: 4.19, SD 1.15; C: 4.31, SD 0.90). There were significant differences in accuracy (p=0.05), with both ChatGPT (p=0.02) and Bard (p=0.006) rated more accurate than LLaMA. There was no significant difference in completeness between groups (p=0.22). Interrater variability was very good (0.99) for accuracy and good (0.79) for completeness.
Among the eight MIRV questions, mean accuracy and completeness were rated for ChatGPT (A: 4.38, SD 0.75; C: 4.22, SD 0.71), Bard (A: 4.72, SD 0.92; C: 4.63, SD 0.91), and LLaMA (A: 4.50, SD 0.84; C: 4.47, SD 0.76). Significant differences were not noted for accuracy (p=0.26) or completeness (p=0.13). Interrater assessment was very good (0.83) for accuracy and poor (0.08) for completeness.
Between the three question subcategories, there were no significant differences in accuracy for Bard (p=0.91) or LLaMA (p=0.95), and no differences in completeness for any of the three LLMs (p=0.10, p=0.22, p=0.74). ChatGPT showed significant differences in accuracy between the question categories (p=0.05). Specifically, ChatGPT scored better (p=0.004) on general questions (4.75, SD 0.90) than on MIRV questions (4.22, SD 0.71). There was no significant difference (p=0.09) between ChatGPT general (4.75, SD 0.90) and TV questions (4.41, SD 1.07).
Discussion:
Overall, all three LLMs were scored as accurate, with mean ratings between 4 (more correct than incorrect) and 5 (nearly all correct). Similarly, all three LLMs scored between 4 (adequate) and 5 (comprehensive) for completeness; however, interrater variability was lower for completeness (0.31) than for accuracy. Both ChatGPT and Bard were rated as more accurate than LLaMA overall. In similar studies, ChatGPT Version 4 has been shown to outperform Bard [10, 11] for common ophthalmic complaints and LLaMA for the multispecialty recruitment exam [12]. However, our study is unique in that Bard performed overall just as well as ChatGPT Version 4 and even outperformed ChatGPT in the MIRV question subcategory.
These results are comparable to prior studies showing that LLMs can provide high quality answers about corneal diseases, uveitis, and other comprehensive ophthalmology pathologies [3, 11, 13, 14]. Our study showed results similar to those for these broader ophthalmic topics, indicating that LLMs are also helpful for rarer topics in ophthalmology. Across the three question subcategories, ChatGPT scored more accurately on general ADC questions (4.75) than on specific MIRV ocular toxicity questions (4.22) (p=0.004). This may reflect that, while the LLMs were generally rated highly, they are still better equipped to provide general knowledge than to address more complex and specific ocular toxicities.
TV and MIRV are novel drugs early in their clinical use. The current approach to ocular toxicity often involves treatment de-escalation, pauses, or discontinuation, requiring close coordination between ophthalmologists and oncologists. However, the pathophysiology of their toxicities is still not fully understood [15], and treatment plans may continue to evolve as research expands and novel preventative therapies for ocular toxicity are developed [16]. Our study supports the hypothesis that AI chatbots have a role in emerging research topics in ophthalmology; specifically, they can synthesize large bodies of new research into straightforward responses [17]. However, LLMs remain significantly limited by the date of their most recent corpus update. For example, ChatGPT Version 4 is currently trained on data up to April 2023, underscoring its inability to provide the most updated medical information.
There are several limitations to this study. First, the sample size of four physicians is small and may not represent the range of assessments and opinions about treating ADC ocular toxicities. Similarly, twenty-two questions cannot fully capture the intricate toxicities of these novel drugs. This study also relied on subjective survey-based answers prone to survey bias and was limited by poor interrater agreement for overall completeness scores (kappa 0.31) as well as for completeness of general ADC (kappa 0.34) and MIRV (kappa 0.08) questions. Future studies with a larger sample size and a wider variety of ocular toxicities are warranted to draw stronger conclusions.
Proposed roles for LLMs in ophthalmology include ophthalmic education, developing differential diagnoses, synthesizing literature for research, and making care more accessible [17, 18]. This study supports the hypothesis that LLMs can provide simple, accurate responses for niche topics in ophthalmology, including ADC-associated ocular toxicities. As LLMs continue to develop, both patients and medical professionals are exploring how AI chatbots can assist in ophthalmic medical care and education. Our study shows that ChatGPT and Bard both outperformed LLaMA and gave answers comparably accurate to those reported for more general ophthalmic topics. Future studies evaluating LLMs' ability to remain current with updated research and to continue providing relevant answers about ADC ocular toxicity will be of great interest.
Funding:
Dracopoulos Uveitis Research Fund.
Footnotes
Conflict of interest: Rayna Marshall: none, Hannah Xu: none, Lauren A. Dalvin: none, Kapil Mishra: none, Camellia Edalat: none, Nila Kirupaharan: none, Jasmine H. Francis: none, Meghan Berkenstock: Seagen Incorporated, Sanofi Pharmaceuticals
Data availability:
No dataset is associated with this study.
References
- 1. Tan TF, et al. Generative Artificial Intelligence Through ChatGPT and Other Large Language Models in Ophthalmology: Clinical Applications and Challenges. Ophthalmol Sci, 2023. 3(4): p. 100394.
- 2. Tan Yip Ming C, et al. The Potential Role of Large Language Models in Uveitis Care: Perspectives After ChatGPT and Bard Launch. Ocul Immunol Inflamm, 2023: p. 1–5.
- 3. Delsoz M, et al. Performance of ChatGPT in Diagnosis of Corneal Eye Diseases. Cornea, 9900.
- 4. Jiao C, et al. Evaluating the Artificial Intelligence Performance Growth in Ophthalmic Knowledge. Cureus, 2023. 15(9): p. e45700.
- 5. de Bono JS, et al. Tisotumab vedotin in patients with advanced or metastatic solid tumours (InnovaTV 201): a first-in-human, multicentre, phase 1–2 trial. Lancet Oncol, 2019. 20(3): p. 383–393.
- 6. Coleman RL, et al. Efficacy and safety of tisotumab vedotin in previously treated recurrent or metastatic cervical cancer (innovaTV 204/GOG-3023/ENGOT-cx6): a multicentre, open-label, single-arm, phase 2 study. Lancet Oncol, 2021. 22(5): p. 609–619.
- 7. Martin LP, et al. Characterization of folate receptor alpha (FRα) expression in archival tumor and biopsy samples from relapsed epithelial ovarian cancer patients: A phase I expansion study of the FRα-targeting antibody-drug conjugate mirvetuximab soravtansine. Gynecol Oncol, 2017. 147(2): p. 402–407.
- 8. Matulonis UA, et al. Efficacy and Safety of Mirvetuximab Soravtansine in Patients With Platinum-Resistant Ovarian Cancer With High Folate Receptor Alpha Expression: Results From the SORAYA Study. J Clin Oncol, 2023. 41(13): p. 2436–2445.
- 9. Richardson DL. Ocular toxicity and mitigation strategies for antibody drug conjugates in gynecologic oncology. Gynecol Oncol Rep, 2023. 46: p. 101148.
- 10. Zandi R, et al. Exploring Diagnostic Precision and Triage Proficiency: A Comparative Study of GPT-4 and Bard in Addressing Common Ophthalmic Complaints. Bioengineering (Basel), 2024. 11(2).
- 11. Al-Sharif EM, et al. Evaluating the Accuracy of ChatGPT and Google BARD in Fielding Oculoplastic Patient Queries: A Comparative Study on Artificial versus Human Intelligence. Ophthalmic Plastic & Reconstructive Surgery, 9900. doi: 10.1097/IOP.0000000000002567.
- 12. Tsoutsanis P and Tsoutsanis A. Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam. Computers in Biology and Medicine, 2024. 168: p. 107794.
- 13. Marshall RF, et al. Investigating the Accuracy and Completeness of an Artificial Intelligence Large Language Model About Uveitis: An Evaluation of ChatGPT. Ocular Immunology and Inflammation: p. 1–4.
- 14. Bernstein IA, et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Network Open, 2023. 6(8): p. e2330320.
- 15. Nguyen TD, Bordeau BM, and Balthasar JP. Mechanisms of ADC Toxicity and Strategies to Increase ADC Tolerability. Cancers (Basel), 2023. 15(3).
- 16. Lindgren ES, et al. Incidence and Mitigation of Corneal Pseudomicrocysts Induced by Antibody–Drug Conjugates (ADCs). Current Ophthalmology Reports, 2024.
- 17. Kedia N, et al. ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology. Eye, 2024.
- 18. Betzler BK, et al. Large language models and their impact in ophthalmology. Lancet Digit Health, 2023. 5(12): p. e917–e924.
