This cross-sectional study assesses the accuracy, understandability, and actionability of responses generated by 4 artificial intelligence chatbots to the most common search queries about the 5 most common cancers.
Key Points
Question
What is the quality of cancer-related health information generated by artificial intelligence (AI) chatbots?
Findings
In this cross-sectional study, the responses of 4 AI chatbots to the top search queries related to the 5 most common cancers were of high quality but were written at a college reading level and had poor actionability.
Meaning
Findings of this study suggest that AI chatbots are an accurate and reliable supplementary resource for medical information but are limited in their readability and should not replace health care professionals for individualized health care questions.
Abstract
Importance
Consumers are increasingly using artificial intelligence (AI) chatbots as a source of information. However, the quality of the cancer information generated by these chatbots has not yet been evaluated using validated instruments.
Objective
To characterize the quality of information and presence of misinformation about skin, lung, breast, colorectal, and prostate cancers generated by 4 AI chatbots.
Design, Setting, and Participants
This cross-sectional study assessed AI chatbots’ text responses to the 5 most commonly searched queries related to the 5 most common cancers using validated instruments. Search data were extracted from the publicly available Google Trends platform, and identical prompts were used to generate responses from 4 AI chatbots: ChatGPT version 3.5 (OpenAI), Perplexity (Perplexity.AI), Chatsonic (Writesonic), and Bing AI (Microsoft).
Exposures
Google Trends’ top 5 search queries related to skin, lung, breast, colorectal, and prostate cancer from January 1, 2021, to January 1, 2023, were input into 4 AI chatbots.
Main Outcomes and Measures
The primary outcomes were the quality of consumer health information based on the validated DISCERN instrument (scores from 1 [low] to 5 [high] for quality of information) and the understandability and actionability of this information based on the understandability and actionability domains of the Patient Education Materials Assessment Tool (PEMAT) (scores of 0%-100%, with higher scores indicating a higher level of understandability and actionability). Secondary outcomes included misinformation scored using a 5-point Likert scale (scores from 1 [no misinformation] to 5 [high misinformation]) and readability assessed using the Flesch-Kincaid Grade Level readability score.
Results
The analysis included 100 responses from 4 chatbots about the 5 most common search queries for skin, lung, breast, colorectal, and prostate cancer. The quality of text responses generated by the 4 AI chatbots was good (median [range] DISCERN score, 5 [2-5]) and no misinformation was identified. Understandability was moderate (median [range] PEMAT Understandability score, 66.7% [33.3%-90.1%]), and actionability was poor (median [range] PEMAT Actionability score, 20.0% [0%-40.0%]). The responses were written at the college level based on the Flesch-Kincaid Grade Level score.
Conclusions and Relevance
Findings of this cross-sectional study suggest that AI chatbots generally produce accurate information for the top cancer-related search queries, but the responses are not readily actionable and are written at a college reading level. These limitations suggest that AI chatbots should be used supplementarily and not as a primary source for medical information.
Introduction
Large language models and natural language processing tools are designed to understand natural language and generate responses to text-based prompts. Such artificial intelligence (AI) chatbots generate responses based on a large corpus of data including articles, websites, and other publicly available text data.1
Artificial intelligence chatbots are rapidly becoming a primary source of information for patients, and chatbot responses to medical queries may influence health-related behavior. Among AI chatbots, ChatGPT (OpenAI) has shown promise in accurately answering medical questions, even US Medical Licensing Examination–style questions.2 Additionally, a multispecialty team of academic physicians found ChatGPT’s responses to a diverse set of medical queries to be largely accurate and comprehensive.1
However, the quality of AI chatbot responses to frequently searched medical queries, especially those related to cancer, has not yet been evaluated through validated instruments. To our knowledge, no study has reported on the quality, understandability, actionability, and readability of AI chatbots’ responses to cancer-related medical queries. To this end, we assessed the quality of responses generated by 4 AI chatbots to the most common search queries concerning the 5 most common cancers in the US.3
Methods
This cross-sectional study was exempt from review and informed consent in accordance with the Common Rule given its use of publicly available data. The Standards for Reporting Qualitative Research (SRQR) reporting guideline was followed.
The top Google (Alphabet, Inc) search queries related to lung, skin, colorectal, breast, and prostate cancers in the US from January 1, 2021, to January 1, 2023, were identified using Google Trends. This publicly available web tool provides data on search-term volume for the Google search engine over time. The top 5 search queries related to each of these 5 cancers were identified and input into 4 AI chatbots: ChatGPT version 3.5, Perplexity (Perplexity.AI), Chatsonic (Writesonic), and Bing AI (Microsoft). eTable 1 in Supplement 1 lists the search terms, and eTable 2 in Supplement 1 provides sample responses. These chatbots were used in their default settings in the most recent publicly accessible version as of April 4, 2023. For the 3 chatbots with customizable settings, Chatsonic was left in its default setting (Google-integrated, concise responses, memory disabled), Perplexity was left in its default concise setting, and Bing AI was set to return balanced results. For each new search query, a new conversation was initiated to prevent the memory of previous queries from influencing responses to subsequent queries. The inputs into the AI chatbots matched the exact phrasing from Google Trends.
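As an illustration of how such queries can be retrieved programmatically, the sketch below uses the unofficial pytrends Python wrapper for Google Trends. This library was not part of the study protocol (queries were taken from the Google Trends web interface), and the keywords and time window shown are assumptions for demonstration; results returned by the API may differ from the web tool.

```python
# Illustrative sketch only: pulling the top related queries for each cancer type
# over the study window with the unofficial pytrends wrapper for Google Trends.
# The study itself used the public Google Trends web interface, not this library.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)

cancers = ["skin cancer", "lung cancer", "breast cancer",
           "colorectal cancer", "prostate cancer"]

for term in cancers:
    # Restrict to US searches between January 1, 2021, and January 1, 2023.
    pytrends.build_payload([term], timeframe="2021-01-01 2023-01-01", geo="US")
    related = pytrends.related_queries()        # dict keyed by search term
    top_queries = related[term]["top"]          # DataFrame with 'query' and 'value'
    print(term, "->", top_queries["query"].head(5).tolist())
```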
Each response was evaluated for the quality of information using 2 validated instruments: DISCERN4 (overall scores of 1 [low] to 5 [high] for quality of information) and the understandability and actionability domains of the Patient Education Materials Assessment Tool (PEMAT)5 (scores of 0%-100%, with higher scores indicating a higher level of understandability and actionability). Both instruments show strong interrater reliability.4,6 The presence of misinformation was analyzed using a published 5-point Likert scale (range, 1 [none] to 5 [high])7 based on the National Comprehensive Cancer Network (NCCN) guidelines.8 Two of us (A.P., D.M.) independently scored responses based on the NCCN guidelines; the scorers were blinded to the AI chatbot type and to each other’s scores. There was 95% agreement between the 2 scorers. Readability was evaluated using the Flesch-Kincaid Grade Level (range, 5 [easy to read] to 16 [most difficult to read]).9 Citations from AI chatbot responses also were recorded to evaluate the sources of information used to generate responses. The AI-generated texts were recorded, and evaluation scores were analyzed with Excel, version 16.20 (Microsoft Corp).
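For context, both numeric measures reduce to simple formulas. The expressions below summarize the published scoring rules (the standard Flesch-Kincaid Grade Level formula and the PEMAT domain score as described in the AHRQ user’s guide); they are reference summaries, not additional analyses performed in this study.

```latex
% Flesch-Kincaid Grade Level (standard published formula)
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right)
             + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59

% PEMAT domain score (per the AHRQ user's guide; not-applicable items excluded)
\mathrm{PEMAT\ domain\ score\ (\%)} = \frac{\text{items rated as met}}{\text{applicable items}} \times 100
```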
Results
A total of 100 chatbot responses to the top 5 search queries for skin, colorectal, prostate, lung, and breast cancer were analyzed. For each cancer, the top queries included the phrases [cancer] symptoms and what is [the cancer] (eTable 1 in Supplement 1).
The 4 AI chatbots received identical query inputs, and the median overall DISCERN scores for their responses ranged from 4 to 5, indicating high-quality information (median [range] DISCERN score, 5 [2-5]; Table). The 5-point Likert scale scores indicated no misinformation in the AI chatbot responses based on NCCN guidelines. The median PEMAT understandability domain scores ranged from 50.0% to 72.7% (median [range] score, 66.7% [33.3%-90.1%]), indicating moderate understandability, whereas the median PEMAT actionability domain scores ranged from 0% to 40.0% (median [range] score, 20.0% [0%-40.0%]), indicating poor actionability.
Table. Evaluation of Artificial Intelligence Chatbot Responses to Search Queries Relating to Cancer.
| Measure | ChatGPT | Perplexity | Chatsonic | Bing AI | Overall |
|---|---|---|---|---|---|
| DISCERN score, median (range)^a | 4 (2-5) | 5 (3-5) | 5 (3-5) | 5 (3-5) | 5 (2-5) |
| PEMAT understandability score, median (range), %^b | 70.0 (41.7-88.9) | 63.6 (33.3-77.8) | 50.0 (33.3-77.8) | 72.7 (40.0-90.1) | 66.7 (33.3-90.1) |
| PEMAT actionability score, median (range), %^b | 40 (0-40) | 0 (0-40) | 20 (0-40) | 40 (0-40) | 20 (0-40) |
| Misinformation score, median (range)^c | 1 (1-1) | 1 (1-1) | 1 (1-1) | 1 (1-1) | 1 (1-1) |
| Flesch-Kincaid Grade Level, median (range)^d | 11.3 (5.7-28.9) | 12.0 (8.1-17.1) | 12.7 (9.7-18.6) | 11.6 (5.2-21.5) | 12.3 (5.2-28.9) |
| Word count, median (range) | 146 (74-243) | 91 (47-128) | 128 (70-360) | 103 (51-343) | 120 (47-360) |
| Top reference sources (No.) | None | Mayo Clinic (22); CDC (16); American Cancer Society (10) | American Cancer Society (16); Mayo Clinic (15); CDC (8) | Mayo Clinic (25); American Cancer Society (14); CDC (14) | Mayo Clinic (62); American Cancer Society (40); CDC (32) |
Abbreviations: CDC, Centers for Disease Control and Prevention; PEMAT, Patient Education Materials Assessment Tool.
^a Measures the quality of consumer health information, scored from 1 (low) to 5 (high).
^b Scored from 0% (low) to 100% (high).
^c Scored from 1 (no misinformation) to 5 (high misinformation).
^d Scored from 5 (easy to read) to 16 (most difficult to read).
Median response length ranged from 91 to 146 words across the 4 chatbots, with ChatGPT generating the longest responses. Median Flesch-Kincaid Grade Level scores ranged from 11.3 to 12.7, indicating a college reading level (Table). Top-cited sources for the Perplexity, Chatsonic, and Bing AI chatbots included government, hospital-affiliated, and independent volunteer health organizations; in contrast, ChatGPT did not cite any sources in its responses.
Discussion
According to DISCERN and misinformation scores, the 4 AI chatbots in this analysis generated high-quality responses about 5 common cancers for consumers and did not appear to spread misinformation. Three of 4 chatbots cited reputable sources, such as the American Cancer Society, Mayo Clinic, and Centers for Disease Control and Prevention. Citation of information from reputable sources is reassuring given that many studies have found substantial dissemination of misinformation about cancer through social networks.7,10
However, the responses of AI chatbots to cancer-related inquiries were written at a college reading level and were poorly actionable. This finding suggests that AI chatbots use medical terminology that may not be familiar or useful for lay audiences. Additionally, AI chatbots are limited in their ability to explain complex medical concepts without the use of visual aids. Concepts such as swollen lymph node can be challenging to explain with text alone. Moreover, the AI-generated responses were often concise, with a median word count ranging from 91 to 146 words. A concise response may not generate enough information to sufficiently explain complex medical topics to users. However, 3 of the 4 AI chatbots cited sources for users to learn more about the topics.
It is important for the public to understand that cancer symptoms can be wide ranging and can overlap with those of other health conditions, highlighting the importance of speaking with a health professional. The AI chatbots typically acknowledged their limitations in providing individualized advice and encouraged users to seek medical attention. Future studies are warranted to determine whether AI chatbots could improve clinical outcomes compared with other online sources of information by prompting users to seek individual medical attention.
Strengths and Limitations
To our knowledge, this is the first study to evaluate the quality of consumer health information from AI chatbots related to the 5 most common cancers. As the use of these platforms continues to expand, an assessment of health information quality is essential.
This study was limited to queries of AI chatbots based on the most popular internet searches according to Google Trends because search trends for the AI chatbots themselves are not publicly available. The phrasing of inputs into an AI chatbot may affect the type and quality of information generated. Additionally, AI chatbots can generate alternative responses to the same query and answer follow-up questions to clarify initial responses, which might affect the overall quality, understandability, and actionability of the information provided. Despite these limitations, our goal was to provide an initial quality assessment of AI chatbots’ responses to popular search queries about cancer. Further investigations are warranted to evaluate an AI chatbot’s ongoing dialogue with more complex questions, the quality of information from paid AI chatbots, and whether varying the wording and sequence of queries affects response quality.
Conclusions
Artificial intelligence chatbots are becoming a major source of medical information for consumers. Findings of this cross-sectional study suggest that they generally produce reliable and accurate medical information about lung, breast, colorectal, skin, and prostate cancers. However, the usefulness of the information is limited by its poor readability and lack of visual aids. These limitations suggest that AI chatbots should be used supplementarily and not as a primary source for medical information. Consistent with this role, the AI chatbots typically encouraged users to seek medical attention for cancer symptoms and treatment.
References
- 1. Johnson D, Goodman R, Patrinely J, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. Preprint posted online February 28, 2023. doi:10.21203/rs.3.rs-2566942/v1
- 2. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198
- 3. Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin. 2023;73(1):17-48. doi:10.3322/caac.21763
- 4. Charnock D, Shepperd S, Needham G, Gann R. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health. 1999;53(2):105-111. doi:10.1136/jech.53.2.105
- 5. Agency for Healthcare Research and Quality. The Patient Education Materials Assessment Tool (PEMAT) and user’s guide. Accessed April 4, 2023. https://www.ahrq.gov/health-literacy/patient-education/pemat.html
- 6. Shoemaker SJ, Wolf MS, Brach C. Development of the Patient Education Materials Assessment Tool (PEMAT): a new measure of understandability and actionability for print and audiovisual patient information. Patient Educ Couns. 2014;96(3):395-403. doi:10.1016/j.pec.2014.05.027
- 7. Loeb S, Sengupta S, Butaney M, et al. Dissemination of misinformative and biased information about prostate cancer on YouTube. Eur Urol. 2019;75(4):564-567. doi:10.1016/j.eururo.2018.10.056
- 8. National Comprehensive Cancer Network. National Comprehensive Cancer Network (NCCN) guidelines. Accessed April 4, 2023. http://www.nccn.org
- 9. Ley P, Florio T. The use of readability formulas in health care. Psychol Health Med. 1996;1(1):7-28. doi:10.1080/13548509608400003
- 10. Johnson SB, Parsons M, Dorff T, et al. Cancer misinformation and harmful information on Facebook and other social media: a brief report. J Natl Cancer Inst. 2022;114(7):1036-1039. doi:10.1093/jnci/djab141