BMC Public Health. 2025 Jun 2;25:2038. doi: 10.1186/s12889-025-23206-0

Assessing the accuracy and explainability of using ChatGPT to evaluate the quality of health news

Xiaoyu Liu 1, Lu He 2, Eman Alanazi 3, Echu Liu 1, Arianna Goss 1, Lionel Gumireddy 1
PMCID: PMC12128262  PMID: 40457340

Abstract

Background

With the growing prevalence of health misinformation online, there is an urgent need for tools that can reliably assist the public in evaluating the quality of health information. This study investigates the performance of GPT-3.5-Turbo, a representative and widely used large language model (LLM), in rating the quality of health news and providing explanatory justification for the rating assessment.

Methods

We evaluated GPT-3.5-Turbo’s performance on 3222 health news articles from an expert-annotated dataset compiled by HealthNewsReview.org, which assesses the quality of health news across nine criteria. GPT-3.5-Turbo was prompted with standardized queries tailored to each criterion. We measured its rating performance using 95% confidence intervals for precision, recall, and F1 scores in binary classification (satisfactory/not satisfactory). Additionally, linguistic complexity, readability, and the quality of GPT-3.5-Turbo’s explainability were assessed through both quantitative linguistic analysis and qualitative evaluation of consistency and contextual relevance.

Results

GPT-3.5-Turbo’s rating performance varied across criteria, with the highest accuracy for the Cost criterion (F1 = 0.824) but lower accuracy for the Benefit, Conflict, and Quality criteria (F1 < 0.5), underperforming traditional supervised machine learning models. However, its explanations were clear, with readability suited to late high school or early college levels, and they scored highly for consistency (average score: 2.90/3) and contextual relevance (average score: 2.73/3). These findings highlight GPT-3.5-Turbo’s strength in providing understandable and contextually relevant explanations, despite its limited rating accuracy.

Conclusion

While GPT-3.5-Turbo’s rating accuracy requires improvement, its strength in offering comprehensible and contextually relevant explanations presents a valuable opportunity to enhance public understanding of health news quality. Leveraging LLMs as complementary tools for health literacy initiatives could help mitigate misinformation by helping non-expert audiences interpret and assess health information.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12889-025-23206-0.

Keywords: Generative AI, Large Language model, ChatGPT, GPT-3, Health misinformation, Health news, Public health informatics

Background

The internet is a primary information source for individuals searching for health-related topics [1]. However, online health information is of mixed quality and often contains misinformation, which requires information consumers to assess and judge its quality [2]. Addressing the prevalence of health misinformation online is crucial for safeguarding public health and ensuring that individuals can access reliable and trustworthy information to make informed decisions about their health [3].

Health misinformation or low-quality health information is prevalent across many health domains, ranging from infectious diseases (COVID-19, Ebola) and chronic conditions (cancer, diabetes) to lifestyle topics (food and nutrients) [4]. Health misinformation and low-quality information can contribute to a myriad of adverse outcomes, including the adoption of harmful practices [5], the rejection of proven medical interventions [6], and the exacerbation of existing health disparities [7]. Moreover, misinformation can erode trust in authoritative sources [8], undermining efforts to disseminate accurate information and promote evidence-based practices. Additionally, vulnerable populations, such as those with limited health and digital literacy, those facing language barriers, or those from underrepresented communities, are particularly susceptible to the effects of health misinformation, further widening health inequities [9]. While healthcare professionals may engage in correcting misinformation, their capacity is difficult to scale to match the prevalence and constantly evolving nature of health misinformation. As a result, there is a growing need for automated tools that assess the quality of health news at scale so that consumers are equipped with accurate and trustworthy information.

Artificial intelligence (AI) is a promising tool to facilitate the rapid dissemination of fact-checked information by automating the process of verifying claims and providing users with accurate and up-to-date information [10]. In addition, compared to professional fact-checkers, journalists, and authoritative entities that are often perceived as traditional approaches, AI-based misinformation checking can serve as a faster, scalable, and potentially unbiased tool for identifying false information across diverse platforms. By harnessing AI technologies, public health authorities, researchers, and technology companies can collectively develop scalable solutions for detecting and addressing health misinformation, ultimately promoting health literacy and empowering individuals to make informed decisions about their health.

Existing AI studies on misinformation classification and information quality assessment can be categorized into two broad approaches: veracity-based and criteria-based [11]. Veracity-based approaches focus on assessing the factual accuracy of health-related claims by cross-referencing them with reliable sources [12–15]. For example, Meppelink et al. [13] applied automated classification techniques to analyze vaccine-related webpages, determining information to be reliable if it aligned with authoritative guidelines. In contrast, criteria-based approaches focus on evaluating the quality of health information by using predefined metrics, such as the credibility of the source, without necessarily verifying the accuracy of individual claims [16–18]. Both methods are essential but differ in focus: veracity-based approaches assess authenticity, while criteria-based approaches evaluate content quality. Criteria-based approaches can provide information consumers with more granular and detailed insights beyond dichotomized authenticity and can also cultivate their ability to discern low-quality health information.

Besides accurately rating the quality and validity of health information, it is also essential to provide relevant and adequate explanations for the rating. Existing literature highlights the importance of providing transparent explanations to foster trust in AI-generated decisions, particularly in fact-checking applications [19–23]. Information consumers expect not only truthfulness labels but also clear, persuasive explanations. However, most AI models built for health content evaluation fall short in providing this level of transparency. Existing models that claim to be explainable, as demonstrated in the work of Ayoub et al. [24], often rely on independent interpretable machine learning methods like LIME [25] or SHAP [26], which focus on explaining how specific features influence predictions. While these tools provide insight into the model’s decision-making process, they often do not provide readable and context-relevant explanations that appeal to laypeople. This underscores the need to move beyond simply explaining model decisions to ensuring that explanations help users better understand the justification involved in health information evaluation.

Generative Pre-trained Transformer (GPT), an advanced LLM developed by OpenAI, has received increasing attention due to its capability to engage in human-like conversations and its promising performance in multiple domains [27, 28]. Pre-trained on massive real-world internet data, GPT is well-positioned to analyze and detect patterns in online texts such as health news and to provide readable guidance to users. Existing studies have tested the performance of GPT on identifying and responding to health misinformation [27, 29], but these evaluations have typically focused on brief health claims or myths. Similarly, Luo et al. [30] assessed the accuracy, effectiveness, readability, and applicability of GPT-4 and ERNIE on 20 health rumors. Choi et al. [31] evaluated whether pre-trained and fine-tuned LLMs can match claims made in COVID-related social media posts to previously fact-checked claims. These studies offered rich insights into the feasibility of LLMs for evaluating health information quality. However, to our knowledge, no study has evaluated LLMs’ ability to assess health information quality against pre-defined criteria that are widely adopted by human fact-checking experts. Such evaluation is crucial because it can reveal the criteria on which LLMs fall short and need additional fine-tuning. In addition, there is no systematic evaluation of GPT and other LLMs in the context of more complex health narratives, such as comprehensive reports from health news outlets that include descriptions of medical interventions, which better represent the health information that laypeople consume than expert-curated datasets do. Furthermore, few studies have employed validated, structured evaluation frameworks, and there has been limited investigation into GPT’s capacity to explain its ratings in a manner that is accessible and comprehensible to lay audiences.

To address these gaps, this study evaluates the potential of GPT-3.5-Turbo, a representative and widely used LLM, in assessing the quality of health news articles using pre-defined, rigorous criteria. The study also employs mixed methods to assess the linguistic complexity, readability, and quality of the explanations provided by GPT-3.5-Turbo. Specifically, this study aims to determine how well an LLM can replicate the human expert evaluation process used in health news quality assessment, which involves providing both a rating and a natural language explanation for that rating.

Methods

Datasets and criteria for health news evaluation

The dataset for this study was derived from HealthNewsReview.org, an online platform dedicated to assessing the completeness, accuracy, and balance of health news stories. These news stories encompass a wide range of health topics, including medical treatments, tests, products, and procedures. As of the end of its operation in 2018, the platform had reviewed 3,222 health news articles, with each evaluation detailed on its website. An example of the review is illustrated in Fig. 1.

Fig. 1.


The snapshot of a review case by HealthNewsReview.org [32] on a news article about vitamin D and prostate cancer, scoring 5/10. The review highlights inadequate discussion based on criteria such as costs, harms, alternatives, and novelty, while noting strengths in other areas

To evaluate the quality of health news, the website adopted a rating instrument that comprised ten criteria. These criteria are consistent with those used by the Australian and Canadian Media Doctor websites and are in alignment with the Association of Health Care Journalists’ Statement of Principles [33]. They are designed to evaluate the health news across several dimensions:

  1. Comprehensive discussion of costs (the Cost criterion).

  2. Quantification of the medical intervention’s benefits (the Benefit criterion).

  3. Detailed explanation and quantification of potential intervention harms (the Harm criterion).

  4. Comparison of new ideas with existing alternatives (the Alternative criterion).

  5. Inclusion of independent sources and disclosure of potential conflicts of interest (the Conflict criterion).

  6. Avoidance of disease mongering (the Mongering criterion).

  7. Evaluation of the evidence quality (the Quality criterion).

  8. Determination of the intervention’s true novelty (the Novelty criterion).

  9. Confirmation of the intervention’s availability (the Availability criterion).

  10. Reliance on a news release.

Each health news article is rated as “satisfactory,” “not satisfactory,” or “not applicable” for each criterion by a panel of three reviewers with relevant expertise. The inter-reviewer reliability was reported to be 74% [33].

For our study, we extracted the full text of each reviewed health news article from the original news outlets along with their evaluations from HealthNewsReview.org. Our study employs the first nine criteria outlined by HealthNewsReview.org. We excluded the tenth criterion regarding reliance on press releases because its assessment necessitates cross-verification with external news sources, a requirement that does not apply to the first nine criteria. Additionally, we consolidated the “not satisfactory” and “not applicable” ratings into a single “not satisfactory” category.
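The label consolidation described above can be sketched in a few lines. This is a minimal illustration only: the label strings and the function name `binarize_rating` are assumptions made for the example, and the exact spellings in the HealthNewsReview.org reviews may differ.

```python
# Hypothetical label string; the exact spelling in the source reviews may differ.
POSITIVE_LABEL = "satisfactory"


def binarize_rating(raw_label: str) -> int:
    """Map a three-way expert rating onto the binary scheme used in this study.

    Returns 1 for "satisfactory" and 0 for both "not satisfactory"
    and "not applicable".
    """
    normalized = raw_label.strip().lower()
    return 1 if normalized == POSITIVE_LABEL else 0


# Toy ratings for one article on three criteria (illustrative values only).
ratings = ["Satisfactory", "Not Satisfactory", "Not Applicable"]
binary = [binarize_rating(r) for r in ratings]
print(binary)  # [1, 0, 0]
```

Treating "not applicable" as negative keeps the task binary, at the cost of mixing two different expert judgments in the negative class.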

Prompt engineering

We performed several rounds of prompt testing and iterated the prompt design to optimize the final prompts used. Given the substantial influence that prompt engineering exerts on the output of LLMs, we standardized our prompts for each criterion of evaluation. This standardization encompassed several key components:

  • Criterion-Specific Question: Each evaluation criterion was paired with a specific question designed to elicit targeted responses from the generative LLM.

  • Criterion Explanation: Adaptations from the criteria guidelines provided by HealthNewsReview.org were included to instruct how a health news article should be assessed as satisfactory or unsatisfactory. This explanation aimed to guide the evaluation process by providing a clear standard for judgment.

  • Output Format Rules: The prompts were structured to dictate the format of ChatGPT’s responses, ensuring consistency across the generated outputs. This component was critical for facilitating a standardized analysis of the responses.

  • Full Text of Health News: The health news article to be evaluated was appended to the end of each prompt, ensuring that the model had all necessary information for its evaluation.

The rationale behind this methodical approach to prompt standardization was informed by findings from our pilot study. Our pilot testing indicated that using separate prompts for each criterion reduced delays and disruptions in communication with the model’s API, compared to a single, combined prompt for all criteria. Moreover, incorporating detailed criterion explanations and explicit format requirements was instrumental in enhancing the accuracy of the evaluations and maintaining a consistent output format. We also experimented with widely used prompting strategies such as Chain-of-Thought (CoT) [34], which guides the model to reason step by step. However, our pilot testing found that the model did not produce satisfactory results with CoT-based prompts. Why CoT performs better in tasks such as mathematical reasoning than in information-extensive tasks such as health information evaluation warrants future investigation.
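The four prompt components listed above can be assembled as in the following sketch. It is a hypothetical illustration, not the study's actual prompt: the wording of the question, explanation, and output rules is invented for the example, and the names `CriterionPrompt`, `build_prompt`, and `parse_rating` are assumptions. No model API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionPrompt:
    """One evaluation criterion with its question and guidance text."""
    name: str
    question: str
    explanation: str


# Hypothetical output-format rules; the study's exact wording is in its supplement.
OUTPUT_RULES = (
    "Answer on the first line with exactly 'Satisfactory' or 'Not Satisfactory'. "
    "Then explain your rating in a short paragraph."
)


def build_prompt(criterion: CriterionPrompt, article_text: str) -> str:
    """Join question, explanation, format rules, and article, in that order."""
    parts = [
        criterion.question,
        criterion.explanation,
        OUTPUT_RULES,
        "Article:\n" + article_text,
    ]
    return "\n\n".join(parts)


def parse_rating(response_text: str) -> int:
    """Read the binary rating from the first line of a model response."""
    lines = response_text.strip().splitlines()
    if not lines:
        raise ValueError("empty response")
    first_line = lines[0].strip().lower()
    # Check the negative label first because it contains the positive one.
    if first_line.startswith("not satisfactory"):
        return 0
    if first_line.startswith("satisfactory"):
        return 1
    raise ValueError(f"unrecognized rating line: {lines[0]!r}")


# Illustrative wording only.
cost = CriterionPrompt(
    name="Cost",
    question="Does the story adequately discuss the costs of the intervention?",
    explanation="Rate as satisfactory if the story gives specific cost details "
                "or mentions health insurance coverage.",
)
prompt = build_prompt(cost, "Full text of the health news article goes here.")
```

Building one prompt per criterion, rather than one combined prompt, mirrors the pilot finding that separate prompts were more reliable.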

In Fig. 2, we provide an example prompt for the Cost criterion along with ChatGPT’s response. Detailed information regarding the prompts for the remaining criteria is available in supplementary material.

Fig. 2.


Example prompt for Cost criterion evaluation and response by ChatGPT (GPT-3.5 Turbo). The prompt includes the specific question, criteria explanation, and output format instructions, followed by a partial view of the health news article due to space constraints

Performance evaluation

The evaluation of ChatGPT’s performance in assessing the quality of health news consists of two main parts. The first part focused on the accuracy of ChatGPT’s binary ratings of the news as either “satisfactory” or “not satisfactory.” The second part assessed the quality of explanations provided in ChatGPT’s responses to facilitate users’ understanding of the rating.

To gauge ChatGPT’s accuracy in rating the quality of health news, we utilized human evaluations from HealthNewsReview.org as the “gold standard.” We employed three metrics commonly used for automatic classification tasks: Precision, Recall, and F1 Score.

  1. Precision: it is calculated as the total number of True Positives (TP) divided by the sum of TP and False Positives (FP). Precision evaluates how many of the articles labeled as “satisfactory” by GPT-3.5-Turbo are also labeled as “satisfactory” by human evaluators from HealthNewsReview.org. For example, if GPT-3.5-Turbo rates 10 articles as satisfactory for a given criterion, but only 8 of them align with human evaluations, the precision would be 8/10 = 0.8.

  2. Recall: it is calculated as the total number of TP divided by the sum of TP and False Negatives (FN). Recall evaluates how well GPT-3.5-Turbo captures all relevant satisfactory articles identified by human evaluators. If there are 12 human-labeled satisfactory articles on a criterion, and GPT-3.5-Turbo correctly identifies 8, recall would be 8/12 = 0.67.

  3. F1 score: it is the harmonic mean of precision and recall. F1 synthesizes these two scores to provide a more balanced view of the performance. A high F1 score indicates that GPT-3.5-Turbo is performing well in both identifying relevant articles and minimizing incorrect ratings.
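The three metrics above can be computed directly from the counts. The sketch below combines the two worked examples, assuming they describe the same criterion: 8 true positives, 2 false positives (10 predicted satisfactory), and 4 false negatives (12 human-labeled satisfactory). The function name is an assumption for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 0.80 0.67 0.73
```

Because the harmonic mean is pulled toward the smaller value, the F1 of about 0.73 sits closer to the recall of 0.67 than a simple average would.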

The evaluation of GPT-3.5-Turbo’s explainability involves both quantitative linguistic analysis and qualitative content evaluation:

  1. Quantitative Linguistic Analysis: We analyzed the linguistic complexity of GPT-3.5-Turbo’s responses by calculating word and sentence counts. Additionally, we employed the Simple Measure of Gobbledygook (SMOG) [35] to estimate the educational level required for comprehension, thereby assessing the readability of the generated responses. The readability assessment was intended to ensure that the generated responses are accessible to information consumers of different literacy levels [36, 37].

  2. Qualitative Content Evaluation: We employed two key metrics, consistency and context. Each metric is guided by specific questions for the assessment:

  • Consistency: “Is the explanation consistent with the evaluation result?” This question aims to measure how aligned the explanation text is with the given rating, ensuring that the explanation logically supports the outcome.

  • Context: “Does the explanation provide proper references to support its claims?” This question focuses on the explanation’s ability to offer sufficient and relevant in-text evidence or references that substantiate its assertions, enhancing the reader’s comprehension and trust.
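For the quantitative linguistic analysis, the SMOG grade can be sketched as follows. This is an approximation under stated assumptions: it uses the commonly cited SMOG formula, 1.0430 × sqrt(polysyllables × 30 / sentences) + 3.1291, together with a crude vowel-group syllable counter. The study does not state which software computed its scores, so the results of this sketch may differ from those reported.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (a rough heuristic, not a dictionary)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)


def smog_grade(text: str) -> float:
    """Estimate the SMOG grade of a text.

    Polysyllables are words with three or more (approximate) syllables.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


# Hypothetical explanation text, for illustration only.
sample = (
    "The article does not mention the cost of the intervention. "
    "It provides no information about insurance coverage."
)
grade = smog_grade(sample)
```

The formula was designed for samples of 30 sentences, so scores for explanations of only three to seven sentences should be read as rough estimates.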

For the qualitative content evaluation, we manually evaluated 100 randomly selected health news articles, each accompanied by GPT-generated ratings and explanations for the nine criteria. The evaluation was conducted by all authors, consisting of three faculty members and two students from public health and one faculty member from health informatics. All authors worked in pairs. Each pair independently assessed the quality of the explanatory narratives using a three-point Likert scale (strong, fair, weak). To ensure consistency, all reviewers received training from the first author, who has prior experience applying the HealthNewsReview.org criteria in research on health misinformation [11]. In cases of disagreement within a pair, a third reviewer resolved the conflict. The comprehensive evaluation guidelines are detailed in the supplementary material.

Prior to the full evaluation, a pilot study was conducted to refine the evaluation criteria and procedures, ensuring clarity and reliability in the assessment process. This included additional training for the reviewers and adjustments to the Likert scale based on team discussions. The data analysis used Percentage Agreement (PA) to measure inter-rater reliability (IRR), providing a statistical estimate of the agreement between reviewers.
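Percentage Agreement is the share of items on which two reviewers assign the same rating. A minimal sketch follows; the reviewer ratings are invented for illustration and are not data from the study.

```python
def percentage_agreement(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Return the percentage of items on which two reviewers agree."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both reviewers must rate the same items")
    if not ratings_a:
        raise ValueError("no ratings provided")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical ratings from one reviewer pair on five explanations.
reviewer_a = ["strong", "strong", "fair", "weak", "strong"]
reviewer_b = ["strong", "fair", "fair", "weak", "strong"]
print(percentage_agreement(reviewer_a, reviewer_b))  # 80.0
```

Percentage Agreement does not correct for agreement expected by chance, which matters when one rating category dominates.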

Results

Evaluation of health news rating prediction

Upon the exclusion of nonfunctional links, the dataset for analysis comprised 2130 health news articles. The distribution of satisfactory to not satisfactory news items across nine criteria was as follows: Cost (421:1709), Benefit (641:1489), Harm (648:1482), Quality (724:1406), Mongering (1708:422), Conflict (1035:1095), Alternative (938:1192), Availability (1338:792), and Novelty (1474:656).

In the context of health news quality assessment, GPT-3.5-Turbo exhibited varying levels of performance across different criteria as shown in Fig. 3. The highest performance was observed for the Cost criterion with an F1 score of 0.824, indicating its strong ability to accurately assess and classify cost-related information. Conversely, the Benefit, Conflict and Quality criteria revealed areas where the model struggled with F1 scores below 0.5, highlighting potential areas for further improvement and refinement. Criteria including Harm, Mongering, Alternative, Availability, and Novelty have F1 scores in the intermediate range of 0.5 to 0.7.

Fig. 3.


Performance of ChatGPT in assessing health news quality, evaluated across criteria using Precision, Recall, and F1 Score with 95% Confidence Intervals (CI)
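The text does not state how the 95% confidence intervals were obtained. One common option is a percentile bootstrap over articles, sketched below as an assumption rather than as the method used in this study. The labels are toy data and the function names are hypothetical.

```python
import random


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the positive class (1 = satisfactory)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample articles with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lower, upper


# Toy labels, not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0] * 5
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0] * 5
low, high = bootstrap_ci(y_true, y_pred, f1_score)
```

Resampling whole articles keeps each human label paired with its model prediction, so the interval reflects sampling variability in the evaluation set.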

An error analysis was conducted, and a confusion matrix was generated to examine GPT-3.5-Turbo’s classification performance in health news evaluation, as shown in Table 1. The matrix presents the distribution of correct predictions (TP and TN) and misclassifications (FP and FN) across different criteria. Table 1 also includes examples of misclassified cases, illustrating instances where GPT-3.5-Turbo either incorrectly rated articles as satisfactory or failed to recognize satisfactory ones.

Table 1.

Confusion matrix and examples of ChatGPT’s misclassifications in health news evaluation. This table reports the number of TP, FP, FN and TN and provides examples of misclassifications (FP and FN) made by GPT-3.5-Turbo when evaluating health news

Criteria TP FP FN TN Example of Incorrect Prediction (FP and FN)
Criterion 1-Cost 143 278 47 1662 FP: GPT misinterpreted the equipment costs of 3-D mammography incurred by the facility as costs to the patient.
FN: GPT failed to recognize the stated cost of the intervention (“less than 5 pence ($0.07) per customer”) for cholesterol-lowering statin drugs.
Criterion 2 - Benefit 459 182 985 504 FP: GPT overlooked key gaps, such as the duration of benefit, impact on pain and pressure, clinical significance of fibroid shrinkage, and potential for regrowth.
FN: GPT overlooked that the news cited the FDA’s finding of no evidence for green tea’s heart disease benefits as a valid evaluation.
Criterion 3 - Harm 340 308 493 989 FP: GPT mistakenly identified a discussion of treatment benefits and response rates as coverage of harms and side effects, despite no mention of adverse events.
FN: GPT overlooked its discussion of MRI-related stigmatization concerns as a potential harm.
Criterion 4 - Quality 552 172 899 507 FP: GPT mistook procedural details and expert opinions for a discussion of evidence quality, despite no mention of study rigor (a randomized clinical trial or not).
FN: GPT overlooked the discussion of knowledge gaps, study limitations, and ongoing research on ketamine for depression as evidence quality considerations.
Criterion 5 - Mongering 895 813 146 276 FP: GPT overlooked the overstatement of multiple sclerosis prevalence and sensational framing of a worst-case scenario for patients, which could contribute to unnecessary fear around the disease and its treatment.
FN: GPT misinterpreted the lack of prevalent discussion and the framing of early-stage low-grade prostate cancer as an overstatement of severity, while the human evaluator found no evidence of exaggeration.
Criterion 6 - Conflict 99 936 55 1040 FP: GPT overlooked that the news failed to disclose the extensive financial ties of the investigators, including patents and commercial interests, which are critical for assessing potential conflicts of interest.
FN: GPT failed to recognize that the news disclosed key conflicts of interest, including the employment of two researchers by Danone Research and the inclusion of a comment from an independent source.
Criterion 7 - Alternative 515 423 459 733 FP: GPT overlooked the article’s failure to mention key alternative strategies for reducing coronary heart disease risk, such as weight loss, exercise, and stress management, which are important complementary approaches.
FN: GPT failed to recognize that the study specifically compared PSA screening vs. no screening, making the omission of alternative screening methods contextually appropriate rather than a limitation.
Criterion 8 - Availability 630 708 274 518 FP: GPT overlooked the article’s failure to clarify whether marijuana extracts are available or still experimental and to distinguish oral extracts from smoked marijuana, leaving key availability details unclear.
FN: GPT failed to recognize that the retrospective nature of the study inherently indicates the availability of erectile dysfunction drugs, making an explicit discussion of availability unnecessary.
Criterion 9 - Novelty 878 596 372 284 FP: GPT overlooked that gene modification for SCID has existed for over 20 years, making the general approach not truly novel, even though the specific technique highlighted in the story is new.
FN: GPT failed to recognize that while the AMH test itself is not new, its use as a long-term fertility detection method for women without known fertility issues is a relatively novel application.

Evaluation of the explainability

Quantitative linguistic analysis

Linguistic analysis of the generated explanation texts revealed a range in word count from 72.34 to 155.09, with texts pertaining to the Quality criterion demonstrating the highest average word count (M = 155.09, SD = 71.19), and those associated with the Cost criterion exhibiting the lowest (M = 72.34, SD = 19.29). Sentence count across the texts displayed lower variability, with average counts spanning from 3.16 to 6.56 sentences. The Quality related texts had the greatest mean sentence count (M = 6.56, SD = 4.15), while texts for the Availability criterion had the least (M = 3.16, SD = 0.91). Median sentence counts approximated the means, suggesting a symmetric distribution of sentence counts within the corpus.

The assessment of text readability via the SMOG score, which approximates the educational level required for comprehension, indicated a mean score range from 10.82 to 13.12. This suggests that the complexity of the texts is compatible with understanding levels of individuals from late high school to early college education. Of all the criteria, the Quality explanatory texts necessitated the highest level of comprehension, with mean scores reaching 13.12. Conversely, texts relating to the Cost and Availability criteria were on the lower end of the readability spectrum. Table 2 presents the aggregated surface metrics of the generated explanations, including word count, sentence count, and SMOG score.

Table 2.


Aggregated surface metrics for generated explanations by criterion. Higher mean and median values indicate longer explanations (more words and sentences) and increased reading complexity, while higher standard deviation reflects greater variability in these characteristics. Darker shades highlight higher values within the same surface metric across different criteria

Qualitative content evaluation

We assessed the explainability quality of GPT-3.5-Turbo using two metrics: consistency and context, through a manual evaluation process. In our analysis of 100 evaluated cases, the IRR for the response evaluation of each criterion consistently exceeded 70%. The IRR for the consistency metric was higher than that for the context metric.

Figures 4 and 5 present the analysis of GPT-3.5-Turbo’s explainability using the two quality measures. Consistency was high across most criteria. For the Cost criterion, 99 explanations were rated “strong” and 1 “fair,” with no “weak” ratings. Similar patterns were observed for the Benefit, Alternative, and Novelty criteria. The Mongering criterion showed more variation and the highest percentage of weak ratings. On the Context measure, GPT showed more variability, particularly for the Benefit, Quality, and Alternative criteria. When the three ratings were coded as scores (strong = 3, fair = 2, weak = 1) and averaged, GPT achieved a mean of 2.90 for Consistency and 2.73 for Context.
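The score coding can be illustrated with the Cost criterion's Consistency counts reported above (99 strong, 1 fair, 0 weak). The function name and data layout in this sketch are assumptions.

```python
# Score coding stated in the text: strong = 3, fair = 2, weak = 1.
LIKERT_SCORES = {"strong": 3, "fair": 2, "weak": 1}


def mean_likert(counts: dict[str, int]) -> float:
    """Average score given the number of explanations at each rating level."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings provided")
    weighted = sum(LIKERT_SCORES[label] * n for label, n in counts.items())
    return weighted / total


# Consistency ratings for the Cost criterion, as reported in the text.
cost_consistency = {"strong": 99, "fair": 1, "weak": 0}
print(round(mean_likert(cost_consistency), 2))  # 2.99
```

The overall averages of 2.90 and 2.73 are computed across all nine criteria, so they are lower than this single-criterion value.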

Fig. 4.


Evaluating ChatGPT’s Explainability: Results on Consistency. This figure visualizes the distribution of Strong (blue), Fair (orange), and Weak (green) ratings across nine evaluation criteria

Fig. 5.


Evaluating ChatGPT’s Explainability: Results on Context. This figure visualizes the distribution of Strong (blue), Fair (orange), and Weak (green) ratings across evaluation criteria

Discussion

In this paper, we conducted a comprehensive evaluation of a widely used LLM, GPT-3.5-Turbo, on the task of rating the quality of health news and explaining the rating results. Our findings suggest that while GPT-3.5-Turbo can simulate health news rating and explanation tasks, similar to those performed by experts at HealthNewsReview.org, its accuracy is not optimal when compared to supervised machine learning (SML) approaches, as reported by Afsana et al. [38] and Al-Jefri et al. [39] on the same dataset (Table 3). GPT’s performance was predominantly lower, except in the Cost criterion where it marginally exceeded Al-Jefri (2020)’s model. However, when it came to providing explanations based on its assessments, GPT excelled, maintaining consistency with its initial ratings and offering relevant narratives and evidence from the original text, even when the accuracy of the ratings was not optimal.

Table 3.


Comparison of ChatGPT and supervised machine learning methods in health news quality evaluation. This table compares the precision (green), recall (blue), and F1 scores (orange) of ChatGPT, Afsana (2020), and Al-Jefri (2020) in the automatic health news rating task across nine evaluation criteria. Both Afsana (2020) and Al-Jefri (2020) utilized supervised machine learning methods. Darker shades indicate higher scores, representing better performance

Our study highlights a “low accuracy but high explainability” phenomenon with GPT. GPT’s low accuracy in health news evaluation likely stems from its difficulty in understanding context, rigid application of criteria, and struggles with implicit information. It often misinterprets key details, such as mistaking a discussion of treatment benefits for coverage of harms or failing to recognize that a study’s retrospective design implicitly indicates the availability of a drug. Additionally, it applies evaluation standards too rigidly, leading to errors like incorrectly flagging an article as unsatisfactory for not mentioning alternative screening methods, even when the study specifically compared screening with no screening. These issues suggest that GPT lacks the adaptability needed to assess nuanced health reporting, often focusing on surface-level patterns and semantics rather than deeper contextual meaning. This also indicates that while GPT and other LLMs excelled in medical exams and answering patients’ questions, judging the quality of health news remains a difficult task.

Although GPT demonstrated high explainability on the consistency, context, and readability measures, the accuracy and veracity of its content remained questionable. We observed instances where GPT fabricated nonexistent information or incorrectly stated that an article lacked information it actually contained. Additionally, even when a response was rated highly for context, there were cases where the explanation included detailed references to the article but failed to recognize the relevance of the materials. These observations align with existing critiques of generative AI, including hallucination, in which a model generates details or quotes from non-existent sources [40]. This suggests that LLMs require further enhancement and even close monitoring [41], since without accurate initial assessments, the explanations built upon them can create confusion and mislead readers about the medical interventions described in the news. Given GPT’s strengths in explanation compared to supervised machine learning, future studies that seek to use ChatGPT or other LLMs for automatic evaluation and explanation tasks might benefit from a hybrid approach. This approach could leverage the accuracy of supervised machine learning while integrating the explanatory and contextual strengths of LLMs. Additionally, retrieval augmented generation (RAG) [42] could be considered to enhance the specificity, diversity, and factual accuracy of explanations, as it allows models to respond to user queries by referencing a specified set of sentences.

Our pilot study also revealed that GPT struggled with handling long texts. While it has demonstrated strong performance in tasks involving structured prompts or shorter texts, its effectiveness declined when evaluating full-length health news articles. Initially, we tested a single prompt that included all nine criteria together, but this led to server failures, inconsistent formatting, and response drift, likely due to the extensive length of the generated outputs. GPT also sometimes misinterpreted the task by incorrectly summarizing input articles. To address these issues, we refined the prompt to evaluate each criterion individually with added explanations, which improved response consistency. The higher precision score for the cost criterion further suggests that structured prompting can be effective for well-defined evaluation tasks.
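
The per-criterion prompting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prompt used in the study: the template wording, the criterion descriptions, and the helper names `build_prompts` and `parse_rating` are all hypothetical. The sketch builds one short prompt per criterion and parses the binary rating from the first line of each response.

```python
from typing import Dict, Optional

# Hypothetical template; the study's actual prompt wording is not reproduced here.
PROMPT_TEMPLATE = (
    "You are evaluating a health news article on one criterion.\n"
    "Criterion: {name}. {description}\n"
    "Answer on the first line with 'Satisfactory' or 'Not satisfactory', "
    "then explain your rating in a short paragraph.\n\n"
    "Article:\n{article}"
)


def build_prompts(article: str, criteria: Dict[str, str]) -> Dict[str, str]:
    """Return one prompt per criterion instead of a single prompt covering all of them."""
    return {
        name: PROMPT_TEMPLATE.format(name=name, description=description, article=article)
        for name, description in criteria.items()
    }


def parse_rating(response: str) -> Optional[str]:
    """Read the binary rating from the first line; return None if it cannot be parsed."""
    stripped = response.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].lower()
    # Check the negative form first, because it contains the word 'satisfactory'.
    if "not satisfactory" in first_line or "unsatisfactory" in first_line:
        return "not satisfactory"
    if "satisfactory" in first_line:
        return "satisfactory"
    return None
```

Returning `None` for unparseable responses makes formatting failures visible, so they can be counted or retried rather than silently treated as a rating.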

While prompt engineering techniques have proven effective in improving LLMs’ performance, their effectiveness may depend on the nature of the task. Few-shot (FS) prompting, for instance, provides the model with labeled examples to help it recognize patterns before evaluating new data [43]. However, this approach presents additional challenges in our study because there are many different ways in which a health news article can be considered satisfactory on a criterion. For example, for the cost criterion, a news article may be considered satisfactory if it mentions health insurance coverage or if it provides specific cost details. The phrasing, level of detail, and structure of these examples can vary significantly. The complexity only increases for other criteria. This makes it difficult to provide a representative set of few-shot examples that comprehensively capture the range of satisfactory cases without introducing bias. There are also other prompting techniques, such as Chain-of-Verification (CoVe) [44], Reversing Chain-of-Thought (RCoT) [45], and Self-Consistency [46], that aim to improve factual consistency and explanation quality. These methods have shown promise in reducing hallucinations and improving reliability, particularly in structured tasks like mathematical reasoning, logic problems, and fact-based Q&A. However, their effectiveness in open-ended, context-rich tasks like health news evaluation remains largely unexplored. Our study focused on assessing baseline performance with simple prompting, but future work may benefit from investigating whether and how such advanced prompting strategies can be adapted for more complex, real-world text analysis.
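
To make one of these strategies concrete, the sketch below shows the voting step of Self-Consistency [46] applied to a binary rating: the same prompt is sampled several times and the most frequent answer is kept. It was not part of our experiments. `sample_rating` is a hypothetical callable that would wrap one stochastic LLM call (for example, with a non-zero sampling temperature) and return a parsed label.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_rating(
    sample_rating: Callable[[], str],  # hypothetical: one sampled LLM rating per call
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Sample several ratings and return the majority label with its agreement rate."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = [sample_rating() for _ in range(n_samples)]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / n_samples
```

The agreement rate gives a crude signal of how stable the model's judgment is; a low rate could be used to route an article to human review. An odd `n_samples` avoids ties between the two labels.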

A notable limitation of this study is the exclusive use of the GPT-3.5 model. Although newer LLMs, such as GPT-4, have since been introduced, GPT-3.5 remains widely adopted and is still in line with current LLM capabilities. Its strong performance in natural language understanding and text generation made it a suitable choice for our research objectives at the time our experiments began. While newer models and alternative LLM architectures may offer incremental improvements, their core functionalities are largely consistent with GPT-3.5, and their inclusion would likely not have significantly altered our findings. Nevertheless, we acknowledge the rapid advancements in LLM technology and recommend that future studies compare multiple models to assess any potential differences. Additionally, future studies could explore whether advancements in newer LLMs meaningfully improve performance in health news evaluation. Finally, LLM use in health news evaluation raises ethical concerns, including AI-generated misinformation and bias. Future research should explore guidelines for responsible LLM deployment in health communication.

In summary, our study suggests that while GPT shows promise in health news quality assessment, it is still far from being able to help information consumers judge and explain health news.

Conclusions

In this paper, we present an empirical study testing the feasibility of utilizing GPT, as an example of a widely used LLM, to evaluate and explain the quality of health news from a dataset annotated by experts, covering a wide range of health topics. We assessed GPT’s accuracy in providing ratings based on the established quality criteria from HealthNewsReview.org and its ability to generate explanations to help laypeople interpret those ratings. A mixed-methods approach was employed to evaluate the readability, consistency, and contextual relevance of the explanations provided by GPT. Our findings reveal that while ChatGPT shows limitations in accurately rating health news compared to expert evaluations and supervised machine learning (SML) models, it excels in generating clear and comprehensible explanations. This suggests that GPT, and potentially other LLMs, can serve as valuable tools for enhancing the interpretability of AI systems designed to combat health misinformation, particularly by helping lay audiences understand the rationale behind quality ratings. Future research should focus on developing approaches to improve rating accuracy while leveraging the explanatory capabilities of LLMs in AI-driven health information assessments. Additionally, comparing and optimizing the performance of different LLMs will be crucial to identify the most effective models for both accurate health news evaluations and high-quality public communication. As LLMs continue to evolve, their integration into digital health information ecosystems should be explored to support public health literacy and combat misinformation on a broader scale.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (66.4KB, docx)

Acknowledgements

We acknowledge HealthNewsReview.org, founded and published by Gary Schwitzer, for its extensive work in reviewing and evaluating health news coverage from 2008 to 2022. Its rigorous assessments of completeness, accuracy, and balance in medical reporting provided the foundation for the expert-annotated dataset used in this study. Although the website was discontinued in mid-2022, its contributions have been invaluable in advancing health journalism and public health literacy.

Abbreviations

LLM: Large language model
AI: Artificial Intelligence
GPT: Generative Pre-trained Transformer
FP: False Positive
FN: False Negative
TP: True Positive
TN: True Negative
SMOG: Simple Measure of Gobbledygook
PA: Percentage Agreement
IRR: Inter-Rater Reliability
RAG: Retrieval Augmented Generation

Author contributions

X.L. initiated the research idea. X.L. and L.H. wrote the manuscript. E.A., E.L., A.G., and L.G. worked on the manual evaluation. A.G. conducted the pilot study. X.L., L.H., E.A., and E.L. reviewed the manuscript.

Funding

None.

Data availability

The data availability statement is provided within the manuscript.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. More than Half of American Households Used the Internet for Health-Related Activities in 2019, NTIA Data Show | National Telecommunications and Information Administration [Internet]. [cited 2024 Aug 24]. Available from: https://www.ntia.gov/blog/2020/more-half-american-households-used-internet-health-related-activities-2019-ntia-data-show
  • 2. Zhang Y, Sun Y, Xie B. Quality of health information for consumers on the web: A systematic review of indicators, criteria, tools, and evaluation results. Journal of the Association for Information Science and Technology. 66(10):2071–84. 10.1002/asi.23311
  • 3. Swire-Thompson B, Lazer D. Public health and online misinformation: challenges and recommendations. Annu Rev Public Health. 2020;41:433–51. 10.1146/annurev-publhealth-040119-094127
  • 4. Krishna A, Thompson TL. Misinformation About Health: A Review of Health Communication and Misinformation Scholarship. American Behavioral Scientist. 2021;65(2):316–32. 10.1177/0002764219878223
  • 5. Kim HK, Tandoc EC. Consequences of Online Misinformation on COVID-19: Two Potential Pathways and Disparity by eHealth Literacy. Front Psychol. 2022;13. https://www.frontiersin.org/journals/psychology/articles/
  • 6. Pierri F, Perry BL, DeVerna MR, Yang KC, Flammini A, Menczer F, et al. Online misinformation is linked to early COVID-19 vaccination hesitancy and refusal. Sci Rep. 2022;12(1):5966. 10.1038/s41598-022-10070-w
  • 7. Verma G, Bhardwaj A, Aledavood T et al. Examining the impact of sharing COVID-19 misinformation online on mental health. Sci Rep. 2022;12:8045. 10.1038/s41598-022-11488-y
  • 8. Vinck P, Pham PN, Bindu KK, Bedford J, Nilles EJ. Institutional trust and misinformation in the response to the 2018-19 Ebola outbreak in North Kivu, DR Congo: a population-based survey. Lancet Infect Dis. 2019;19(5):529–36.
  • 9. Nan X, Wang Y, Thier K. Why do people believe health misinformation and who is at risk? A systematic review of individual differences in susceptibility to health misinformation. Soc Sci Med. 2022;314:115398. 10.1016/j.socscimed.2022.115398
  • 10. Santos FCC. Artificial Intelligence in Automated Detection of Disinformation: A Thematic Analysis. Journalism and Media. 2023;4(2):679–87. 10.3390/journalmedia4020043
  • 11. Liu X, Alsghaier H, Tong L, Ataullah A, McRoy S. Visualizing the Interpretation of a Criteria-Driven System That Automatically Evaluates the Quality of Health News: Exploratory Study of 2 Approaches. JMIR AI. 2022;1(1):e37751. 10.2196/37751
  • 12. Ibrishimova MD, Li KF. A machine learning approach to fake news detection using knowledge verification and natural language processing. In: Barolli L, Nishino H, Miwa H, editors. Advances in intelligent networking and collaborative systems. Cham: Springer International Publishing; 2020. pp. 223–34. 10.1007/978-3-030-29035-1_22
  • 13. Meppelink CS, Hendriks H, Trilling D, van Weert JCM, Shao A, Smit ES. Reliable or not? An automated classification of webpages about early childhood vaccination using supervised machine learning. Patient Education and Counseling. 2021;104(6):1460–6. 10.1016/j.pec.2020.11.013
  • 14. Abdelminaam DS, Ismail FH, Taha M, Taha A, Houssein EH, Nabil A. CoAID-DEEP: An Optimized Intelligent Framework for Automated Detecting COVID-19 Misleading Information on Twitter. IEEE Access. 2021;9:27840–67. 10.1109/ACCESS.2021.3058066
  • 15. Elhadad MK, Li KF, Gebali F. Detecting Misleading Information on COVID-19. IEEE Access. 2020;8:165201–15. 10.1109/ACCESS.2020.3022867
  • 16. Al-Jefri M, Evans R, Lee J, Ghezzi P, Uchyigit G. Using machine learning for automatic identification of evidence-based health information on the web. Proceedings of the 2017 international conference on digital health. 2017;167–174. 10.1145/3079452.307947
  • 17. Shah Z, Surian D, Dyda A, Coiera E, Mandl KD, Dunn AG. Automatically Appraising the Credibility of Vaccine-Related Web Pages Shared on Social Media: A Twitter Surveillance Study. J Med Internet Res. 2019;21(11):e14007. 10.2196/14007
  • 18. Liu Y, Yu K, Wu X, Qing L, Peng Y. Analysis and detection of health-related misinformation on Chinese social media. IEEE Access. 2019;7:154480–9. 10.1109/ACCESS.2019.2946624
  • 19. Asr FT, Taboada M. Big Data and quality data for fake news and misinformation detection. Big Data & Society. 2019;6(1). 10.1177/2053951719843310
  • 20. Botnevik B, Sakariassen E, Setty V. BRENDA: Browser Extension for Fake News Detection. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020;2117–20. 10.1145/3397271.3401396
  • 21. Pareek S, van Berkel N, Velloso E, Goncalves J. Effect of Explanation Conceptualisations on Trust in AI-assisted Credibility Assessment. Proceedings of the ACM on Human-Computer Interaction. 2024;8(CSCW2):1–31. 10.1145/3686922
  • 22. Zhang Y, Liao QV, Bellamy RKE. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020;295–305. 10.1145/3351095.3372852
  • 23. Liu H, Lai V, Tan C. Understanding the Effect of Out-of-distribution Examples and Interactive Explanations on Human-AI Decision Making. Proc ACM Hum-Comput Interact. 2021;5(CSCW2):408:1–408:45. 10.1145/3479552
  • 24. Ayoub J, Yang XJ, Zhou F. Combat COVID-19 infodemic using explainable natural language processing models. Information Processing & Management. 2021;58(4):102569. 10.1016/j.ipm.2021.102569
  • 25. Ribeiro MT, Singh S, Guestrin C. Local Interpretable Model-Agnostic Explanations (LIME): An Introduction [Internet]. O’Reilly Media; 2016 [cited 2022 Feb 15]. Available from: https://www.oreilly.com/content/introduction-to-local-interpretable-model-agnostic-explanations-lime/
  • 26. Lundberg S. shap: A unified approach to explain the output of any machine learning model. [Internet]. [cited 2022 Feb 15]. Available from: http://github.com/slundberg/shap
  • 27. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. 10.1371/journal.pdig.0000198
  • 28. He Z, Bhasuran B, Jin Q, Tian S, Hanna K, Shavor C et al. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study. Journal of Medical Internet Research. 2024;26(1):e56655. 10.2196/56655
  • 29. Deiana G, Dettori M, Arghittu A, Azara A, Gabutti G, Castiglia P. Artificial Intelligence and Public Health: Evaluating ChatGPT Responses to Vaccination Myths and Misconceptions. Vaccines (Basel). 2023;11(7):1217. 10.3390/vaccines11071217
  • 30. Luo Y, Miao Y, Zhao Y, Li J, Chen Y, Yue Y et al. Comparing the Accuracy of Two Generated Large Language Models in Identifying Health-Related Rumors or Misconceptions and the Applicability in Health Science Popularization: Proof-of-Concept Study. JMIR Formative Research. 2024;8(1):e63188. 10.2196/63188
  • 31. Choi EC, Ferrara E. Automated Claim Matching with Large Language Models: Empowering Fact-Checkers in the Fight Against Misinformation. Companion Proceedings of the ACM Web Conference 2024. 2024;1441–9. 10.1145/3589335.3651910
  • 32. Vitamin D supplements may slow prostate cancer [Internet]. [cited 2024 Aug 24]. Available from: https://web.archive.org/web/20220705220359/https://www.healthnewsreview.org/review/vitamin-d-supplements-may-slow-prostate-cancer/
  • 33. Schwitzer G. How Do US Journalists Cover Treatments, Tests, Products, and Procedures? An Evaluation of 500 Stories. PLOS Medicine. 2008;5(5):e95. 10.1371/journal.pmed.0050095
  • 34. Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. 2022;35:24824–37.
  • 35. Grabeel KL, Russomanno J, Oelschlegel S, Tester E, Heidel RE. Computerized versus hand-scored health literacy tools: a comparison of Simple Measure of Gobbledygook (SMOG) and Flesch-Kincaid in printed patient education materials. J Med Libr Assoc. 2018;106(1):38–45. 10.5195/jmla.2018.262
  • 36. Elahi E, Morato J. Graphical Contents and Health Websites Readability. IEEE Access. 2023;11:140934–42. 10.1109/ACCESS.2023.3341493
  • 37. Elahi E, Iglesias A, Morato J. Web Images Relevance and Quality: User Evaluation. Proceedings of the 5th International Conference on Computer Science and Software Engineering. 2022;66–9. 10.1145/3569966.3569984
  • 38. Afsana F, Kabir MA, Hassan N, Paul M. Automatically assessing quality of online health articles. IEEE J Biomed Health Inf. 2021;25(2):591–601. 10.1109/JBHI.2020.3032479
  • 39. Al-Jefri M, Evans R, Lee J, Ghezzi P. Automatic Identification of Information Quality Metrics in Health News Stories. Front Public Health. 2020;8:515347. 10.3389/fpubh.2020.515347
  • 40. Emsley R. ChatGPT: these are not hallucinations – they’re fabrications and falsifications. Schizophrenia (Heidelb). 2023;9(1):1–2. 10.1038/s41537-023-00379-4
  • 41. Pandit S, Xu J, Hong J, Wang Z, Chen T, Xu K et al. MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models. arXiv. 2025. 10.48550/arXiv.2502.14302
  • 42. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems. 2020;9459–74. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
  • 43. Wang Y, Yao Q, Kwok JT, Ni LM. Generalizing from a Few Examples: A Survey on Few-shot Learning. ACM Comput Surv. 2020;53(3):63:1–63:34. 10.1145/3386252
  • 44. Dhuliawala S, Komeili M, Xu J, Raileanu R, Li X, Celikyilmaz A et al. Chain-of-Verification Reduces Hallucination in Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024;3563–3578. 10.18653/v1/2024.findings-acl.212
  • 45. Xue T, Wang Z, Wang Z, Han C, Yu P, Ji H. RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought. arXiv. 2023. 10.48550/arXiv.2305.11499
  • 46. Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv. 2023. 10.48550/arXiv.2203.11171

Articles from BMC Public Health are provided here courtesy of BMC
