Scientific Reports. 2026 Feb 25;16:7594. doi: 10.1038/s41598-026-39046-w

Large language models show Dunning-Kruger-like effects in multilingual fact-checking

Ihsan Ayyub Qazi 1, Zohaib Khan 1, Abdullah Ghani 1, Agha Ali Raza 1, Zafar Ayyub Qazi 1, Wassay Sajjad 1, Ayesha Ali 2, Asher Javaid 1, Muhammad Abdullah Sohail 1, Abdul Hameed Azeemi 1
PMCID: PMC12936108  PMID: 41741518

Abstract

The rise of misinformation underscores the need for scalable and reliable fact-checking solutions. Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Our methodology tests model generalizability on claims postdating training cutoffs and four prompting strategies mirroring both citizen and professional fact-checker interactions, with over 240,000 human annotations as ground truth. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence. This risks systemic bias in information verification, as resource-constrained organizations typically use smaller models. Performance gaps are most pronounced for non-English languages and claims originating from the Global South, threatening to widen existing information inequalities. These results establish a multilingual benchmark for future research and provide an evidence base for policy aimed at ensuring equitable access to trustworthy, AI-assisted fact-checking.

Keywords: Misinformation, Fact-checking, Large language models, Dunning-Kruger effect

Subject terms: Computational science, Computer science

Introduction

As large language models (LLMs) become increasingly integrated into everyday life, their influence on the global information ecosystem is both undeniable and expanding1–8. Popular search engines, such as Google Search, have begun experimenting with AI-generated summaries to accelerate information retrieval for users9. Independent fact-checking organizations are likewise incorporating generative AI into their verification processes to streamline the rapid evaluation of online claims10,11. However, these developments raise critical questions about the effectiveness of LLMs as fact-checking tools and their impact on information integrity, public trust, and democratic processes.

First, while LLMs excel at processing and summarizing information, they are prone to hallucinations, generating plausible, human-like yet entirely false content, which heightens the risk that misinformation could be amplified12–14. Compounding this issue is the widespread perception-reality gap, where many users hold inflated expectations about AI accuracy or remain unaware of known weaknesses15,16. A striking example involved a professor from a reputable institution who unknowingly included an LLM-generated reference list in expert testimony, later discovered to contain non-existent citations17. These incidents underscore how AI-generated misinformation can swiftly enter authoritative discourse, particularly when users trust AI outputs without verification.

Second, equitable access to advanced LLMs remains a significant challenge. Because more capable models are costly and less accessible, many users and smaller organizations rely on smaller, less reliable alternatives, heightening inaccuracy risks18,19. Compounding this challenge, LLMs frequently exhibit performance disparities in multilingual contexts, particularly for underrepresented languages20,21. These concerns resonate with recent platform-level interventions, such as Meta’s “Community Notes”, which rely on user-generated annotations to correct or contextualize misleading posts22. However, if contributors to these community-driven systems rely on LLM-generated content, automation bias (the tendency to over-rely on automated systems) can create feedback loops that amplify rather than curtail the spread of misinformation. This risk is heightened by the fact that AI-generated text is often indistinguishable from human-written content, while people consistently overestimate their ability to detect it23–25. Consequently, AI-generated falsehoods can spread widely before identification and correction.

Third, deploying AI-driven fact-checking systems presents further challenges, especially under regulatory frameworks such as the EU AI Act26. This initiative, among other international efforts, identifies certain AI applications as high-risk when they can significantly influence public trust and democratic processes. The classification as high-risk arises from a risk-based assessment defined in the Act, which considers the potential for societal harm, implications for fundamental rights, and the need for accountability. Such AI systems must therefore meet stringent standards for reliability and fairness to mitigate the risks associated with misinformation. However, the absence of benchmark datasets that feature human-validated LLM responses to fact-checking claims remains a significant barrier to achieving consistent calibration and standardization of these systems.

Current evaluations of LLMs for fact-checking use researcher-crafted prompts and structured response formats rather than studying natural user interactions27,28. These methods rely on predetermined response structures (such as mandated JSON responses with explanations)29,30, and utilize datasets with limited geographic scope31, inadequately representing global misinformation complexities or realistic user interactions. Even when assessing multilingual capabilities, evaluations often overlook the critical aspect of native language prompting, often using English instructions even for non-English claims32,33.

Our work addresses these gaps by introducing a methodology that mirrors real-world user experiences with global misinformation. Leveraging a dataset of real-world claims sourced globally, across diverse linguistic and regional contexts, we evaluate LLMs using simple, unstructured prompts (e.g., “Is this true?,” “Is this false?,” “Is this true or false?”) that mirror lay-person queries30,34–36. Moreover, we translate these prompts into the claim’s language for multilingual cases, capturing the nuances of native-language interaction. We also evaluate using a system prompt (instructions that define the model’s role and behavior) that adopts the persona of a professional fact-checker, inspired by International Fact-Checking Network (IFCN) guidelines37, and compare these results with typical user prompting. This approach assesses LLMs’ practical utility as accessible fact-checking tools across diverse linguistic and regional contexts, offering insights into their real-world impact on misinformation, a perspective largely unexplored in previous research.

To this end, we evaluate nine established models spanning multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) to assess their effectiveness as fact-checking tools. We employ four distinct prompting strategies and test the models on a high-quality, diverse dataset of 5,000 factual claims spanning multiple languages, domains, and countries across both the Global North and the Global South. To ensure accuracy and reliability, each model’s responses undergo review by multiple annotators. Beyond standard classification metrics, we employ additional measures, namely, Selective Accuracy, Abstention-Friendly Accuracy and Certainty Rate, to offer a more holistic perspective on how these models balance confidence and correctness.

Our findings reveal critical trade-offs in model behavior and provide actionable insights into improving the robustness and fairness of LLMs for scalable misinformation mitigation. In particular, we uncover trade-offs in fact-checking between small and large LLMs reminiscent of the Dunning-Kruger effect38, where smaller models display high confidence despite lower accuracy, while larger models show low confidence but high accuracy. This pattern has far-reaching implications for the information ecosystem. Specifically, when combined with the economic reality that resource-constrained organizations often rely on smaller models, it risks creating a systemic bias where the most accessible LLM-based fact-checking tools are also the most miscalibrated. As automated fact-checking systems proliferate, this dynamic could amplify existing information inequalities between well-resourced and resource-constrained contexts, underscoring the need for both technical solutions in confidence calibration and policy frameworks that ensure equitable access to reliable verification tools. Such insights, in turn, help shape best practices for the ethical and responsible integration of generative AI into the global fact-checking landscape.

Results

Study design, dataset, and models

The dataset of true and false claims used in this study was compiled using Google Fact Check (GFC) Explorer39, a tool that aggregates fact checks from various independent fact-checking organizations worldwide. Using the GFC Claim Search API, we fetched claims verified by 250 fact-checking organizations globally, yielding 302,288 claims. Data cleaning removed 60,741 duplicate claims, identified within or across publishers. While all fact-checking organizations use true/false categories, they vary significantly in their intermediate labels like “lacks context,” “partially true,” or “misleading.” Forcing these nuanced labels into binary categories risks misrepresenting ground truth. We addressed this by selecting only claims with consistent true/false ratings across all fact-checking organizations, excluding cases with intermediate labels.

Using a structured codebook based on professional fact-checker criteria, trained annotators ensured only claims with unambiguous true/false ratings were included. This resulted in 137,393 claims containing 125,215 false claims and 12,178 true claims. From this, we constructed an evaluation dataset by randomly sampling 5,000 claims, weighted by language distribution. We selected equal numbers of true and false claims per language when possible. Where true claims were insufficient, we included all available true claims and added randomly selected false claims to maintain the intended language distribution. The resulting evaluation set comprised 2,967 false and 2,033 true claims across 47 languages, fact-checked by 174 unique publishers, yielding a dataset that mirrors the complexity and diversity of the real-world information ecosystem. These claims spanned from March 2009 to June 2023. We evaluated nine LLMs on this dataset: Llama-2 (7B, 13B, 70B), Mistral (Mistral-7B, Mixtral-8x7B)40,41, and OpenAI’s GPT series (GPT-3.5, GPT-4, GPT-4o, o1-preview)42. To address potential data contamination from claims possibly seen during training, we also test model generalizability on claims published after their respective knowledge cutoffs (see Supplementary Section A.4).
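The language-weighted, label-balanced sampling described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the claim fields (`text`, `lang`, `label`) are assumed names.

```python
import random
from collections import defaultdict

def sample_balanced(claims, total=5000, seed=0):
    """Sample claims weighted by each language's share of the pool,
    balancing true/false within a language where possible; when true
    claims are scarce, take all of them and fill with false claims."""
    rng = random.Random(seed)
    by_lang = defaultdict(lambda: {"true": [], "false": []})
    for c in claims:  # each claim: {"text": ..., "lang": ..., "label": ...}
        by_lang[c["lang"]][c["label"]].append(c)

    n_all = sum(len(v["true"]) + len(v["false"]) for v in by_lang.values())
    sample = []
    for lang, pool in by_lang.items():
        # Per-language quota proportional to its share of all claims.
        quota = round(total * (len(pool["true"]) + len(pool["false"])) / n_all)
        n_true = min(quota // 2, len(pool["true"]))   # all true claims if scarce
        n_false = min(quota - n_true, len(pool["false"]))
        sample += rng.sample(pool["true"], n_true)
        sample += rng.sample(pool["false"], n_false)
    return sample
```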

We evaluated LLMs using four prompting strategies. Research shows typical users employ simple, direct prompts without prompt engineering techniques and rarely use structured output formats, despite their impact on model performance30,34–36. For example, “Hey @Grok, is this true?” has become a common query on the X platform34. Thus, to mirror real user behavior and assess robustness against minor variations in phrasing, we used three basic prompts with unstructured output:

  • Prompt 1: {claim}. Is this True?

  • Prompt 2: {claim}. Is this False?

  • Prompt 3: {claim}. Is this True or False?

  • Prompt 4: Specialized system-level prompt adopting a professional fact-checker persona with IFCN-aligned guidelines37 and structured JSON output (see Supplementary Fig. S1)

Due to cost constraints, we evaluated o1-preview only with the specialized system-level prompt.
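The four strategies can be summarized in a small prompt-construction sketch. This is an illustrative reconstruction: the suffix translations and the system-prompt text are placeholders, not the study's exact wording.

```python
# Language-specific user-instruction suffixes for strategies 1-3
# (the French entries here are placeholders for illustration).
SUFFIX = {
    "en": {1: "Is this True?", 2: "Is this False?", 3: "Is this True or False?"},
    "fr": {1: "Est-ce vrai?", 2: "Est-ce faux?", 3: "Est-ce vrai ou faux?"},
}
# Hypothetical stand-in for the IFCN-aligned fact-checker persona prompt.
SYSTEM_PROMPT = "You are a professional fact-checker..."

def build_prompt(claim: str, lang: str, strategy: int):
    """Return a (system, user) message pair for one of the four strategies."""
    if strategy in (1, 2, 3):
        # User-level prompts: the claim plus a suffix in the claim's language.
        return ("", f"{claim}. {SUFFIX[lang][strategy]}")
    # Strategy 4: persona goes in the system prompt; user turn is the raw claim.
    return (SYSTEM_PROMPT, claim)
```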

To mirror typical user interactions, the first three user-level prompts were presented in the claim’s original language. A French claim, for example, was prompted with “{claim in French}. Est-ce vrai?” instead of “{claim in French}. Is this True?”. All non-English model responses were translated into English using the Google Translate API. Given the unstructured nature of LLM responses to these user-level prompts, two independent annotators categorized them as ‘True,’ ‘False,’ or ‘Other’ (Fig. 1). A response was classified as ‘True’ or ‘False’ only when the LLM explicitly and unequivocally labeled the claim. Responses showing ambiguity or uncertainty, including terms like ‘probably’, ‘possibly’, or ‘partly’, or failing to provide a definitive true/false assessment, were categorized as ‘Other.’ We assessed inter-annotator agreement using Cohen’s Kappa. Across 24 annotation sets (representing three prompts for each of the eight LLMs, each containing 5,000 claims), the mean Cohen’s Kappa score was 0.84, with a minimum of 0.65, indicating substantial agreement. Annotation discrepancies were resolved through discussion, ensuring a consistent and reliable final dataset for analysis.
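Inter-annotator agreement of the kind reported above can be computed with a short Cohen's kappa routine; this is the standard textbook formula, not code from the study.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n                      # observed
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # by chance
    return (p_o - p_e) / (1 - p_e)
```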

Fig. 1.

Fig. 1

Overview of Prompting Strategies and Response Evaluation for LLM Fact-Checking. (Orange) The Prompt Constructor dynamically constructs a tailored prompt to elicit factual assessments from the LLM. The system first receives a factual claim, such as “Les vaccins provoquent le cancer.” Based on the selected prompting strategy (1, 2, or 3), the system appends a language-specific user instruction suffix, for example, “Est-ce vrai?” for French, ensuring the claim is framed appropriately for the model. Alternatively, if Prompting Strategy 4 is selected, the system prompt itself is modified to include the instruction (e.g., “You are a...”), while the user message contains only the raw claim without a language-specific suffix. In both cases, the system utilizes structured prompt tags like <|im_start|> and <|im_end|> to clearly define system, user, and assistant segments. The constructed prompt is then submitted to the LLM for inference. (Blue) The Evaluation module processes the LLM’s response. Responses are annotated as ‘True’, ‘False,’ or ‘Other,’ depending on the model’s output. The model may return either verbose explanations (e.g., “Cette affirmation est incorrecte...”) or structured outputs in JSON format (e.g., {“prediction”: “false”, “explanation”: “C’est...”}). Responses are then compared against gold-standard labels (True or False) for correctness. Aggregation metrics are computed to assess model accuracy across different prompting strategies. This pipeline enables comprehensive evaluation by combining prompt variation, multilingual support, structured output processing, and flexible annotation strategies.

Conversational versus specialized prompting effects on multilingual LLM fact verification

To evaluate the fact-checking performance of each model across the various prompts, we report three primary metrics: selective accuracy, abstention-friendly accuracy, and certainty rate. These metrics are rooted in the principles of selective classification, following the seminal work of Chow43,44, which allow models to abstain from making predictions when uncertain. Such an approach provides a more nuanced evaluation of model reliability and calibration, as established in prior work45,46. For additional robustness checks, we also report selective accuracy over claims common across all models (see Supplementary Fig. S3) as well as standard metrics of precision, recall, and F1-score, calculated over both selective (excluding ‘Other’) and complete sets of claims.

Selective accuracy is the proportion of definitive ‘True’ or ‘False’ responses that correctly match the ground-truth veracity label. Abstention-friendly accuracy measures the proportion of all claims where the LLM either provides a correct ‘True’ or ‘False’ verdict or abstains with an ‘Other’ response. This metric is particularly valuable in sensitive domains like law, medicine, and security, where incorrect predictions can have serious consequences47–50. The certainty rate (coverage in the selective classification literature) represents the proportion of all claims for which the LLM provided a definitive ‘True’ or ‘False’ response. Our use of certainty rate as a behavioral proxy for the confidence of black-box LLMs addresses the limitations of existing confidence estimation techniques for real-world fact-checking tasks involving free-text evaluations. First, directly querying a model for its confidence score is unreliable because LLMs are demonstrably poorly calibrated, which compromises the validity of self-reported uncertainty estimates51,52. Second, while metrics derived from token-level log-probabilities are useful for fixed-label tasks (e.g., a single ‘True’ or ‘False’ token), they perform poorly on free-form generation. This degradation occurs because such metrics become undefined when multiple semantically equivalent phrasings are acceptable, a common characteristic of the unstructured, free-text outputs produced by our user-level prompts15. Moreover, the P(True) self-evaluation method15,53, a sophisticated workaround for free-text outputs, is poorly calibrated, especially for lengthy responses, as models are often overconfident when assessing their own outputs53.
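Under these definitions, the three metrics can be computed directly from annotated responses. A minimal sketch (the label strings mirror the annotation scheme above):

```python
def fact_check_metrics(preds, gold):
    """preds: annotated responses, 'True'/'False'/'Other';
    gold: ground-truth labels, 'True'/'False'.
    Returns (selective accuracy, abstention-friendly accuracy, certainty rate)."""
    n = len(preds)
    definitive = [(p, g) for p, g in zip(preds, gold) if p != "Other"]
    correct = sum(p == g for p, g in definitive)
    # Selective accuracy: correct share of definitive verdicts only.
    selective_acc = correct / len(definitive) if definitive else float("nan")
    # Abstention-friendly: correct verdict OR abstention counts as success.
    abstention_acc = (correct + (n - len(definitive))) / n
    # Certainty rate ("coverage"): share of claims given a definitive verdict.
    certainty = len(definitive) / n
    return selective_acc, abstention_acc, certainty
```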

Prompting strategy is strongly correlated with model performance, affecting both accuracy and confidence in responses (Fig. 2). The system-level prompt consistently achieves the highest selective accuracy across all models by providing structured reasoning guidelines that reduce ambiguity. However, system prompts create a trade-off for smaller models: while improving selective accuracy, they reduce abstention-friendly responses. This suggests smaller models over-interpret structured directives, committing to verdicts even when abstention would be safer, unlike larger models that maintain more discerning abstention behavior. Regarding certainty rates, smaller models (Llama-2, Mistral) show substantial increases in certainty rate with system prompts, while larger models (GPT-family) show minimal differences. For larger, more sophisticated models, the system prompt enhances judgment quality (accuracy) rather than simply increasing prediction quantity (certainty), likely due to a combination of their inherent capabilities and the prompt’s structured guidance. On average, larger models also demonstrate higher precision than smaller models. While recall shows a similar improving trend with model size when ‘Other’ responses are excluded, its value is lower when ‘Other’ responses are included, as larger models that abstain are penalized in this calculation.

Fig. 2.

Fig. 2

Performance of LLMs across four prompting strategies. The figure provides a detailed breakdown of model performance. (Top-left) Heatmap of selective accuracy, showing the proportion of correct verdicts among all definitive ‘True’/‘False’ responses. (Top-right) Standard classification metrics (Precision, Recall, and F1 Score) calculated exclusively on this same subset of definitive responses (‘Other’ excluded), alongside Selective Accuracy aggregated across prompts. (Middle-left) Heatmap of abstention-friendly accuracy, which measures the proportion of all claims where the model was either correct or appropriately abstained (‘Other’). (Middle-right) Standard classification metrics calculated across all claims (‘Other’ included), alongside Abstention-Friendly Accuracy aggregated across prompts. Note that this penalizes the Recall and F1 Score of more cautious models that abstain frequently. (Bottom-left) Heatmap of certainty rate, representing the proportion of all claims for which models provided a definitive ‘True’/‘False’ response. Refer to Supplementary Fig. S4, Supplementary Fig. S5, and Supplementary Fig. S6 for pair-wise statistical comparisons.

In summary, user-level prompts yield lower selective accuracy than a structured system prompt but maintain better abstention behavior in smaller models. Among user-level approaches, the “{claim}. Is this True or False?” prompt moderately improves both selective accuracy and confidence across most models, likely through framing effects that orient outputs toward explicit options.

Dunning-Kruger effect in LLM fact-checking: capability inversely correlates with confidence

Model size creates a trade-off between selective accuracy and certainty rate (representing an LLM’s self-perceived confidence), which is illustrated in Fig. 3. Larger models, such as GPT-4o, achieve high selective accuracy, reaching up to 89%, excelling in definitive judgments when they commit. However, these models also display a more pronounced tendency toward epistemic caution, evidenced by certainty rates below 40% when confronting complex or ambiguous claims. While this cautious response pattern may reflect better statistical calibration to knowledge limitations, these higher abstention rates can substantially impair operational efficiency in real-world fact-checking applications where timely decisions are crucial. Smaller models, like Llama-7B, demonstrate much higher certainty rates, reaching up to 88%, providing definitive responses across a wider range of claims. However, their confidence is often misplaced due to limited training data and reasoning capabilities, resulting in lower selective accuracy, which falls around 60%. While prompting strategies help mitigate some of these trade-offs, core differences between small and large models remain.

Fig. 3.

Fig. 3

The relationship between actual ability and perceived ability reveals a statistical pattern across model architectures similar to the Dunning-Kruger effect. (Top-left) Performance quartile analysis comparing actual ability (selective accuracy) to perceived ability (certainty rate) averaged across all prompts. Less capable models (Llama-7B, Llama-13B, Mistral-7B) systematically overestimate their abilities, while more advanced models (e.g., GPT-4, GPT-4o) tend to underestimate theirs. This inverse relationship persists across multiple evaluation frameworks: when using a common set of claims for selective accuracy (top-right), when measuring abstention-friendly accuracy that penalizes confident misinformation propagation (bottom-left), and when benchmarking against MMLU scores (bottom-right). The statistical pattern indicates that model scale correlates with calibrated self-assessment, with larger architectures producing confidence estimates that better align with their actual capabilities. Certainty rates and accuracy measures represent averaged values across prompts. Error bars show 95% confidence intervals (two-sided tests without adjustments for multiple comparisons). Similar trends hold for each prompt (see Supplementary Fig. S7).

This tradeoff resembles the Dunning-Kruger effect, where people with lower competence overestimate their abilities, while those with higher competence underestimate theirs38. LLMs display similar statistical patterns when selective accuracy and certainty rates are used as measures of actual and perceived ability, respectively, with smaller models showing high confidence despite lower accuracy, while larger models hedge responses even when correct (Fig. 3). Smaller models like Llama-7B frequently commit to definitive but incorrect answers, likely due to limited training data and reasoning abilities. In contrast, larger models like GPT-4 exhibit greater hesitation with lower certainty rates. The confidence–competence trade-off holds consistently even within model families. We observe, for example, that Llama-70B is more accurate than Llama-7B (0.72 vs. 0.65 selective accuracy) yet less decisive (0.59 vs. 0.74 certainty rate). This pattern is mirrored by Mixtral-8x7B, which is more accurate than Mistral-7B (0.69 vs. 0.56) but also less decisive (0.51 vs. 0.62 certainty rate), suggesting the finding is not limited to inter-family comparisons. Robustness checks confirm this pattern persists across multiple benchmarks: when evaluating only claims where all models provide definitive responses, when using MMLU scores as ability measures, and when assessing the proportion of correct or appropriately uncertain responses, a critical metric for preventing confident misinformation propagation. This cautious approach enables superior selective accuracy in larger models. However, this tendency may hinder their effectiveness in applications requiring rapid fact verification. This also reveals a core challenge in LLM design: smaller models, though decisive, risk propagating errors due to overconfidence, while larger models, though precise, exhibit excessive statistical caution. The Dunning-Kruger framework illuminates these patterns, reinforcing the need to optimize both certainty rate and selective accuracy for robust fact-checking systems.
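The within-family figures quoted above make the confidence-competence gap easy to compute directly. A small illustrative calculation, using only the numbers in the text:

```python
# (selective accuracy, certainty rate) per model, as quoted in the text.
models = {
    "Llama-7B":     (0.65, 0.74),
    "Llama-70B":    (0.72, 0.59),
    "Mistral-7B":   (0.56, 0.62),
    "Mixtral-8x7B": (0.69, 0.51),
}
# gap = certainty - accuracy. Positive gap: the model answers definitively
# more often than it is right (overconfidence); negative gap: accuracy
# outstrips decisiveness (epistemic caution).
gap = {name: round(cert - acc, 2) for name, (acc, cert) in models.items()}
```

Within each family, the smaller model has a positive gap and the larger model a negative one, mirroring the Dunning-Kruger-like pattern.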

OpenAI’s o1-preview model, leveraging enhanced reasoning capabilities, achieves high selective accuracy (84%), abstention-friendly accuracy (88%), and certainty rate (72%), with statistically significant differences compared to almost all other models (see Fig. 4a; for statistical comparisons, see Supplementary Fig. S15, Supplementary Fig. S16, and Supplementary Fig. S17). By narrowing the gap between accuracy and confidence, o1-preview establishes a new benchmark for robust claim verification and demonstrates how improvements in LLM reasoning capabilities can help overcome limitations of non-CoT-based LLMs. These improvements suggest that architectural enhancements can help bridge this divide, potentially enabling AI-driven fact-checking systems that are both reliable and appropriately confident. This balance is crucial for developing AI systems capable of effectively supporting misinformation detection and verification workflows. While o1-preview delivers significant performance improvements, this comes at a substantial computational cost of $88.75 per 1,000 claims (Fig. 4b), limiting its viability for cost-sensitive applications. In contrast, smaller models like Llama-2-7B and Mistral-7B cost less than $0.10 per 1,000 claims but offer limited accuracy. GPT-4o, at $2.22 per 1,000 claims, strikes the best balance between cost and performance, making it the most practical option for scalable, reliable fact-checking among the models evaluated.

Fig. 4.

Fig. 4

Comparison of model performance and cost analysis. (a) Performance comparison across models using the system-level prompt versus o1-preview. o1-preview achieves both high accuracy and high certainty rate while narrowing the gap between them, indicating that reasoning-enhanced models may be effective for automatic fact-checking. (Top) Selective accuracy comparisons using McNemar’s test (with adjustments for multiple comparisons using the Holm-Bonferroni method) show statistically significant differences between o1-preview and all other models except GPT-4o and GPT-4. Analysis included correct labels (1) and incorrect labels (0), with sample sizes ranging from 1,156 to 4,394 claims. Only definitive responses (i.e., ‘True’ or ‘False’) common across pairs were included. (Bottom) Abstention-friendly accuracy and certainty rate comparisons using McNemar’s test (with adjustments for multiple comparisons using the Holm-Bonferroni method) also show statistically significant differences between o1-preview and all other models. (b) Estimated API-based inference costs (USD) for verifying 1,000 claims. Costs were calculated based on the actual token usage in our experiments and API pricing as of November 29, 2024. Open-source models were run using commercial inference APIs (i.e., Together AI and DeepInfra), and their respective pricing was used for cost estimation. Together, the panels illustrate the significant performance-cost trade-off.
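The per-1,000-claim cost figures in panel (b) follow from simple token arithmetic. The sketch below shows the calculation; prices are passed in as parameters because the study's exact November 2024 rates are not reproduced here.

```python
def cost_per_1k_claims(in_tokens, out_tokens, price_in, price_out):
    """Estimated USD cost of verifying 1,000 claims.
    in_tokens/out_tokens: average tokens per claim (prompt and response);
    price_in/price_out: USD per 1 million tokens (provider-dependent)."""
    per_claim = (in_tokens * price_in + out_tokens * price_out) / 1e6
    return round(per_claim * 1000, 2)
```

For example, with illustrative rates of $2.50/1M input tokens and $10.00/1M output tokens, a claim averaging 500 input and 200 output tokens costs about $3.25 per 1,000 claims.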

Geographic and linguistic disparities in LLM fact-checking performance

To evaluate linguistic diversity in fact-checking, we focused on the four languages with the largest sample sizes in our dataset to ensure statistical power (see Fig. 5): English (n = 1,769), Portuguese (n = 734), Spanish (n = 525), and Hindi (n = 399); see Supplementary Figs. S21-S23 for the remaining languages. Our linguistic analysis focuses on the average performance across prompts 1, 2, and 3, which provides a robust estimate of multilingual fact-checking capabilities by averaging out prompt-specific variations while maintaining consistent experimental conditions. These three prompts employ naturalistic variations of the same core task and are administered in the native language of each claim, representing the ecologically valid multilingual use case. This approach allows us to identify systematic cross-linguistic performance differences rather than artifacts of particular prompt formulations. The system-level prompt is excluded from this average because it represents a fundamentally different experimental condition, using English throughout regardless of the original claim language, making it unsuitable for assessing native-language fact-checking performance.

Fig. 5.

Fig. 5

Selective accuracy, abstention-friendly accuracy, and certainty rate of models across the top four languages (English, Spanish, Hindi, and Portuguese) in our dataset. Each cell represents the average score across prompts 1, 2, and 3. The ‘Average’ row denotes the average performance of all models for that language. Refer to Supplementary Fig. S12, Supplementary Fig. S13, Supplementary Fig. S14, and Supplementary Tables S2, S3 for pair-wise statistical comparisons and their exact sample sizes.

We find that models exhibit a statistically significant decrease in selective accuracy for Portuguese and Hindi claims relative to English claims (pairwise Chi-Squared tests with Holm-Bonferroni correction; adjusted p-values significant for both comparisons), with decreases of up to 4.3%. Conversely, abstention-friendly accuracy generally increases for most models, especially smaller ones (Llama-7B, Llama-13B, Llama-70B), when evaluating non-English claims. This trend aligns with a significant reduction in their certainty rate compared to English (pairwise Chi-Squared tests with Holm-Bonferroni correction; adjusted p-values significant for all non-English languages), indicating a beneficial tendency for these models to appropriately default to ‘Other’ responses when facing linguistic or cultural unfamiliarity, thereby mitigating confident incorrect predictions. While larger models (e.g., the GPT family) also exhibit relatively lower certainty rates across non-English languages compared to smaller models (e.g., 27% for GPT-4 vs. 82% for Llama-7B on Portuguese claims; pairwise McNemar’s test with Holm-Bonferroni correction; see Supplementary Fig. S14), this lower certainty does not correspond to lower accuracy. Despite their more cautious tendency, larger models demonstrate significantly higher selective accuracy than smaller models (81% for GPT-4 vs. 63% for Llama-7B; pairwise McNemar’s test with Holm-Bonferroni correction; see Supplementary Fig. S12). Furthermore, the confidence-competence paradox persists across the top four languages (Supplementary Fig. S20), though reduced certainty rates for Spanish and Hindi suggest that linguistic unfamiliarity induces more cautious behavior even in smaller models.
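The pairwise McNemar tests with Holm-Bonferroni correction used throughout this analysis can be sketched in a few lines. This is a standard statistical recipe, not the authors' code; the chi-square(1) tail probability is computed via the complementary error function.

```python
import math

def mcnemar_p(b, c):
    """McNemar's chi-square test (with continuity correction) from the two
    discordant counts b and c (claims where exactly one model is correct).
    Returns the p-value from the chi2(1) tail."""
    if b + c == 0:
        return 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 degree of freedom, the chi2 survival function is erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(stat / 2))

def holm_bonferroni(pvals, alpha=0.05):
    """Holm's step-down correction: test p-values in ascending order against
    alpha / (m - rank); stop at the first failure. Returns reject flags."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break
    return reject
```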

To evaluate regional disparities in model performance, we compared the fact-checking accuracy of multiple models on claims categorized as relevant to the Global North or Global South. Two independent annotators classified 5,000 claims into three categories: Global North, Global South, or Indistinguishable. Inter-annotator reliability was high (Cohen’s Kappa = 0.802), with discrepancies resolved through consensus to ensure classification rigor. After excluding 663 indistinguishable claims, the final dataset comprised 2,205 Global North and 2,131 Global South claims. For equitable evaluation, 1,000 claims were randomly sampled from each region, balanced for true/false labels. Most models performed better when evaluating Global North claims compared to Global South claims (Fig. 6). Selective accuracy declined notably for Mixtral-8x7B, Llama-70B, o1-preview, Llama-7B, and Llama-13B (6.2%−12.1%). OpenAI’s GPT-4o undergoes modest degradation (3.9%) as well, although it is not statistically significant at the 5% level. While we observe a fair increment in abstention-friendly accuracy for Global South claims for larger models (GPT-4o and GPT-3.5), it decreases for smaller models (Llama-7B and Llama-13B), suggesting that larger models better adapt to abstaining when uncertain. Notably, model size correlated with certainty rate disparities: larger models (GPT-4o, GPT-4, GPT-3.5, Mixtral-8x7B, Llama-70B) generally exhibited substantial degradations (22.0%–28.4%) in certainty rate when compared to smaller models (Llama-7B, Llama-13B) that showed smaller declines (11.7%–12.0%). While such patterns may indicate a tendency towards caution, particularly in larger models, the resultant lower certainty rates lead to fewer definitive judgments, thereby compromising the operational efficiency essential for resource-constrained fact-checking in the Global South. 
These trends, which are consistent with the pairwise statistical comparisons reported in Supplementary Table S4, may reflect training data skews toward Global North contexts or regional biases in model development. However, o1-preview demonstrated reduced disparities (a 12.4% degradation in certainty rate and a 6.2% degradation in selective accuracy), rivaling smaller models. This indicates that scaling model size and complexity need not compromise fairness.
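Inter-annotator agreement of the kind reported above can be computed in a few lines of pure Python. The sketch below uses illustrative labels, not the study's actual annotations:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: inner product of the two annotators' marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Toy example with the study's three categories (illustrative data only).
a = ["North", "South", "South", "Indist.", "North", "South"]
b = ["North", "South", "North", "Indist.", "North", "South"]
print(round(cohens_kappa(a, b), 3))  # -> 0.739
```

A kappa of 0.802, as reported, falls in the range conventionally read as substantial to almost-perfect agreement.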

Fig. 6. Most models perform worse on claims relevant to the Global South than on those relevant to the Global North. Green bars illustrate the percentage decrease in certainty rate, brown bars the percentage decrease in selective accuracy, and blue bars the percentage decrease in abstention-friendly accuracy when models evaluate Global South claims relative to Global North claims. The evaluation used equal subsets of 1,000 claims per region, balancing true and false claims within each subset. Error bars represent 95% confidence intervals (two-sided tests).

Discussion

Our findings reveal that common user-level prompts reduce LLM confidence and selective accuracy in claim evaluation compared to structured system-level prompts. This suggests prior studies may have overestimated LLM performance by not mirroring actual user interactions. One potential solution is fine-tuning LLMs on paired user-system prompts for fact-checking, enabling models to apply system-level strategies even with simple user inputs. Beyond prompting strategies, significant information quality risks emerge with smaller, more cost-effective models favored in consumer applications. These models exhibit a “confidence-competence paradox”, maintaining high confidence despite lower accuracy. Their widespread use could undermine trust in automated systems and worsen misinformation issues. To mitigate these risks, LLMs should be calibrated to withhold verdicts until sufficient evidence is available.

This calibration challenge connects to a broader issue: our results reveal a “truth gap,” where effective fact-checking depends on substantial compute resources. This disparity risks deepening existing information inequalities as resource-rich institutions gain access to more sophisticated and reliable tools while resource-constrained contexts rely on less accurate and potentially overconfident alternatives. Addressing this requires model optimization, subsidized access to advanced systems, or collaborative resource-sharing frameworks.

Implementing these solutions presents system designers with significant challenges, particularly in calibrating model confidence for smaller models. Without proper calibration, overconfidence can propagate inaccuracies in high-stakes domains. Implementing confidence thresholds or human-in-the-loop mechanisms may help mitigate these risks, ensuring models provide definitive answers only when sufficiently certain. These measures are key to deploying reliable systems at scale.
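A minimal sketch of such a confidence threshold is shown below; the threshold value and the probability interface are illustrative assumptions, not part of any deployed system:

```python
def thresholded_verdict(label_probs, threshold=0.75):
    """Return a definitive verdict only when the model's top label
    probability clears the threshold; otherwise abstain with 'Other'.

    label_probs: hypothetical dict mapping 'True'/'False' to
    model-assigned probabilities (however those are obtained).
    """
    label, prob = max(label_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else "Other"

print(thresholded_verdict({"True": 0.91, "False": 0.09}))  # confident -> True
print(thresholded_verdict({"True": 0.55, "False": 0.45}))  # uncertain -> Other
```

Raising the threshold trades certainty rate for selective accuracy, which is exactly the trade-off the three metrics in Online Methods are designed to expose; a human-in-the-loop system would route the 'Other' cases to professional fact-checkers.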

The confidence-accuracy relationship becomes even more complex across languages. LLMs show lower certainty and selective accuracy when assessing non-English claims, with larger models demonstrating a more cautious yet ultimately more accurate approach than smaller, overconfident models. This language disparity has significant implications for non-English-speaking regions: the increased caution of models on non-English claims may compromise their effectiveness, while the confident errors that do occur inadvertently facilitate misinformation.

These linguistic disparities extend to geographical differences as well. Claims pertaining to the Global South showed diminished performance compared to those of the Global North, indicating potential training data skew. Surprisingly, the disparity in certainty rates between regions was most pronounced for the largest models evaluated. While such a calibration may be desirable to avoid the propagation of misinformation, it reflects a need to prioritize calibration with balanced linguistic subsets and equitable training data for future developments. In this regard, o1-preview emerged as an exception, showing relatively low disparities despite its size, suggesting architectural advancements may address this limitation.

These performance differences raise important regulatory considerations. For example, under the EU AI Act, fact-checking tools are typically classified as high-risk applications due to their impact on public trust and democratic processes, requiring them to meet rigorous reliability and fairness standards. The Act mandates human oversight, crucial for mitigating risks from both overconfident small models and overly cautious large ones, while ensuring accountability. Transparency requires disclosure of model limitations and biases, enabling users to critically evaluate AI-generated fact-checks. The “truth gap” we identified underscores the need for equitable access to high-performing systems through targeted subsidies or cost-efficient alternatives, particularly in resource-constrained regions. By adhering to these regulatory principles, fact-checking systems can achieve the reliability, transparency, and accessibility necessary to foster a trustworthy information ecosystem.

Given these regulatory and performance considerations, the observed trade-offs in model behavior raise questions about replacing professional fact-checkers with proprietary models. While larger models demonstrate high accuracy, their cautious nature may hinder quick judgments, and smaller models lack necessary precision. Rather than replacement, AI systems should complement human workflows, leveraging professional fact-checkers’ domain expertise and contextual understanding.

While our analysis provides valuable insights into these challenges, several limitations must be acknowledged. Our study relies on claims previously vetted by professional fact-checkers, potentially missing subtle misinformation that escaped formal scrutiny. Our assessment of post-training claims, though valuable for temporal analysis, may not capture how rapidly evolving contexts transform misinformation dynamics. In addition, although we use a controlled prompting setup for our experiments, we acknowledge that prompting behavior may vary across users and languages; examining how native speakers across languages naturally prompt LLMs for fact-checking, and whether such variations meaningfully affect model performance, remains an important direction for future work. Moreover, while our current study intentionally focuses on evaluating the inherent knowledge and reasoning capabilities of LLMs in isolation, aiming to establish a baseline understanding of their multilingual fact-checking performance without external augmentation, we acknowledge that incorporating external sources (e.g., structured databases or the Web) is a valuable extension and remains a key direction for future work. Furthermore, the research does not explore how different cultural epistemologies and regional information ecosystems might influence what constitutes effective fact-checking beyond Western frameworks, potentially limiting the applicability of its conclusions in diverse global settings. Additionally, our research offers limited insight into LLM performance on multimodal misinformation and under adversarial conditions where false information is deliberately crafted to evade detection. Lastly, although we evaluate four distinct prompting strategies and observe consistent patterns across all variants (Supplementary Fig. S7), future work could systematically investigate prompt sensitivity, including how abstention framing in the system-level prompt and tag placement (e.g., “Is this true?” before or after the claim) in user prompts affect model behavior.

Online Methods

Dataset collection

To compile a comprehensive dataset of fact-checked claims, we utilized the Google Fact Check (GFC) API, which enables querying via keyword searches or retrieval of all claims verified by a specific fact-checking publisher. Our initial approach involved keyword searches across broad categories (Economy, Culture, Business, Education, Politics, Science, Health, and Religion), without language restrictions. We then compiled a list of the fact-checking publishers associated with the resulting claim entries. This list was augmented by incorporating all fact-checking publishers represented in the research dataset of claims obtained from the GFC Markup Tool available on Data Commons. The resulting list, comprising 250 fact-checking publisher websites from across the globe, was used to query the GFC API, yielding a collection of all claims verified by them.
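As an illustration, a publisher query against the GFC API's claims:search endpoint could be built and its response flattened as follows. The parameter and field names reflect the public API documentation at the time of writing and should be verified against current docs; the mock response is fabricated for the example:

```python
import urllib.parse

API_ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def build_query_url(api_key, publisher_site, page_size=100):
    """Build a claims:search URL retrieving claims reviewed by one publisher."""
    params = {
        "reviewPublisherSiteFilter": publisher_site,
        "pageSize": page_size,
        "key": api_key,
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)

def extract_claims(response):
    """Flatten a claims:search JSON response into (text, rating, publisher) rows."""
    rows = []
    for claim in response.get("claims", []):
        for review in claim.get("claimReview", []):
            rows.append((claim.get("text"),
                         review.get("textualRating"),
                         review.get("publisher", {}).get("site")))
    return rows

# Minimal mock response mirroring the API's documented shape.
mock = {"claims": [{"text": "Example claim.",
                    "claimReview": [{"publisher": {"site": "chequeado.com"},
                                     "textualRating": "False"}]}]}
print(extract_claims(mock))
```

In practice one would page through results with the API's pagination token and map each publisher's textual rating onto the study's True/False labels.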

Evaluation framework

When evaluating fact-checking models, one of the key challenges is how they handle ambiguous claims or uncertain situations. Our evaluation framework adopts the principles of selective classification, where a model can choose to abstain from making a definitive prediction (by responding with ‘Other’) when it is uncertain. This makes it essential to assess not just whether models are accurate when they commit to a decision, but also how they balance confidence, correctness, and abstention. To provide a holistic view, we use three primary metrics. We use a two-letter notation to represent outcomes: the first letter indicates the ground truth (T for True, F for False), and the second indicates the model’s judgment (T, F, or O for Other). For example, TF represents a “false negative”, a true claim that the model incorrectly labeled as false.

Selective Accuracy: This metric assesses the reliability of the model’s definitive judgments. It calculates accuracy only on the subset of claims where the model did not abstain. Selective accuracy is particularly significant in the context of misinformation, where confidently incorrect answers (i.e., low selective accuracy) are more damaging than uncertainty.

Selective Accuracy = (TT + FF) / (TT + FF + TF + FT)

Abstention-Friendly Accuracy (AFA): This metric provides a broader measure of model reliability by rewarding both correct definitive verdicts and appropriate abstentions. It is defined as the proportion of all claims where the model either provides a correct ‘True’ or ‘False’ judgment or provides an ‘Other’ response. This metric is especially useful in sensitive domains where a confidently wrong answer (TF or FT) is significantly worse than abstaining, as it penalizes models for making high-confidence errors.

Abstention-Friendly Accuracy = (TT + FF + TO + FO) / (TT + TF + TO + FT + FF + FO)

Certainty Rate: This metric, also known as coverage in selective classification literature, measures how frequently a model provides a definitive ‘True’ or ‘False’ judgment rather than abstaining. A higher certainty rate indicates a more decisive model, while a lower rate reflects a more cautious one.

Certainty Rate = (TT + TF + FT + FF) / (TT + TF + TO + FT + FF + FO)

Together, these three metrics offer a comprehensive perspective on model performance, capturing the critical trade-offs between decisiveness (certainty rate), precision on committed answers (selective accuracy), and overall safety (abstention-friendly accuracy).
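The three metrics follow directly from the outcome counts defined by the two-letter notation; a minimal sketch, with illustrative counts:

```python
def evaluation_metrics(counts):
    """Compute the three selective-classification metrics from outcome counts.

    counts: dict keyed by two-letter outcomes -- the first letter is the
    ground truth (T/F), the second the model's judgment (T/F/O for 'Other').
    """
    n = sum(counts.values())
    correct = counts["TT"] + counts["FF"]    # correct definitive verdicts
    wrong = counts["TF"] + counts["FT"]      # confidently incorrect verdicts
    abstained = counts["TO"] + counts["FO"]  # 'Other' responses
    return {
        "selective_accuracy": correct / (correct + wrong),
        "abstention_friendly_accuracy": (correct + abstained) / n,
        "certainty_rate": (correct + wrong) / n,
    }

# Illustrative tallies for 100 claims: 70 correct, 15 wrong, 15 abstentions.
m = evaluation_metrics({"TT": 40, "TF": 5, "TO": 5, "FT": 10, "FF": 30, "FO": 10})
print(m)  # selective_accuracy ~0.824, AFA 0.85, certainty_rate 0.85
```

Note how abstaining raises abstention-friendly accuracy and lowers certainty rate while leaving selective accuracy untouched, which is precisely the trade-off the metrics are meant to separate.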

Regional classification

To evaluate Large Language Model (LLM) performance across Global North and South regions, we developed a three-phase framework for categorizing fact-checked claims. Phase 1 (Direct Geographical Assignment): Claims containing explicit geographical or sociopolitical markers (e.g., country names, localized events, or region-specific institutions) were directly assigned to the Global North or South based on the identified location. Phase 2 (Contextual Attribution): Claims lacking explicit markers but implying locale-specific relevance (e.g., “We are one of the four countries that consume more sugar in the world”) were attributed to the operational country of the originating fact-checking organization (e.g., Chequeado in Argentina, classified as Global South). Phase 3 (Exclusion of Indeterminate Claims): Claims lacking both explicit and implicit regional associations, such as universal scientific facts (e.g., “Water boils at 100 °C”), were classified as “Indistinguishable” and excluded from regional analysis. Regional classification (Global North/South) utilized the Organization for Women in Science for the Developing World (OWSD) country list [54]. This hierarchical approach, prioritizing explicit markers and contextual provenance, minimizes misclassification and enables robust cross-regional comparisons of LLM performance. The tagging was performed by two university students, ages 20 and 21, in their third and fourth years, fluent in English.

Statistical methods

To facilitate statistical analysis, we defined three binary performance metrics. First, selective accuracy was operationalized as a binary variable where a model’s definitive response (i.e., ‘True’ or ‘False’) was coded as 1 if correct and 0 if incorrect; claims where a model abstained (i.e., responded ‘Other’) were excluded from this metric for that model. Second, abstention-friendly accuracy was coded as 1 if a model’s response was either correct or an abstention, and 0 otherwise. Third, the certainty rate was coded as 1 if a model provided a definitive response and 0 if it abstained.

We conducted three distinct sets of pairwise comparisons for model performance. For all comparisons, we performed a family-wise error rate correction using the Holm-Bonferroni method to account for multiple comparisons. First, to compare model performance across prompts or between different models within a given language, we employed the two-sided McNemar’s test, which is appropriate for paired nominal data, as each comparison was conducted on the same set of claims. Second, for cross-language comparisons (e.g., GPT-4 English vs. GPT-4 Spanish) and cross-regional comparisons (e.g., GPT-4 Global-North vs. GPT-4 Global-South), we used Pearson’s Chi-Squared test of independence. The Chi-Squared test was only performed if all expected cell frequencies in the 2×2 contingency table were 5 or greater to ensure test validity.
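Both procedures are straightforward to implement; the sketch below gives a pure-Python exact (binomial) McNemar’s test and Holm-Bonferroni adjustment, with illustrative discordant-pair counts rather than the study’s data:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar's test on discordant pair counts b and c
    (claims where one model is right and the other wrong)."""
    n = b + c
    k = min(b, c)
    # Under H0 the discordant pairs split 50/50; exact binomial tail,
    # doubled for two sides and capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)

def holm_bonferroni(p_values):
    """Holm-Bonferroni adjusted p-values (step-down, monotone, capped at 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Example: three pairwise model comparisons with made-up discordant counts.
raw = [mcnemar_exact(30, 10), mcnemar_exact(25, 20), mcnemar_exact(40, 12)]
print([round(p, 4) for p in holm_bonferroni(raw)])
```

Libraries such as statsmodels and SciPy provide equivalent tests (including the Chi-Squared test of independence used for the cross-language comparisons); the pure-Python version is shown only to make the procedure explicit.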

Author contributions

Conceived and designed the experiments: IAQ, ZAQ, AA, AAR. Performed the experiments: IAQ, ZK, AG, AJ, WS, AHA, MAS. Analyzed the data: IAQ, ZAQ, AA, AAR, ZK, AG, AJ, WS, AHA, MAS. Contributed materials/analysis tools: IAQ, ZAQ, AAR, AA, ZK, AG. Wrote the main manuscript text: IAQ, ZAQ, AA, ZK, AG.

Data availability

All data and analysis scripts used in this study can be accessed at the following link: https://osf.io/87gzp/overview?view.

Code availability

Analysis scripts and custom code used in this study can be accessed through this link.

Declarations

Competing interests

The authors declare no competing interests.

Ethics

All authors meet Nature Portfolio authorship criteria, with no external collaborators. The study did not involve human participants or animal subjects, so institutional review board approval was not required. Throughout, we followed institutional research-integrity guidelines, the Global Code of Conduct for Research in Resource-Poor Settings, and inclusive citation practices to ensure that no groups are disadvantaged.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-39046-w.

References

  • 1.Burton, J. W. et al. How large language models can reshape collective intelligence. Nature Human Behaviour 8(9), 1643–1655. 10.1038/s41562-024-01959-9 (2024). [DOI] [PubMed] [Google Scholar]
  • 2.Jones, B. & Jones, R. Action research at the BBC: Interrogating artificial intelligence with journalists to generate actionable insights for the newsroom. Journalism 0(0). 10.1177/14648849251317150 (2025). [DOI] [PMC free article] [PubMed]
  • 3.DeVerna, M. R., Yan, H. Y., Yang, K. C. & Menczer, F. Fact-checking information from large language models can decrease headline discernment. Proc. Natl. Acad. Sci. U.S.A. 121(50), 2322823121. 10.1073/pnas.2322823121 (2024) (Epub 2024 Dec 4). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Leippold, M. et al. Automated fact-checking of climate claims with large language models. Technical Report 24-93 (Swiss Finance Institute, 2024). Forthcoming, Nature Climate Action. https://ssrn.com/abstract=4731802. [DOI] [PMC free article] [PubMed]
  • 5.Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120(30), 2305016120. 10.1073/pnas.2305016120. https://www.pnas.org/doi/pdf/10.1073/pnas.2305016120 (2023). [DOI] [PMC free article] [PubMed]
  • 6.Vetter, M. A., Jiang, J. & McDowell, Z. J. An endangered species: how llms threaten wikipedia’s sustainability. AI & Society10.1007/s00146-025-02199-9 (2025). [Google Scholar]
  • 7.Brown, T.B., et al. Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165 (2020).
  • 8.Touvron, H., et al. Llama 2: Open Foundation and Fine-Tuned Chat Models . https://arxiv.org/abs/2307.09288. (2023).
  • 9.Reid, L. Generative AI in Search: Let Google do the searching for you. Accessed: 2024-11-15. https://blog.google/products/search/generative-ai-google-search-may-2024/ (2024).
  • 10.The Poynter Institute. The State of Fact-Checkers 2023. Accessed: 2024-12-07. https://www.poynter.org/wp-content/uploads/2024/04/State-of-Fact-Checkers-2023.pdf (2024).
  • 11.Reuters Institute for the Study of Journalism. Digital News Report 2024. Accessed: 2024-11-12. https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2024-06/RISJ_DNR_2024_Digital_v10%20lr.pdf (2024).
  • 12.Jones, N. AI hallucinations can’t be stopped - but these techniques can limit their damage. Nature 637(8047), 778–780. 10.1038/d41586-025-00068-5 (2025). [DOI] [PubMed] [Google Scholar]
  • 13.Hao, G. et al. Quantifying the uncertainty of LLM hallucination spreading in complex adaptive social networks. Scientific Reports 14, 16375. 10.1038/s41598-024-66708-4 (2024). [DOI] [PMC free article] [PubMed]
  • 14.Ji, Z. et al. Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1–38. 10.1145/3571730 (2023). [Google Scholar]
  • 15.Steyvers, M. et al. What large language models know and what people think they know. Nature Machine Intelligence 7(2), 221–231. 10.1038/s42256-024-00976-7 (2025). [Google Scholar]
  • 16.Brauner, P., Glawe, F., Liehner, G.L., Vervier, L. & Ziefle, M. Misalignments in AI Perception: Quantitative Findings and Visual Mapping of How Experts and the Public Differ in Expectations and Risks, Benefits, and Value Judgments. https://arxiv.org/abs/2412.01459 (2024).
  • 17.Daniel, L. The irony–ai expert’s testimony collapses over fake ai citations. Forbes. Accessed: 2025-02-09 (2025).
  • 18.Sathish, V., Lin, H., Kamath, A.K. & Nyayachavadi, A. LLeMpower: Understanding Disparities in the Control and Access of Large Language Models. https://arxiv.org/abs/2404.09356(2024).
  • 19.Whiting, K. What is a small language model and how can businesses leverage this ai tool? World Economic Forum. Accessed: 2025-02-15 (2025).
  • 20.Dudy, S., et al. Unequal opportunities: Examining the bias in geographical recommendations by large language models. In: Proceedings of the 30th International Conference on Intelligent User Interfaces. IUI ’25, 1499–1516. (Association for Computing Machinery, New York, NY, USA, 2025) 10.1145/3708359.3712111.
  • 21.Sheng, E., Chang, K.-W., Natarajan, P. & Peng, N. The woman worked as a babysitter: On biases in language generation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3407–3412 (Association for Computational Linguistics, Hong Kong, China, 2019). https://arxiv.org/abs/1909.01326.
  • 22.Meta Newsroom: Testing Begins for Community Notes on Facebook, Instagram and Threads. Accessed: 2025-04-09 https://about.fb.com/news/2025/03/testing-begins-community-notes-facebook-instagram-threads/ (2025).
  • 23.Porter, B. & Machery, E. AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably. Scientific Reports 14, 26133. 10.1038/s41598-024-76900-1 (2024). [Google Scholar]
  • 24.Jakesch, M., Hancock, J. T. & Naaman, M. Human heuristics for AI-generated language are flawed. Proceedings of the National Academy of Sciences 120(11). 10.1073/pnas.2208839120 (2023). [DOI] [PMC free article] [PubMed]
  • 25.Spitale, G., Biller-Andorno, N. & Germani, F. AI model GPT-3 (dis)informs us better than humans. Science Advances 9(26), 1850. 10.1126/sciadv.adh1850. https://www.science.org/doi/pdf/10.1126/sciadv.adh1850 (2023). [DOI] [PMC free article] [PubMed]
  • 26.European Commission: AI Act. Accessed: 2025-01-05. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai (2024).
  • 27.Hoes, E., Altay, S. & Bermeo, J. Leveraging ChatGPT for Efficient Fact-Checking. PsyArXiv. 10.31234/osf.io/qnjkf . osf.io/preprints/psyarxiv/qnjkf_v1 (2023).
  • 28.Quelle, D. & Bovet, A. The perils and promises of fact-checking with large language models. Frontiers in Artificial Intelligence 7. 10.3389/frai.2024.1341697 (2024). [DOI] [PMC free article] [PubMed]
  • 29.He, J., et al. Does Prompt Formatting Have Any Impact on LLM Performance?. https://arxiv.org/abs/2411.10541 (2024).
  • 30.Tam, Z.R., et al. Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. https://arxiv.org/abs/2408.02442 (2024).
  • 31.Ma, H., et al. EX-FEVER: A dataset for multi-hop explainable fact verification. In: (eds Ku, L.-W., Martins, A., Srikumar, V..) Findings of the Association for Computational Linguistics: ACL 2024, 9340–9353 (Association for Computational Linguistics, Bangkok, Thailand, 2024). 10.18653/v1/2024.findings-acl.556.
  • 32.Pelrine, K., et al. Towards reliable misinformation mitigation: Generalization, uncertainty, and GPT-4. In: (eds Bouamor, H., Pino, J., Bali, K.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 6399–6429. (Association for Computational Linguistics, Singapore, 2023). 10.18653/v1/2023.emnlp-main.395.
  • 33.Li, Y. & Zhai, C. An Exploration of Large Language Models for Verification of News Headlines . In: 2023 IEEE International Conference on Data Mining Workshops (ICDMW), 197–206 (IEEE Computer Society, Los Alamitos, CA, USA, 2023) 10.1109/ICDMW60847.2023.00032.
  • 34.Dawn. Hey chatbot, is this true? AI ‘factchecks’ sow misinformation. DAWN.COM (2025).
  • 35.Trippas, J.R., Al Lawati, S.F.D., Mackenzie, J. & Gallagher, L. What do users really ask large language models? an initial log analysis of google bard interactions in the wild. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’24, 2703–2707 (Association for Computing Machinery, New York, NY, USA, 2024)  10.1145/3626772.3657914. [DOI]
  • 36.Kelly, B. & Harsel, L. Investigating ChatGPT Search: Insights from 80 Million Clickstream Records https://www.semrush.com/blog/chatgpt-search-insights/ (2025).
  • 37.International Fact-Checking Network: International Fact-Checking Network. Accessed: 2024-11-03. https://www.poynter.org/ifcn/ (2024).
  • 38.Kruger, J. & Dunning, D. Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology77(6), 1121–1134. 10.1037/0022-3514.77.6.1121 (1999). [DOI] [PubMed] [Google Scholar]
  • 39.Google: Fact Check Tools - About. Accessed: 2024-10-02. https://toolbox.google.com/factcheck/about
  • 40.Jiang, A.Q., et al. Mistral 7B https://arxiv.org/abs/2310.06825 (2023).
  • 41.Jiang, A.Q., et al. Mixtral of Experts https://arxiv.org/abs/2401.04088 (2024).
  • 42.OpenAI: OpenAI Models Documentation. Accessed: 2025-02-09 https://platform.openai.com/docs/models (2025).
  • 43.Chow, C. K. On optimum recognition error and reject tradeoff. IEEE Trans. Inf. Theor. 16(1), 41–46. 10.1109/TIT.1970.1054406 (1970). [Google Scholar]
  • 44.Chow, C. K. An optimum character recognition system using decision functions. 10.1109/TEC.1957.5222035 (1957). [Google Scholar]
  • 45.Traub, J., et al. Overcoming Common Flaws in the Evaluation of Selective Classification Systems https://arxiv.org/abs/2407.01032 (2024).
  • 46.Madhusudhan, N., Madhusudhan, S.T., Yadav, V., & Hashemi, M. Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models . https://arxiv.org/abs/2407.16221 (2024).
  • 47.Weidinger, L., et al. Taxonomy of risks posed by language models. FAccT ’22, 214–229 (Association for Computing Machinery, New York, NY, USA, 2022)  10.1145/3531146.3533088. [DOI]
  • 48.Lin, S., Hilton, J. & Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods https://arxiv.org/abs/2109.07958 (2022).
  • 49.Xu, H., et al. Large Language Models for Cyber Security: A Systematic Literature Review https://arxiv.org/abs/2405.04760 (2025).
  • 50.Li, Y., Wang, S., Ding, H. & Chen, H. Large language models in finance: A survey. In: Proceedings of the Fourth ACM International Conference on AI in Finance. ICAIF ’23, 374–382 (Association for Computing Machinery, New York, NY, USA, 2023).  10.1145/3604237.3626869. [DOI]
  • 51.Yang, H., Wang, Y., Xu, X., Zhang, H. & Bian, Y. Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer https://arxiv.org/abs/2405.16856 (2024).
  • 52.Pawitan, Y. & Holmes, C. Confidence in the Reasoning of Large Language Models. Harvard Data Science Review 7(1). https://hdsr.mitpress.mit.edu/pub/jaqt0vpb (2025).
  • 53.Kadavath, S., et al. Language Models (Mostly) Know What They Know https://arxiv.org/abs/2207.05221 (2022).
  • 54.Organization for Women in Science for the Developing World: Countries in the Global South. Accessed: 2024-11-23 https://owsd.net/sites/default/files/OWSD%20138%20Countries%20-%20Global%20South.pdf (2020).
