Abstract
Fact-checking is crucial because rumours and misinformation spread easily through social networking services (SNS) and online discussions and negatively affect both. Meanwhile, fact-checking with large language models (LLMs) is becoming increasingly popular as LLM performance improves. However, previous work has issues, including overconfidence in the judgment results of LLMs and the insufficiency of binary fact-checking given the complexity of online text. Moreover, using multiple information sources to make judgments reveals another obstacle: the lack of a proper scoring mechanism. We therefore propose a framework called multi-agent fact-checking (MAFC), which includes multiple agents with unique information sources to measure the credibility of a text. Specifically, a new scoring mechanism calculates credibility from each agent's judgment results and confidence. We evaluated the proposed method through several comparative experiments, whose results show that it outperforms the baselines in both the binary fact-checking task and the multi-label fact-checking task. Finally, challenges remaining in the fact-checking field, such as definition standards and dataset creation, are discussed.
Keywords: Misinformation detection, Multi-agent systems, Human-agent interaction, Credibility modelling
Subject terms: Computer science, Information technology
Introduction
The rapid growth of the Internet and its applications has made online discussions far more popular and significant than traditional communication methods. At the same time, rumours and misinformation now have a greater impact due to the increasing popularity of online platforms1. Multiple previous studies point out that rumours and misinformation spread faster than the truth on online social media2,3. As a result, detecting and classifying information, also known as fact-checking, remains a significant challenge today.
In tandem with this trend, there has been a surge of work on automatic fact-checking. According to a survey of this field4, automatic fact-checking follows three steps: (1) claim detection, (2) evidence retrieval and (3) claim verification. The third step can be further divided into two parts: (1) verdict prediction, where claims are classified with veracity labels, and (2) justification production, where explanations of the judgments are produced.
Several studies have attempted to address the fact-checking problem by utilising all three steps mentioned above, leveraging advances in large language models (LLMs). For example, FacTool is a project for factuality verification in generative AI that judges claims with the help of additional retrieved information5. Chain-of-Verification (CoVE) is a framework for verifying whether the responses generated by an LLM contain hallucinations; unlike FacTool, it does not retrieve information from other sources, meaning that CoVE judges credibility depending only on the LLM6. SelfCheckGPT is another framework that detects whether statements are factual by comparing multiple responses generated by the LLM7.
On the other hand, several LLM-based multi-agent systems (MAS) for fact-checking tasks have been proposed recently: LoCal8 is a framework that includes agents for decomposing, reasoning and evaluating tasks and outputs a veracity label; MADR9 leverages multiple LLMs as agents with diverse roles that debate with each other to produce faithful explanations for fact-checking tasks; DelphiAgent10 shares a similar design with LoCal, building an LLM-based MAS with diverse agents in unique roles that emulates the Delphi method to verify the factuality of claims. The differences and features of these related studies and the method proposed in this paper are summarised in Table 1.
Table 1.
Comparison of related research and the proposed method.
| | Information source | Verdict prediction | System design | Results format | Mechanism |
|---|---|---|---|---|---|
| FacTool | The Internet | Binary | Single agent | Labels | LLM decision |
| CoVE | LLM | Binary | Single agent | Labels | LLM decision |
| SelfCheckGPT | LLM | Binary | Single agent | Scores | Scoring |
| LoCal | The Internet | Multi-label | Multi-agent | Labels | LLM decision |
| MADR | LLM | Binary | Multi-agent | Labels | LLM decision |
| DelphiAgent | LLM | Multi-label | Multi-agent | Labels | LLM decision |
| Ours | Google/Wikipedia/LLM | Binary/Multi-label | Multi-agent | Scores | Scoring |
There are three main issues in the evidence retrieval and verdict prediction processes that limit the application of fact-checking systems to online discussion: (1) Most fact-checking relies on a single source, which is assumed to be authoritative. (2) The judgment results made by large language models (LLMs) from the provided information tend to be overconfident. (3) Binary label classification alone is insufficient for complex online text, since posts on SNS are often only partly wrong, and detecting the degree of credibility of rumours and misinformation is critical for suppressing their spread.
To address the challenges and obstacles mentioned above, a multi-agent fact-checking framework combined with an LLM is proposed to measure how trustworthy a text is. In particular, to address issues (1) and (2), multiple fact-checking agents are designed and implemented in this framework. Each agent gathers evidence from a unique information source to judge the factuality of the claims extracted from the original text and reports a confidence level for each claim. Multiple agents are used because relying on a single source retrieved from the Internet invites overconfidence, and verifying that information is difficult. To address issue (3), the following concepts are first clarified: (1) credibility is an attribute of a text, describing how much the text can be trusted as a value between 0 and 1; (2) factuality is an attribute of a claim, describing whether the claim is correct as a binary value: true or false. A scoring mechanism is then designed to convert the factuality judgment results and the confidence of every agent into a number representing the credibility of the text.
One might expect that multi-agent systems with simpler scoring mechanisms, such as averaging or summing the scores, would perform similarly to the proposed scoring mechanism, or that an LLM by itself would reach similar performance. These comparisons are examined through a series of comparative experiments.
The structure of this paper is organised as follows: Sect. ''Related research'' reviews previous research. Sect. ''Design of multi-agent fact-checking system'' then presents a comprehensive design of the multi-agent fact-checking framework. Subsequently, Sect. ''Agent workflow'' introduces the process each agent follows to judge the factuality of claims; Sect. ''Scoring mechanism of credibility'' describes the mechanism that measures credibility based on the judgment results and confidence reported by the agents; Sect. ''Experimental evaluation'' then reports the results of a series of comparative experiments. Moreover, Sect. ''Discussion'' discusses the strengths and weaknesses of the system, followed by a discussion of definition standards and dataset creation. Finally, Sect. ''Conclusion'' offers a summary and concluding remarks.
Related research
Fact verification based on LLM
Many fact-checking frameworks and applications utilising LLMs have made significant strides in addressing the fact-checking task. Chern et al. developed a framework called FacTool to detect the factuality (true or false) of individual claims within articles using an LLM. It demonstrates relatively high performance on the knowledge-based (KB) question-answering (QA) task of a manually created dataset called FactPrompts5. FactPrompts comprises 50 real-world prompts sourced from various platforms and datasets, such as Quora and TruthfulQA11, along with corresponding responses generated by ChatGPT. The claims and responses in the dataset were assigned binary labels, “True” or “False”, depending on whether parts of the claims or responses contain factual inaccuracies. The evaluation results of FacTool powered by GPT-4 showed strong performance in detecting inaccuracies within individual sentences included in FactPrompts.
On the other hand, other related research aims to detect hallucinations relying only on the LLM, and such methods are easily applied to the rumour detection field. A key assumption behind this kind of solution is that an LLM only occasionally responds to a query with hallucinations, so hallucinations can be mitigated by generating and comparing multiple responses.
Chain-of-Verification (CoVE), proposed by Dhuliawala et al.6, is a framework that verifies whether potential hallucinations exist in the responses of an LLM. The framework detects potential hallucinations by generating a series of verification questions, using the same LLM to answer those questions, and comparing the answers with the original article. Multiple comparative experiments have shown that it reduces hallucinations and improves the performance of the LLM.
Moreover, SelfCheckGPT, proposed by Manakul et al., is another framework for detecting whether statements are factual by comparing multiple responses7. Specifically, SelfCheckGPT generates several sampled responses to the same query and compares them with the original text through BERTScore12, n-gram similarity13, an automatic multiple-choice question answering and generation (MQAG) framework14 and other measures, using the similarity of these pieces of text to verify whether the original text is trustworthy.
Last but not least, multiple LLM-based MAS have been designed to solve fact-checking tasks. LoCal is an LLM-based MAS proposed by Ma et al.8, including multiple agents that decompose a claim into several sub-tasks for evidence retrieval, generate solutions by reasoning, and evaluate whether the evidence supports the claim. MADR is a framework proposed by Kim et al.9, which includes multiple LLM-based agents with diverse roles to enhance faithfulness in generating explanations for fact-checking tasks through iterative debate. DelphiAgent is another multi-agent verification framework, proposed by Xiong et al.10, in which multiple LLM-based agents serve as experts with unique personalities to apply the Delphi method to automated fact-checking.
LLM-based multi-agent systems
While these fact-checking frameworks using LLM have made strides, combining LLM with multi-agent systems offers new possibilities for improving accuracy. The combination of multi-agent systems with LLMs reflects the profound impact of recent advancements in LLM performance on the multi-agent field. For instance, MetaGPT15 is a meta-programming framework powered by LLM-driven agents. By assigning these agents distinct roles in the software development process and providing them with appropriate instructions, MetaGPT facilitates automated software development across various tasks. In the Natural Language-Based Society of Mind (NLSOM)16, agents with distinct functions collaborate and interact to solve complex issues, such as general language-based tasks, through multiple rounds of brainstorming. On the other hand, instead of concentrating on a certain kind of task, AutoGen17 creates a general framework for developers to customise the roles and numbers of agents to build LLM-based applications.
Democratic deliberation and social discussion experiments
Many online social discussion experiments have been conducted previously, and their results support the design of the proposed fact-checking framework. Firstly, as noted by Rafik and Ito18, the absence of a fact-checking mechanism during democratic deliberations negatively impacts the quality of discussions and the achievement of consensus. According to previous social discussion experiments19–23, it is quite common for a single piece of text on SNS to contain multiple claims. Besides, some fake news spreading through SNS is written intentionally to mislead readers and participants of online discussions and therefore mixes true and false information, which makes detecting manually crafted misinformation challenging24,25. Claim extraction is therefore an essential step before verifying factuality. Similarly, relying on a single source of external information when verifying factuality is insufficient.
Design of multi-agent fact-checking system
The design and workflow of the multi-agent fact-checking system can be divided into three main components: (1) processing the original text, including claim extraction and query generation; (2) fact-checking by individual agents, where each agent assesses the factuality (true or false) of claims and returns confidence scores for its judgments; and (3) a unique mechanism that converts these independent results and confidence levels into a final score, referred to as the Multi-Agent Fact-Checking (MAFC) Score, which reflects the overall credibility (correctness level) of the text and ranges from 0 to 1. With the score obtained in the last step, the credibility of the original text can easily be mapped to different numbers of labels according to the rumour detection task. The general design and overall workflow are illustrated in Fig. 1.
Figure 1.
The general design and the overall workflow of the multi-agent fact-checking system.
Task settings and labels
In this work, we distinguish between claim-level factuality and text-level credibility. Given an input text $T$, the system first extracts a set of claims $C = \{c_1, c_2, \dots, c_n\}$. A claim is an atomic factual statement that can be judged as either correct or incorrect. For each claim $c_i$, the agents execute factuality verification, producing a binary decision (True or False) with a confidence value within the range [0, 1].
At the level of the original text $T$, we define credibility as the overall correctness level of the text based on its extracted claims. For the classification experiments, we consider three credibility labels:
TRUE: all extracted claims in the text are factually correct.
FALSE: all extracted claims in the text are factually incorrect.
PARTLY TRUE: the text contains a mixture of correct and incorrect claims, that is, at least one true claim and at least one false claim.
The scoring mechanism of the multi-agent fact-checking system maps the agents' claim-level judgments and confidence scores to a continuous credibility score in [0, 1] for each claim and for the text as a whole. This continuous score is then thresholded to obtain either binary labels (TRUE / FALSE) or the three-level labels (TRUE / PARTLY TRUE / FALSE) described above. In the rest of the paper, we refer to the former setting as the binary fact-checking task and to the latter, in which the system must discriminate between multiple credibility levels at the text level, as the multi-label fact-checking task.
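As a minimal illustration of this thresholding step (the function names are ours, and the cut-off values shown here are the ones used later in the experiments), the mapping from a continuous credibility score to labels can be sketched in Python as follows:

```python
def to_binary_label(score: float, threshold: float = 0.5) -> str:
    """Map a credibility score in [0, 1] to a binary label (TRUE / FALSE)."""
    return "TRUE" if score >= threshold else "FALSE"


def to_three_level_label(score: float, low: float = 0.33, high: float = 0.67) -> str:
    """Map a credibility score in [0, 1] to TRUE / PARTLY TRUE / FALSE."""
    if score < low:
        return "FALSE"
    if score > high:
        return "TRUE"
    return "PARTLY TRUE"
```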
LLM parameter settings and prompts
Claim extraction and query generation are implemented using GPT-4-turbo26. The prompts are adapted from FacTool, where they were thoroughly tested by Chern et al. on the Robust Summarisation Evaluation (RoSE) benchmark, a large human-evaluation dataset consisting of 22,000 summary-level annotations across 28 top-performing systems on three datasets5,11. Additionally, the temperature parameter is set to 0 to ensure stable output. The detailed prompts are given in Supplementary A2.
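As a minimal sketch of this step, the call below assumes the OpenAI Python client and uses a placeholder system prompt; the actual prompts are listed in Supplementary A2 and are not reproduced here:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

CLAIM_EXTRACTION_PROMPT = "..."  # placeholder; see Supplementary A2 for the real prompt


def extract_claims(text: str) -> str:
    """Run claim extraction with temperature 0 so that the output is stable across runs."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": CLAIM_EXTRACTION_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```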
Agent workflow
Algorithm 1.
The workflow of single agents
The agents are designed with similar structures to simplify upgrading the system and adding new agents. However, each agent is unique in its information source and processing method. Currently, the multi-agent fact-checking system uses three distinct information sources for its agents to verify the factuality of claims: the Google Search API provided by Serper, the Wikipedia API, and the large language model itself. The detailed APIs are listed in Supplementary A1. GPT-4-turbo is used in the experiments. To minimise the impact of hallucinations in the LLM's output, the agent using the LLM as its information source is designed and implemented following the SelfCheckGPT framework7. The agents using these information sources are referred to as the Google agent, the Wikipedia agent, and the LLM agent, respectively. In this study, all agents are given equal weight in the aggregation, which reflects a deliberate simplifying assumption: Google search, Wikipedia summaries and LLM self-checking provide complementary types of evidence, and our goal is to study how to aggregate heterogeneous sources under the assumption that their reliabilities are comparable. The limitation of this setting is further discussed in Sect. ''Limitation''.
Claims are first extracted from the original text and converted into queries. Each agent then searches for relevant information using its assigned information source. The Google agent uses the Google Search API to retrieve the top pages and extracts relevant search snippets from the API's response. Similarly, the Wikipedia agent searches for the most relevant page and retrieves its summary. In contrast, the LLM agent determines the truthfulness of a claim using a method similar to hallucination detection: it first generates a paragraph of text based on the content of the query, then compares the generated text with the original claim to identify whether any hallucination exists in the generated paragraph. Finally, the generated text serves as evidence for judging the factuality of the given claim.
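The retrieval step of the Google and Wikipedia agents can be sketched as follows; this is an illustrative sketch assuming the Serper search endpoint and the `wikipedia` Python package, and the helper names, field names and parameters are ours rather than taken from the released code:

```python
import os

import requests
import wikipedia


def google_evidence(query: str, top_k: int = 3) -> str:
    """Query the Serper Google Search API and join the top result snippets as evidence."""
    response = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query},
        timeout=30,
    )
    results = response.json().get("organic", [])[:top_k]
    return " ".join(item.get("snippet", "") for item in results)


def wikipedia_evidence(query: str) -> str:
    """Retrieve the summary of the most relevant Wikipedia page for the query."""
    titles = wikipedia.search(query, results=1)
    return wikipedia.summary(titles[0], auto_suggest=False) if titles else ""
```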
With the external information provided, every agent uses a common prompt to verify the factuality of each claim. The model and parameters used by the LLM in this task are identical to those used for claim extraction and query generation. In claim verification, two elements are required: factuality and confidence. Factuality is a binary value (true or false) representing the accuracy of a claim, while confidence is a float value between 0 and 1 that reflects the LLM's assessment of how trustworthy its response is given the provided information. The factuality of each claim is then converted to 1 if the claim is judged as true and to -1 otherwise, and the weighted score of an individual agent for a single claim is calculated as this factuality verification result multiplied by the confidence. With this design, the closer the absolute value of the weighted score is to 1, the more confident the judgment. The detailed algorithm each agent uses to calculate the weighted score of every claim is presented in Algorithm 1. The general workflow of the agents in this system is illustrated in Fig. 2: the workflow of the Google agent and the Wikipedia agent is shown in Fig. 2a, and the workflow of the LLM agent in Fig. 2b.
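The per-claim step of Algorithm 1 reduces to a few lines once the verification prompt has been parsed into a factuality decision and a confidence value; the sketch below uses our own names and assumes that parsing has already been done:

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    factuality: bool   # True if the agent judges the claim correct
    confidence: float  # the agent's self-reported confidence in [0, 1]


def weighted_score(verdict: Verdict) -> float:
    """Algorithm 1: map (factuality, confidence) to a weighted score in [-1, 1].

    True maps to +confidence and False to -confidence, so values near +1 or -1
    indicate confident judgments while values near 0 indicate uncertainty.
    """
    sign = 1.0 if verdict.factuality else -1.0
    return sign * verdict.confidence
```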
Figure 2.
The detailed design of each agent in the multi-agent system.
Scoring mechanism of credibility
Since the goal is to measure the credibility of the text, which reflects how much of the text is correct (e.g., fully true, partly true, or fully false), a scoring mechanism that combines the factuality judgments and confidence levels of each claim is necessary. The general conversion process is illustrated in Fig. 3.
Figure 3.

The general conversion process of the mechanism.
The following definitions are given as prerequisites of the mechanism. An original text is called $T$, and the claim set $C = \{c_1, c_2, \dots, c_n\}$ contains all the claims extracted from $T$. A single agent is named $a$, and the set of all agents is named $A$. For the $i$-th claim $c_i$, the agents that judge $c_i$ as true form the set $A_i^{\mathrm{true}}$; similarly, the agents that judge $c_i$ as false form the set $A_i^{\mathrm{false}}$. Moreover, for each agent judging a single claim, the weighted score of agent $a$ representing its signed confidence is denoted $w_{a,i}$. Finally, the scores representing how correct or incorrect the claim $c_i$ is are named $S_i^{\mathrm{true}}$ and $S_i^{\mathrm{false}}$, respectively.

Based on the definitions given above, the formulas to calculate $S_i^{\mathrm{true}}$ and $S_i^{\mathrm{false}}$ are defined as follows:

$$S_i^{\mathrm{true}} = \log\bigl(1 + |A_i^{\mathrm{true}}|\bigr) \cdot \frac{1}{|A_i^{\mathrm{true}}|} \sum_{a \in A_i^{\mathrm{true}}} w_{a,i} \tag{1}$$

$$S_i^{\mathrm{false}} = \log\bigl(1 + |A_i^{\mathrm{false}}|\bigr) \cdot \frac{1}{|A_i^{\mathrm{false}}|} \sum_{a \in A_i^{\mathrm{false}}} w_{a,i} \tag{2}$$

As Algorithm 1 illustrates, the weighted score generated by each agent is the product of the factuality judgment result, where 1 stands for true and -1 stands for false, and the confidence. Therefore, the following constraints are satisfied:

$$0 \le S_i^{\mathrm{true}} \le \log\bigl(1 + |A_i^{\mathrm{true}}|\bigr), \qquad -\log\bigl(1 + |A_i^{\mathrm{false}}|\bigr) \le S_i^{\mathrm{false}} \le 0 \tag{3}$$

Thus, the veracity score of the claim $c_i$, expressing the combination of the factuality judgment results and the confidence, can be calculated as:

$$V_i = S_i^{\mathrm{true}} + S_i^{\mathrm{false}} \tag{4}$$

When all agents judge a claim as true ($A_i^{\mathrm{true}} = A$) with full confidence, the veracity score $V_i$ reaches its upper limit. Similarly, when all agents judge a claim as false ($A_i^{\mathrm{false}} = A$) with full confidence, the veracity score $V_i$ reaches its lower limit. Thus, the upper limit and the lower limit can be expressed as follows, where $|A|$ represents the number of all agents in this system, as formula 3 has demonstrated:

$$V_i^{\max} = \log\bigl(1 + |A|\bigr), \qquad V_i^{\min} = -\log\bigl(1 + |A|\bigr) \tag{5}$$

Thus, the general confidence of the claim $c_i$ is presented as the multi-agent fact-checking score (MAFC Score), which is calculated as the normalised value of the veracity score of the claim $c_i$, and the range of this value is between 0 and 1:

$$\mathrm{MAFC}(c_i) = \frac{V_i - V_i^{\min}}{V_i^{\max} - V_i^{\min}} = \frac{V_i + \log(1 + |A|)}{2\log(1 + |A|)} \tag{6}$$

Furthermore, for a piece of text $T$ including $n$ claims, the upper and lower limits of the summed veracity scores can be presented as:

$$V_T^{\max} = n\log\bigl(1 + |A|\bigr), \qquad V_T^{\min} = -n\log\bigl(1 + |A|\bigr) \tag{7}$$

Finally, the MAFC Score of the text $T$, representing its credibility, can be presented as follows:

$$\mathrm{MAFC}(T) = \frac{\sum_{i=1}^{n} V_i - V_T^{\min}}{V_T^{\max} - V_T^{\min}} = \frac{\sum_{i=1}^{n} V_i + n\log(1 + |A|)}{2n\log(1 + |A|)} \tag{8}$$
The range of the MAFC Score of a single claim, and of a piece of text including multiple claims, is from 0 to 1. Accordingly, the label of the text's credibility can be generated by dividing the range of the MAFC Score into several intervals and checking which interval the actual score falls in.
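To make the formulas above concrete, the sketch below aggregates per-agent weighted scores into claim-level and text-level MAFC Scores; the function and variable names are ours, and the natural logarithm is assumed (the base cancels in the normalisation, so it does not affect the final score):

```python
import math


def claim_side_scores(weighted: list[float]) -> tuple[float, float]:
    """Eqs. 1-2: average weighted score of each side, scaled by log(1 + coalition size)."""
    support = [w for w in weighted if w > 0]
    refute = [w for w in weighted if w < 0]
    s_true = math.log(1 + len(support)) * sum(support) / len(support) if support else 0.0
    s_false = math.log(1 + len(refute)) * sum(refute) / len(refute) if refute else 0.0
    return s_true, s_false


def claim_mafc(weighted: list[float], n_agents: int) -> float:
    """Eqs. 4-6: veracity score normalised into [0, 1] by the +/- log(1 + |A|) limits."""
    veracity = sum(claim_side_scores(weighted))
    limit = math.log(1 + n_agents)
    return (veracity + limit) / (2 * limit)


def text_mafc(per_claim_weighted: list[list[float]], n_agents: int) -> float:
    """Eqs. 7-8: aggregate the veracity scores of all claims in a text."""
    limit = math.log(1 + n_agents)
    veracity_total = sum(sum(claim_side_scores(w)) for w in per_claim_weighted)
    n_claims = len(per_claim_weighted)
    return (veracity_total + n_claims * limit) / (2 * n_claims * limit)
```

For example, with the three agents used in the experiments, `claim_mafc([0.9, 0.8, -0.4], 3)` yields roughly 0.74, which the binary task would map to TRUE.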
Explanation of the scoring mechanism
The MAFC scoring mechanism aggregates the judgments of multiple agents in two main steps. For each claim $c_i$, every agent $a$ produces a weighted score $w_{a,i}$, which is the product of a binary factuality decision (1 for true, -1 for false) and a confidence value in [0, 1]. This weighted score captures both the direction (true or false) and the strength (confidence) of the agent's belief about the claim. Then, we separate the agents into two sets, $A_i^{\mathrm{true}}$ and $A_i^{\mathrm{false}}$, representing the agents that judge $c_i$ as true or false, respectively. The quantities $S_i^{\mathrm{true}}$ and $S_i^{\mathrm{false}}$ in Eqs. 1 and 2 are defined as the average weighted score among the corresponding agents, multiplied by a factor that depends on how many agents support that decision. These two values describe how strongly the agents collectively indicate that a claim is correct or incorrect. Thus, $S_i^{\mathrm{true}}$ becomes larger when more agents confidently support a claim, and $S_i^{\mathrm{false}}$ becomes more negative when more agents confidently refute it. This trend satisfies the constraints in Eq. 3.

We design $S_i^{\mathrm{true}}$ and $S_i^{\mathrm{false}}$ with the logarithmic terms $\log(1 + |A_i^{\mathrm{true}}|)$ and $\log(1 + |A_i^{\mathrm{false}}|)$ to implement a diminishing-returns effect in the number of agents with similar judgment results, after considering several alternatives: (1) With a purely linear dependence on $|A_i^{\mathrm{true}}|$ or $|A_i^{\mathrm{false}}|$, the majority can easily dominate even when each agent in the majority has only weak confidence. For instance, if two agents judge a claim as True with relatively low confidence, such as 0.5, while the third agent judges it as False with a high confidence of 0.9, a linear aggregation would tend to favour the majority “True” decision despite the strong opposing evidence. (2) At the other extreme, a simple average scoring mechanism such as $\frac{1}{|A_i^{\mathrm{true}}|}\sum_{a \in A_i^{\mathrm{true}}} w_{a,i}$ ignores the size of the coalition. For example, if three agents all judge a claim as True with a confidence of 0.9, the average result remains True with 0.9 confidence, even though such unanimous high-confidence agreement should intuitively be more trustworthy. (3) A square-root weighting, such as $\sqrt{|A_i^{\mathrm{true}}|}$, partially addresses the dominance of large majorities, but in small panels it can make the system overly cautious. For instance, in a scenario where two agents judge a claim as True and another agent judges it as False, all with a similar confidence of 0.6, the final $V_i$ under the square-root weight is very close to 0, which means that the system is cautious about the claim's credibility. In contrast, the proposed logarithmic function provides a compromise: it increases the impact of multi-agent agreement compared to averaging, but grows more slowly than a square-root or linear function, so adding weakly confident supporters does not thoroughly wash out a strongly confident opposing agent.
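The trade-off can be checked numerically on the two-versus-one scenario from point (1): two agents judge a claim True with confidence 0.5 and one judges it False with confidence 0.9. The toy sketch below (our own variable names) applies the different coalition weights to the per-side mean weighted score:

```python
import math

supporters = [0.5, 0.5]  # weighted scores of the two agents judging the claim True
opponents = [-0.9]       # weighted score of the agent judging the claim False

schemes = {
    "linear": lambda k: k,
    "average": lambda k: 1.0,
    "square root": lambda k: math.sqrt(k),
    "logarithmic": lambda k: math.log(1 + k),
}


def veracity(weight, side_scores):
    """Coalition weight applied to the mean weighted score of one side (0 if the side is empty)."""
    if not side_scores:
        return 0.0
    return weight(len(side_scores)) * sum(side_scores) / len(side_scores)


for name, weight in schemes.items():
    v = veracity(weight, supporters) + veracity(weight, opponents)
    print(f"{name:12s} veracity score = {v:+.3f}")

# Only the linear scheme flips to the weakly confident majority (positive score);
# the logarithmic scheme stays slightly negative, preserving the confident objection.
```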
Experimental evaluation
Experimental settings
Datasets
We evaluate the proposed framework on three datasets: the SciFact27 dataset of scientific claims, a derived SciFact-Mixed dataset for text-level credibility, and the FEVER28 dataset from the FEVER 2018 shared task. Together, they allow us to study both claim-level and text-level fact-checking across scientific and general-domain Wikipedia claims.
SciFact is a scientific claim verification dataset containing 1.4k expert-written claims paired with abstracts from the research literature and annotated with SUPPORTED, REFUTED or NOT ENOUGH INFO labels, as well as evidential rationales. In the first group of experiments, we focus on binary factuality verification and use only claims labelled as SUPPORTED or REFUTED. In the binary fact-checking setting, we construct a balanced subset by randomly sampling 600 claims from the original dataset.
SciFact-Mixed is a new dataset derived from SciFact, developed in our work to study the text-level credibility of partially true content. Each entry in SciFact-Mixed is a short text composed of four SciFact claims. Concretely, we sample groups of four claims $(c_1, c_2, c_3, c_4)$ and concatenate them into a single input text $T$. The text-level credibility label is then determined by the composition of claim-level factuality, following the definitions introduced in Sect. ''Task settings and labels'':
TRUE: all four claims in the text are labelled as SUPPORTED in SciFact.
FALSE: all four claims in the text are labelled as REFUTED in SciFact.
PARTLY TRUE: the text contains at least one SUPPORTED and at least one REFUTED claim.
We sample such groups until we obtain approximately balanced numbers of texts for each label, resulting in 100 TRUE texts, 100 FALSE texts and 100 PARTLY TRUE texts. This construction yields a controlled test platform in which the underlying claim-level factuality fully determines the text-level credibility label. It allows us to analyse how well different aggregation mechanisms propagate claim-level judgments to text-level credibility, and in particular how accurately they identify PARTLY TRUE texts. However, the resulting texts remain artificial concatenations of independent claims and do not fully reflect the narrative structure of real social media posts, in which partial truths are often embedded within a single complex sentence or a longer story. We therefore view SciFact-Mixed as a first step towards multi-level credibility estimation and discuss the implications of this limitation in Sect. ''Limitation''.
FEVER (Fact Extraction and VERification) is a large-scale fact-verification dataset constructed from Wikipedia, containing 185,445 human-written claims generated by altering sentences extracted from Wikipedia and labelled as SUPPORTED, REFUTED or NOT ENOUGH INFO. The dataset formed the basis of the FEVER 2018 shared task, in which participating systems were evaluated on their ability to retrieve evidence from Wikipedia and classify each claim as SUPPORTED or REFUTED29. In this work, we use FEVER to examine whether the MAFC framework can be applied beyond scientific claims and to compare it, at a high level, with traditional fact-checking systems developed for the FEVER shared task. We follow a claim-level binary verification setting by focusing on claims labelled as SUPPORTED or REFUTED and excluding NOT ENOUGH INFO examples. We treat claims labelled as SUPPORTED as TRUE and those labelled as REFUTED as FALSE, as in the binary fact-checking task. Due to time and cost constraints, we randomly sample 5000 claims from FEVER as our test set. Since the FEVER 2018 shared task test set was also constructed by random sampling, our experimental results are broadly comparable to previously reported results, even though the mechanisms behind our framework and these trained models are entirely different.
Baselines
For the proposed binary fact-checking task on SciFact and the multi-label fact-checking task on SciFact-Mixed, we compare MAFC against two kinds of baselines designed to isolate the contributions of multi-agent aggregation and the proposed scoring mechanism. First, we include a single-agent LLM baseline, SelfCheckGPT, even though it was originally proposed for hallucination detection rather than fact-checking. One reason is that its hallucination-detection workflow, which consists of three steps (generating a query from the target text, generating an answer to that query, and comparing the target text with the generated answer), is structurally very similar to a fact-checking pipeline. In this sense, SelfCheckGPT can be naturally repurposed as an LLM-based fact-checker whose evidence is generated by the model itself. One might argue that this makes the method vulnerable to hallucinations in the generated evidence, but this limitation is inherent to all LLM-based frameworks and is also present in its original hallucination detection use case. The other reason is that SelfCheckGPT outputs a continuous hallucination score rather than only a binary judgment, and this continuous score can be directly reused as a credibility score in our multi-label fact-checking task. Many other LLM-based fact-checking approaches provide only discrete binary judgments and thus cannot be straightforwardly adapted to our graded credibility task without substantial modification.
Second, we construct two multi-agent baselines that use precisely the same three agents as the MAFC system (the Google agent, the Wikipedia agent and the LLM agent) but aggregate their outputs with simpler scoring mechanisms. The average baseline computes the mean of the agents' weighted scores, while the normalised sum baseline sums these scores and rescales the result to [0, 1]. These baselines represent natural ways to combine multiple agents and allow us to assess directly whether the MAFC scoring mechanism provides benefits beyond simple pooling of agent predictions.
Classic fact-checking baselines, such as ClaimBuster and the supervised models developed for FEVER, frame fact-checking as a supervised learning problem in which models are trained to classify claims based on evidence from a fixed corpus. In principle, these systems could serve as additional baselines for our experiments. However, adapting them to our setting is non-trivial: they typically assume a specific domain and input format, require access to their original training data and code, and are tightly coupled to particular evidence retrieval pipelines. In the case of ClaimBuster, we attempted to use the official API but were unable to obtain reliable responses during our experiments, preventing a fair and reproducible comparison on our datasets. Re-implementing ClaimBuster or retraining FEVER-style models for the scientific domain and our multi-level credibility labels lies beyond the scope of this work. Instead, we evaluate MAFC directly on the FEVER dataset and compare it with models submitted to the FEVER 2018 shared task29, providing an indirect comparison with traditional supervised fact-checking approaches.
Comparative experiments of the binary fact-checking task
The first group of comparative experiments was conducted using the SciFact dataset27, which contains 1.4k expert-written scientific claims paired with abstracts, along with supporting or refuting evidence to assess the factuality (true or false) of the claims. For these experiments, a smaller dataset was created by randomly extracting an equal number of correct and incorrect claims from the original dataset. In this set of experiments, claims with an MAFC score below 0.5 were classified as FALSE, while those with a score of 0.5 or higher were classified as TRUE. Since the original aim of SelfCheckGPT7 is to detect potential hallucinations in the output of the LLM, a claim is judged as FALSE whenever the LLM agent finds any hallucination in it. The multi-agent system with the average scoring mechanism and the multi-agent system with the sum scoring mechanism were also added as comparisons to verify whether the proposed MAFC scoring mechanism has advantages. The same judgment rule was applied to both multi-agent systems: scores below 0.5 were classified as FALSE, and scores of 0.5 or higher were classified as TRUE. In both cases, the agents judge the factuality of claims and report their confidence, and the final judgment depends on the average score or the normalised sum of the scores, respectively. The results of the binary fact-checking task using different methods are illustrated in Fig. 4.
Figure 4.
The results of the comparative experiments of the binary fact-checking task among targets for comparison.
As Fig. 4a, b, c and d illustrate, the proposed method achieved the highest average accuracy in classifying the factuality (true or false) of individual claims, outperforming all the other methods in this set of experiments. Notably, it performs approximately 7% better than the single SelfCheckGPT agent overall, demonstrating superior performance in binary fact-checking compared to previous research. The proposed method also holds slight advantages in the other three metrics: the highest precision (0.78 macro and 0.82 weighted), the highest recall (0.81 macro and 0.79 weighted) and the highest F1 score (0.78 macro and 0.80 weighted).
On the other hand, two other multi-agent systems with different scoring mechanisms were also included in the comparison, and the multi-agent system using the sum scoring mechanism performs far better than the one using the average scoring mechanism. However, both of these multi-agent systems perform worse than the proposed method, which indicates that the proposed MAFC scoring mechanism does have an advantage in binary fact-checking tasks. The reasons why the proposed method performs better than all the other methods are clarified in Sect. ''Evaluation of the proposed method against SelfCheckGPT''.
Comparative experiments of the multi-label fact-checking task
Another group of experiments was conducted to measure the performance differences in the multi-label classification fact-checking task among the same targets for comparison: the proposed method, the individual SelfCheckGPT agent, the multi-agent system with the average scoring mechanism and the multi-agent system with the sum scoring mechanism. Since the original SciFact27 dataset only contains two types of labels describing whether a claim is correct, a new dataset was built for this set of experiments based on the original SciFact dataset. Specifically, a few hundred claims were extracted randomly from the original dataset, and each entry in the new dataset combines four of these claims. For example, if four correct claims are combined into a piece of text in the new dataset, the label of that text is “TRUE”. Similarly, if four incorrect claims are combined into a new piece of text, its label is “FALSE”. Otherwise, the label of the text is “PARTLY TRUE”. The dataset construction and the corresponding label assignment follow the definitions in Sect. ''Task settings and labels''.
The settings of the scoring mechanisms are as follows. Each agent judges whether the claims contained in a text of the new dataset are correct and gives a confidence score for its judgment; these two steps are the same as in the proposed method. After that, the average scoring mechanism calculates the average score of the text, and the sum scoring mechanism calculates the normalised sum score of the text. If the score from the multi-agent system was below 0.33, the judgment result was classified as “FALSE”. Scores above 0.67 were classified as “TRUE”, while scores between 0.33 and 0.67 were classified as “PARTLY TRUE”. The results of the multi-label fact-checking task using different methods are illustrated in Fig. 5.
Figure 5.
The results of the comparative experiments of the multi-label classification fact-checking task among targets for comparison.
As Fig. 5a, b, c and d illustrate, the proposed method performs far better than the single SelfCheckGPT agent in the multi-label classification fact-checking task. The multi-agent system with the sum scoring mechanism, in turn, performs far better than the multi-agent system with the average scoring mechanism, yet the proposed method still holds an advantage over both. The proposed method shows significant advantages in all metrics of accuracy, precision, recall and F1 score compared to the other baselines. Even though the normalised sum scoring MAS yields close numbers on these metrics, the proposed method performs in a well-balanced way across all three labels, whereas the normalised sum scoring mechanism performs poorly on the PARTLY TRUE label. The reasons for these phenomena are described in Sect. ''Evaluation of the proposed method against multi-agent systems with different scoring mechanisms''.
Comparative experiments on FEVER
To compare the proposed framework with traditional fact-verification systems, we also evaluate MAFC on the FEVER dataset. As described in Sect. ''Datasets'', FEVER consists of human-written claims constructed from Wikipedia and annotated with labels indicating whether each claim is supported or refuted by Wikipedia evidence. In our experiments, we adopt a binary factuality setting and apply MAFC to classify FEVER claims as true or false. Importantly, MAFC is used in a zero-shot manner on FEVER, without any supervised training on FEVER labels. The baselines compared with our MAFC framework are the top four models from the FEVER 2018 shared task29, namely UNC-NLP30, UCL Machine Reading Group31, Athene UKP TU Darmstadt32 and Papelo33. The comparison results are shown in Table 2.
Table 2.
Comparative experiments results on FEVER.
| | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| Ours | 0.72 | 0.68 | 0.66 | 0.68 |
| UNC-NLP | 0.42 | 0.71 | 0.53 | 0.68 |
| UCL Machine Reading Group | 0.22 | 0.83 | 0.35 | 0.67 |
| Athene UKP TU Darmstadt | 0.23 | 0.85 | 0.37 | 0.65 |
| Papelo | 0.92 | 0.50 | 0.64 | 0.61 |
As the results illustrate, MAFC achieves an F1 score of 0.66, the highest among all compared models, and an accuracy of 0.68, matching the best-performing FEVER shared-task model, UNC-NLP. In terms of precision and recall, MAFC is well balanced, with a precision of 0.72 and a recall of 0.68. In contrast, Papelo has the highest precision (0.92) but the lowest recall (0.50), while UCL Machine Reading Group and Athene UKP TU Darmstadt have high recall (0.83 and 0.85) but low precision (0.22 and 0.23), respectively. These results indicate that MAFC can achieve performance comparable to, and on some metrics better than, supervised FEVER systems while maintaining a more balanced precision-recall profile.
We emphasise that this comparison is conservative. These baselines are trained and fine-tuned specifically on FEVER by separating the dataset into train, dev and test sets. The fact that our proposed method remains competitive under these conditions suggests that the proposed MAFC framework, including the scoring mechanism, is not restricted to scientific claims.
Statistical analysis
To quantify the uncertainty in our results due to the relatively small scale of the datasets, we estimate confidence intervals using a non-parametric bootstrap on the test set. For each dataset and each method that we implemented, we generated 1000 bootstrap resamples of the test set by sampling texts with replacement. We recompute accuracy, macro-precision, macro-recall and macro-F1 on each resample, and report the 2.5% and 97.5% percentiles of these values as 95% confidence intervals. The results are shown in Table 3. For the FEVER baselines, we only have access to the aggregate scores reported in the original paper29 and cannot recover per-example predictions, so we cannot run paired bootstrap tests against these models and instead quote their original results.
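A sketch of this procedure, assuming gold labels and per-example predictions are available as arrays (scikit-learn and NumPy are used; the function name is ours):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Non-parametric bootstrap: resample the test set with replacement and
    report the 2.5% / 97.5% percentiles of accuracy and macro-averaged metrics."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = {"accuracy": [], "precision": [], "recall": [], "f1": []}
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        p, r, f1, _ = precision_recall_fscore_support(
            y_true[idx], y_pred[idx], average="macro", zero_division=0
        )
        stats["accuracy"].append(accuracy_score(y_true[idx], y_pred[idx]))
        stats["precision"].append(p)
        stats["recall"].append(r)
        stats["f1"].append(f1)
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    return {name: (np.percentile(vals, lo), np.percentile(vals, hi))
            for name, vals in stats.items()}
```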
Table 3.
95% Confidence intervals (CIs) of experiment results using non-parametric bootstrap.
| | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| Binary fact-checking tasks | ||||
| Ours | 0.77, [0.72, 0.84] | 0.81, [0.75, 0.87] | 0.78, [0.72, 0.84] | 0.79, [0.74, 0.85] |
| SelfCheckGPT | 0.70, [0.64, 0.76] | 0.72, [0.66, 0.79] | 0.70, [0.64, 0.77] | 0.72, [0.66, 0.78] |
| Average scoring | 0.73, [0.68, 0.77] | 0.73, [0.68, 0.78] | 0.65, [0.58, 0.71] | 0.65, [0.56, 0.71] |
| Normalised sum scoring | 0.76, [0.70, 0.81] | 0.79, [0.73, 0.84] | 0.75, [0.69, 0.81] | 0.76, [0.70, 0.82] |
| Multi-label fact-checking tasks | ||||
| Ours | 0.96, [0.92, 0.99] | 0.96, [0.92, 0.99] | 0.96, [0.92, 0.99] | 0.96, [0.91, 0.99] |
| SelfCheckGPT | 0.57, [0.39, 0.45] | 0.56, [0.48, 0.63] | 0.50, [0.41, 0.59] | 0.57, [0.47, 0.66] |
| Average scoring | 0.50, [0.46, 0.54] | 0.46, [0.41, 0.51] | 0.41, [0.34, 0.46] | 0.46, [0.37, 0.57] |
| Normalised sum scoring | 0.92, [0.86, 0.97] | 0.91, [0.85, 0.96] | 0.91, [0.84, 0.96] | 0.91, [0.84, 0.96] |
Discussion
Evaluation of the proposed method against SelfCheckGPT
Since all agents in the proposed method are powered by GPT-4-turbo, it is important to clarify the performance difference between a single LLM-based agent and the proposed multi-agent approach. Because SelfCheckGPT was designed to reduce hallucinations in the output of large language models and has an excellent capability for detecting them7, it provides a solid reference point for measuring this performance gap, rather than comparing against a plain LLM. According to the experiment results illustrated in Fig. 4a and b, although the SelfCheckGPT agent based on GPT-4-turbo is highly effective at judging claim correctness, the proposed method demonstrates clear advantages in binary classification fact-checking, achieving an average accuracy of around 78% with consistent performance across correct and incorrect claims. However, as Fig. 4b illustrates, the disadvantage of the SelfCheckGPT agent, or of the LLM itself, is not only its relatively low accuracy. While SelfCheckGPT achieved nearly 77% accuracy in classifying correct claims, its accuracy dropped to 63% for incorrect claims, contributing to the overall accuracy gap of about 7% relative to the proposed method. Thus, the proposed method not only has better overall accuracy but is also more stable in binary classification fact-checking.
On the other hand, the proposed method performs overwhelmingly better than the single SelfCheckGPT agent in multi-label classification fact-checking tasks. As Fig. 5a illustrates, the proposed method reaches 97% accuracy in this task, whereas, as shown in Fig. 5b, SelfCheckGPT achieved only 57% accuracy. While SelfCheckGPT performed relatively well in classifying fully true or fully false texts, it struggled to detect false claims within partially true texts, leading to significantly lower overall accuracy compared to the proposed method. This comparison therefore demonstrates the proposed method's capability to detect texts that are only partly false.
Furthermore, the proposed method also consistently outperforms SelfCheckGPT in other metrics in both binary and multi-label fact-checking tasks. In the binary task, our method achieves macro precision, recall, and F1-scores of 0.78, 0.81, and 0.78 (vs. 0.70, 0.72, and 0.70 for SelfCheckGPT), and weighted precision, recall, and F1-scores of 0.82, 0.79, and 0.80 (vs. 0.74, 0.72, and 0.72), as illustrated in Fig. 4a and b. These gains indicate that our approach is both more reliable on a per-class basis and better aligned with the overall data distribution. The advantage becomes even more pronounced in the multi-label setting, where the proposed method attains 0.96 precision, recall, and F1-score, compared with 0.57, 0.56, and 0.50 for SelfCheckGPT, as illustrated in Fig. 5a and b. Taken together, these results suggest that our method not only reduces missed or misclassified claims but also scales more robustly when the fact-checking task requires simultaneous judgments over multiple labels.
Evaluation of the proposed method against multi-agent systems with different scoring mechanisms
The comparison among multi-agent systems using different scoring mechanisms is also necessary, since different scoring mechanisms may lead to different conclusions. Across the MAS scoring mechanisms, MAFC yields the strongest and most stable performance in both binary and multi-label fact-checking tasks. In the binary fact-checking tasks, as Fig. 4a, c and d illustrate, MAFC achieves the highest accuracy (0.79) and F1 scores (0.78 macro, 0.80 weighted) while maintaining balanced precision and recall, whereas average scoring suffers from much lower accuracy and recall (0.65) despite comparable precision, and normalised sum scoring remains slightly behind MAFC in both accuracy (0.76) and F1. The advantage of MAFC is even clearer in the multi-label fact-checking tasks, as Fig. 5a, c and d illustrate, where it attains 0.97 accuracy and 0.96 precision, recall and F1 score, substantially outperforming average scoring (0.55 accuracy, 0.41 F1 score) and still improving over normalised sum scoring (0.91 accuracy, 0.91 F1 score). These results indicate that the MAFC aggregation rule not only preserves high per-class and overall accuracy but also scales more robustly when aggregating multiple labels than simpler averaging or normalised-sum schemes.
As Fig. 4a, c and d illustrate, the multi-agent system using the average scoring mechanism performed significantly worse than the other two in binary classification fact-checking tasks. This discrepancy occurs because the accuracy of each agent’s judgment is heavily influenced by the reliability of its assigned information source. In this case, the scores calculated through the average scoring mechanism are influenced mainly by the wrong judgments. In contrast, both the sum scoring mechanism and the proposed MAFC mechanism are more effective at minimising the impact of incorrect judgments, with the proposed method offering further improvement through its unique algorithm.
Moreover, the comparison of multi-agent systems with different scoring mechanisms also reveals their characteristics. As Fig. 5c illustrates, the multi-agent system with the average scoring mechanism has the worst accuracy, and it shares with the single SelfCheckGPT agent the inability to detect incorrect claims hidden within a mixed text. The multi-agent systems with the proposed mechanism and with the normalised sum scoring mechanism achieve similar accuracy, but the former still classifies false and mixed texts better than the latter. Notably, the recall of entirely true or entirely false texts is lower for the proposed method than for the normalised sum scoring mechanism, which points to a drawback of the simple normalised sum: after summing the scores, some mixed texts are misjudged as completely false or completely true.
Limitation
While MAFC combines multiple information sources, the current implementation remains homogeneous at the model level: all agents are powered by the same backbone LLM, GPT-4-turbo. The agents are therefore diverse in terms of the evidence they access, but not in their underlying reasoning architecture. This design isolates the effect of evidence aggregation and the proposed scoring mechanism, but it also implies that systematic biases or hallucinations of the backbone model can propagate across agents. In particular, if the backbone systematically misinterprets a specific type of claim, the multi-agent system may amplify rather than correct this error. Applying different LLMs to the LLM-based agents implemented in the proposed framework may reduce the risk of hallucination influence, but it cannot eliminate hallucinations, since they stem from the common transformer architecture of LLMs, regardless of which LLM is used. Besides, it may also decrease prompt stability; the LLM’s performance is prompt-sensitive. An important potential future direction is to separate the agents’ roles, enabling multiple agents to reason and reducing the risk of hallucinations, though it will introduce the problem of cost.
Our experimental evaluation is also subject to limitations related to the dataset. The SciFact and SciFact-Mixed experiments focus on scientific claims, which differ in style and topic from the conversational and politically charged content that often appears on social media platforms. SciFact-Mixed further relies on synthetic texts constructed by concatenating multiple claims, yielding a controlled setting in which text-level credibility is fully determined by claim-level factuality but does not fully capture the narrative structure of naturally occurring mixed-veracity posts. The additional experiments on the FEVER dataset demonstrate that MAFC can be applied to general-domain Wikipedia claims and achieve performance comparable to traditional fact-checking systems developed for the FEVER shared task, yet FEVER claims remain short, decontextualised sentences. A thorough assessment of MAFC on longer, user-generated posts and real social media data remains as future work.
We further discuss the limitations of the MAFC framework's design. We qualitatively inspected misclassified instances, as well as claims that differ from those in the datasets used in our experiments, and noticed that the proposed method performs poorly on some strongly time-dependent claims, for example past political claims from election candidates on social media or rumours about COVID-19. One reason is that the proposed MAFC framework verifies claims using evidence retrieved from the Internet while ignoring temporal factors, so it uses current information to verify past claims; many claims that were correct in the past are consequently judged as false. The other reason is that the evidence returned by a web or Wikipedia search is incomplete or ambiguous in this scenario. Another potential risk lies in the evidence sources selected in the experiments: we use Google search, Wikipedia and the LLM itself as three distinct information sources and assume they carry equal weight as evidence. We selected these sources because they cover most claims in fact-checking tasks and provide complementary types of information, as mentioned in Sect. ''Agent workflow'', but we acknowledge that for fact-checking in particular fields, more trustworthy information sources such as fact-checking databases or academic papers may be better; in such scenarios, the agents with more trustworthy information sources can be assigned higher weights. Furthermore, when evidence from different sources genuinely conflicts, the agents tend to split, and MAFC typically assigns an intermediate credibility score rather than a confident TRUE or FALSE label, reflecting this uncertainty. Overall, these patterns suggest that MAFC is most reliable when multiple sources provide clear, consistent evidence for or against a claim, and less reliable when evidence is scarce, ambiguous or requires fine-grained domain reasoning. Last but not least, the threshold for the multi-label fact-checking task is somewhat subjective: separating the MAFC scores into three ranges for three-label fact-checking seems reasonable, but the thresholds would need to be adjusted when there are more labels, such as mostly true, mostly false and partly true.
Although MAFC is implemented as an automated multi-agent system, its internal structure already provides several interpretable components. For each input text, the framework first extracts explicit claims and then records, for each claim and each agent, a binary judgment (true or false), a confidence score and the retrieved evidence snippets from web search or Wikipedia. The final text-level credibility score is obtained by aggregating these claim-level and agent-level quantities through the MAFC scoring mechanism. In principle, these intermediate results can be made directly available to users, allowing them to inspect which claims were judged correct or incorrect, how strongly each agent supported these judgments, and which pieces of evidence were used. This yields a more structured view than a single-shot LLM decision that outputs only one label or score. At the same time, we acknowledge that MAFC is not yet equipped with a dedicated explanation interface, and we do not evaluate explanation quality or user trust. For high-stakes domains such as health or law, MAFC should therefore be used only as an assistive tool with human oversight, and additional work is needed to design user-facing explanations that highlight evidence, disagreements between agents and residual uncertainty in a transparent way.
Finally, the multi-agent design incurs higher computational costs than single-agent fact-checking. For a text with $n$ extracted claims and $|A|$ agents, the proposed MAFC system requires $n \times |A|$ retrieval and verification calls, leading to an approximately $|A|$-fold increase in LLM usage relative to a single LLM-based agent with similar functions such as claim extraction, query generation and evidence-based fact-checking; compared to a simpler LLM-based agent, the cost is even higher. This cost may be acceptable in scenarios where accuracy and robustness are the focus, but it is a significant limitation in real-time, high-throughput scenarios. The proper use of retrieval-augmented generation (RAG) methods and locally deployed evidence documents for specific discussion topics are natural directions for improving scalability and will be explored in future work.
Conclusion
This paper presents the design and implementation of a multi-agent fact-checking framework, driven by an LLM, to detect and classify the factuality of individual claims and to measure the overall credibility of texts containing rumours or misinformation. The framework comprises three main steps: (1) claim extraction and query generation based on the original text, (2) factuality judgment of every extracted claim by each agent, and (3) conversion of the binary factuality judgments of the claims into the multi-label credibility judgment of the original text through a unique scoring mechanism. Multiple comparative experiments and their results are described to verify the feasibility of the proposed framework. The first group of comparative experiments demonstrates that the proposed method has performance advantages in traditional binary fact-checking tasks. The second group reveals that the proposed method performs overwhelmingly better than the other methods, including a single LLM-based agent and multi-agent systems with previous scoring mechanisms, in multi-label fact-checking tasks. Finally, this paper discusses the advantages and disadvantages of the proposed multi-agent fact-checking framework based on the experimental results, with further consideration of the reasons why it performs better.
By illustrating the feasibility of the framework and describing the potential reasons for the phenomena observed in the experimental results, the proposed multi-agent fact-checking framework can be considered a contribution to the multi-label fact-checking field. Furthermore, the definition of the multi-label fact-checking task is still unclear: there are multiple datasets whose claims carry multiple credibility labels, but why and how those credibility labels were created is often not explained clearly. Thus, the design of the experiments and the methods proposed in this paper can also serve as a baseline contribution to the multi-label fact-checking field.
Author contributions
Y.D. investigated related research, proposed the method raised in the manuscript, conducted comparative experiments, wrote the manuscript, and provided funding. T.I. supervised the research, reviewed the manuscript and provided funding.
Funding
This work was supported by JST CREST Grant Number JPMJCR20D1, Japan and JST SPRING, Grant Number JPMJSP2110, Japan.
Data availability
The data for the comparative experiments in the manuscript are retrieved from SciFact, a public dataset; the DOI of the corresponding paper is 10.18653/v1/2020.emnlp-main.609. Another group of comparative experiments is conducted on FEVER, another public dataset; the DOI of the corresponding paper is https://doi.org/10.48550/arXiv.1803.05355.
Code availability
The code has been uploaded to the following repository and is publicly accessible: https://github.com/Ito-takayuki-lab/multi-agent-fact-checking.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Gupta, M., Dennehy, D., Parra, C. M., Mäntymäki, M. & Dwivedi, Y. K. Fake news believability: The effects of political beliefs and espoused cultural values. Inf. Manag. 60, 103745. 10.1016/j.im.2022.103745 (2023).
- 2.Shao, C. et al. The spread of low-credibility content by social bots. Nat. Commun. 9, 1–9 (2018).
- 3.Vosoughi, S., Roy, D. & Aral, S. The spread of true and false news online. Science 359, 1146–1151 (2018).
- 4.Guo, Z., Schlichtkrull, M. & Vlachos, A. A survey on automated fact-checking. Trans. Assoc. Comput. Linguist. 10, 178–206 (2022).
- 5.Chern, I. et al. Factool: Factuality detection in generative ai-a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528 (2023).
- 6.Dhuliawala, S. et al. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023).
- 7.Manakul, P., Liusie, A. & Gales, M. J. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896 (2023).
- 8.Ma, J., Hu, L., Li, R. & Fu, W. Local: Logical and causal fact-checking with llm-based multi-agents. In Proceedings of the ACM on Web Conference 2025, WWW ’25 1614–1625 (Association for Computing Machinery, New York, NY, USA, 2025). 10.1145/3696410.3714748.
- 9.Kim, K. et al. Can llms produce faithful explanations for fact-checking? Towards faithful explainable fact-checking via multi-agent debate. arXiv preprint arXiv:2402.07401 (2024).
- 10.Xiong, C., Zheng, G., Ma, X., Li, C. & Zeng, J. Delphiagent: A trustworthy multi-agent verification framework for automated fact verification. Inf. Process. Manag. 62, 104241 (2025).
- 11.Liu, Y. et al. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. arXiv preprint arXiv:2212.07981 (2022).
- 12.Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019).
- 13.Kondrak, G. N-gram similarity and distance. In International symposium on string processing and information retrieval 115–126 (Springer, 2005).
- 14.Manakul, P., Liusie, A. & Gales, M. J. Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization. arXiv preprint arXiv:2301.12307 (2023).
- 15.Hong, S. et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 (2023).
- 16.Zhuge, M. et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066 (2023).
- 17.Wu, Q. et al. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155 (2023).
- 18.Hadfi, R. & Ito, T. Augmented democratic deliberation: Can conversational agents boost deliberation in social media? In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’22 1794–1798 (International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2022).
- 19.Hadfi, R., Haqbeen, J., Sahab, S. & Ito, T. Argumentative conversational agents for online discussions. J. Syst. Sci. Syst. Eng. 30, 450–464 (2021).
- 20.Hadfi, R. & Ito, T. Exploring interaction hierarchies in collaborative editing using integrated information. In Collective Intelligence Conference, ACM (2021).
- 21.Ito, T., Hadfi, R. & Suzuki, S. An agent that facilitates crowd discussion: A crowd discussion support system based on an automated facilitation agent. Group Decis. Negot. 31, 1–27 (2022).
- 22.Lu, Y., Heatherly, K. A. & Lee, J. K. Cross-cutting exposure on social networking sites: The effects of SNS discussion disagreement on political participation. Comput. Hum. Behav. 59, 74–81. 10.1016/j.chb.2016.01.030 (2016).
- 23.Gil-de-Zúñiga, H., Jung, N. & Valenzuela, S. Social media use for news and individuals’ social capital, civic engagement and political participation. J. Comput. Med. Commun. 17, 319–336. 10.1111/j.1083-6101.2012.01574.x (2012).
- 24.Shu, K., Sliva, A., Wang, S., Tang, J. & Liu, H. Fake news detection on social media: A data mining perspective. SIGKDD Explor. Newsl. 19, 22–36. 10.1145/3137597.3137600 (2017).
- 25.Zhang, X. & Ghorbani, A. A. An overview of online fake news: Characterization, detection, and discussion. Inf. Process. Manag. 57, 102025. 10.1016/j.ipm.2019.03.004 (2020).
- 26.OpenAI. ChatGPT: Optimizing language models for dialogue (2022). Available online at: https://openai.com/blog/chatgpt (accessed 08 Jan 2024).
- 27.Wadden, D. et al. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 7534–7550 (Association for Computational Linguistics, Online, 2020). 10.18653/v1/2020.emnlp-main.609
- 28.Thorne, J., Vlachos, A., Christodoulopoulos, C. & Mittal, A. FEVER: A large-scale dataset for fact extraction and VERification. In NAACL-HLT (2018).
- 29.Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C. & Mittal, A. The fact extraction and VERification (FEVER) shared task. In Proceedings of the 1st Workshop on Fact Extraction and VERification (FEVER) (eds Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C. & Mittal, A.) 1–9 (Association for Computational Linguistics, Brussels, Belgium, 2018). 10.18653/v1/W18-5501
- 30.Nie, Y., Chen, H. & Bansal, M. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence 33, 6859–6866 (2019).
- 31.Yoneda, T., Mitchell, J., Welbl, J., Stenetorp, P. & Riedel, S. UCL machine reading group: Four factor framework for fact finding (HexaF). In Proceedings of the 1st Workshop on Fact Extraction and VERification (FEVER) (eds Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C. & Mittal, A.) 97–102 (Association for Computational Linguistics, Brussels, Belgium, 2018). 10.18653/v1/W18-5515
- 32.Hanselowski, A. et al. UKP-athene: Multi-sentence textual entailment for claim verification. In Proceedings of the 1st Workshop on Fact Extraction and VERification (FEVER) (eds Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C. & Mittal, A.) 103–108 (Association for Computational Linguistics, Brussels, Belgium, 2018). 10.18653/v1/W18-5516
- 33.Malon, C. Team papelo: Transformer networks at FEVER. In Proceedings of the 1st Workshop on Fact Extraction and VERification (FEVER) (eds Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C. & Mittal, A.) 109–113 (Association for Computational Linguistics, Brussels, Belgium, 2018). 10.18653/v1/W18-5517