Abstract
With the explosion of health-related information in mainstream discourse, distinguishing accurate health claims from misinformation is important, and computational tools and algorithms are key to doing so at scale. Our focus in this paper is the hormone melatonin, which is claimed to have broad health benefits and is largely sold as a supplement. This paper introduces ‘MelAnalyze,’ a framework that adapts generative and transformer-based deep learning models to a natural language inference (NLI) task in order to semi-automate the fact-checking of general melatonin claims. MelAnalyze is built upon a comprehensive collection of melatonin-related scientific abstracts from PubMed for validation. The framework incorporates components for precise extraction of information from scientific literature, semantic similarity, and NLI. At its core, MelAnalyze leverages pretrained NLI models that are fine-tuned on melatonin-specific claims, along with semantic search based on vectorized representations of the articles. The models are tested on melatonin claims from Google and Amazon product descriptions to evaluate the system’s utility. The best model, a fine-tuned version of LLaMA1, attains precision, recall, and F1-score of 0.92. We also introduce a web-based prototype tool to visualize evidence and support fact-checking algorithm evaluation. In summary, we show MelAnalyze’s role in empowering users and researchers to assess melatonin-related claims through evidence-based decision-making.
Keywords: Scientific claim validation, Retrieval-augmented verification, Sentence-level fact checking, Melatonin fact checking, Biomedical text mining
Introduction
In this era of information overload, it is easier than ever to access and share knowledge. Online platforms and social media networks serve as mediums for the global exchange of such knowledge. While these mediums allow easy dissemination of breakthroughs, health insights, and the like, they can also be used to spread false information. Such false information, especially in medical and health areas, can quickly proliferate and even lead to misleading product claims on e-commerce websites aimed at driving sales [1]. These claims often directly influence consumer purchasing behaviors. Traditionally, scientific claims were largely confined to academic journals and expert discussions, but nowadays they quickly become part of public discourse, from the effectiveness of traditional remedies to the safety of modern pharmaceuticals [2]. Their potential impact calls for a systematic approach to validating these claims. The issue is even more concerning for natural compounds, such as supplements, which usually fall outside the purview of regulatory authorities like the FDA. The effects of this misinformation include influencing consumer health choices, misleading consumers about product efficacy, and even potentially leading to adverse health outcomes [3]. Using scientific evidence for validation, we can counter false claims effectively [4]. However, manually sifting through the vast body of scientific literature to validate every claim is neither feasible nor efficient. Automated fact-checking can serve as a powerful tool for tackling this challenge. Recent advancements in Natural Language Processing (NLP) with models such as GPT-3, GPT-4, and BERT have produced robust tools for processing and understanding natural language. These models excel in a range of language-related tasks, from machine translation to sentiment analysis, but their potential in fact-checking lies in their capacity to comprehend and contextualize textual information [5, 6].
As a case study to drive our approach to automated fact-checking, we focus on the domain of melatonin-related claims. Melatonin, a hormone produced in the pineal gland, has attracted considerable interest for its diverse physiological effects, including an important role in circadian rhythms and applications for treating conditions such as sleep disorders and jet lag [7]. Melatonin appears to play a multifaceted role in regulating various physiological processes. Scientific studies have explored its potential as a sleep aid and regulator of circadian rhythms [8–10], as well as its antioxidant and anti-inflammatory effects. In fact, public interest in melatonin extends beyond its role in circadian rhythms and sleep regulation. Melatonin has been marketed as a potential treatment for various health conditions, including anxiety, depression, and even cardiovascular diseases, cancer, sepsis, and COVID-19 [11, 12]. A large number of claims about melatonin’s health benefits have surfaced in consumer products. However, alongside legitimate claims there is considerable misinformation. The strong scientific evidence base for melatonin makes it a good case study for automated fact-checking. Scientific claim verification itself is commonly evaluated at the abstract level (for instance, SciFact [13]), matching our solution design. We chose melatonin as a controlled vertical with abundant literature and concrete public health relevance, as its use has risen [14, 15], making it appropriate for demonstrating a fact-checking pipeline before scaling the same architecture to other domains.
Specifically, this paper introduces ‘MelAnalyze,’ a computational framework that builds on generative and transformer-based deep learning models, formalized as a natural language inference (NLI) task, to facilitate semi-automatic fact-checking of general melatonin claims. The framework integrates melatonin-related scientific abstracts from PubMed that serve as the basis for evidence-based validation.
In this context, our key objectives and contributions are:
Comprehensive Framework: To develop a comprehensive framework for automated fact-checking, specifically tailored to melatonin-related claims.
Melatonin-Specific Models: To fine-tune existing state-of-the-art language models on the melatonin corpus.
Prototype Tool: To provide a prototype web-based tool for evaluating fact-checking algorithms, enabling end-users and researchers to assess the veracity of melatonin product claims using appropriate scientific evidence.
Beyond the immediate domain of melatonin-related claims, the ‘MelAnalyze’ framework has broader implications, as it can be adapted to other domains as well. For instance, the framework can readily be extrapolated to evaluate other chemicals or to mine mass consumer data, extracting insights into both positive and negative consumer experiences. Such insights can be useful in pharmacovigilance safety-signal detection and in the triage of customer complaints. For consumer decision-making, recommendation systems significantly influence our choices [16]. However, there is a risk of “algorithm over-dependence,” where consumers might place undue trust in algorithm-generated recommendations even when they are flawed [17]. Coupled with aspects of human-computer interaction, persuasive technologies can influence purchasing decisions [18]. Together, their potential for disseminating misinformation or biased recommendations can have significant implications. By integrating a fact-checking layer such as ‘MelAnalyze’, recommendation systems can provide more accurate and trustworthy suggestions. This ensures not only that consumers receive product recommendations tailored to their preferences but also that these recommendations are grounded in verifiable facts. Validating product claims and recommendations against scientific evidence or verified facts can enhance consumer trust and support better decision-making. We illustrate additional impact areas in health in Fig. 1. Adaptation to other domains is straightforward: it involves replacing the evidence database with a different domain-specific corpus, recomputing sentence-level embeddings, and optionally fine-tuning the NLI model on a modest domain-labeled set; the retrieval+NLI pipeline is unchanged.
Fig. 1.
Overview of the broader impact areas of science based fact checking in health
In summary, ‘MelAnalyze’ not only addresses the challenges of fact-checking melatonin-related claims but also showcases the broader applicability and necessity of automated fact-checking. In the subsequent sections, we delve deeper into background and related work, methods, results, and challenges.
Background and related work
In this section, we lay out the foundational concepts used in the “MelAnalyze” framework. The intention is for this to serve as a primer on the different aspects; we refer the interested reader to the corresponding references for more detailed information.
Natural language inference
The Natural Language Inference (NLI) task consists of ascertaining the logical relationships between pairs of sentences. Given a premise sentence P and a hypothesis sentence H, the goal is to determine the nature of the relationship between them. Specifically, classifying if the relationship between P and H falls into one of three categories: “entailment,” “contradiction,” or “neutral”. “Entailment” signifies that the meaning of H can be logically inferred from P, indicating a stronger relationship between the two sentences. A “contradiction” indicates that H contradicts the information presented in P, reflecting a clear inconsistency. The “neutral” label implies that H neither entails nor contradicts P.
NLI models typically use a supervised learning approach. Given a labeled data set of premise-hypothesis pairs, the goal is to train a model that can generalize the relationship classification to unseen examples. The introduction of large-scale pretrained language models and their variants has significantly advanced the state of the art in NLI. The general approach involves pretraining the model on a massive corpus to learn contextualized representations of words and sentences. Subsequently, the model is fine-tuned on NLI-specific data sets to adapt its knowledge to the task at hand. During fine-tuning, the model learns to make accurate predictions. The model processes the premise and hypothesis sentences through its architecture and generates embeddings for both. These embeddings are then utilized for classification, often through a neural network layer. The final output layer provides the probabilities of the three classes (“entailment,” “contradiction,” and “neutral”), and the predicted label is chosen based on the highest probability.
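As a concrete illustration of the final classification step described above, the sketch below applies a softmax to per-class logits and picks the highest-probability label. The logit values are invented for illustration; a real NLI model would produce them from the premise-hypothesis pair.

```python
import math

LABELS = ("entailment", "contradiction", "neutral")

def softmax(logits):
    """Convert raw class logits to probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (label, probability) for the highest-scoring NLI class."""
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs[best]

# Illustrative logits, not real model output
label, prob = predict_label([3.1, -1.2, 0.4])
print(label)  # entailment (largest logit)
```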
Towards automatic fact checking
Fact checking involves the application of NLP techniques to automate the process of verifying the accuracy of claims, statements, or information. NLI serves as a fundamental framework for fact checking, enabling the assessment of the relationship between a claim and available evidence.
The utilization of NLI models for fact checking begins by inputting the claim into the system. The claim is transformed into a semantic representation, often in the form of an embedding as discussed above. Subsequently, the system compares this embedding against embeddings of potential evidence from reliable sources. Using cosine similarity as the similarity metric, the system ranks the evidence based on its semantic proximity to the claim. Evidence with smaller cosine distances is considered more relevant to the claim. The highest-ranked evidence can then be presented to the NLI algorithm, aiding it in making informed judgments about the claim’s veracity.
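The ranking step described above can be sketched with plain cosine arithmetic. The three-dimensional vectors below stand in for real sentence embeddings and are invented for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more semantically similar."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def rank_evidence(claim_vec, evidence, top_k=2):
    """Return the top_k (sentence, distance) pairs closest to the claim."""
    scored = [(sent, cosine_distance(claim_vec, vec)) for sent, vec in evidence]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

# Toy claim embedding and candidate evidence embeddings
claim = [1.0, 0.0, 1.0]
evidence = [
    ("Melatonin reduced sleep onset latency.", [0.9, 0.1, 1.1]),
    ("The study enrolled 40 participants.",    [0.0, 1.0, 0.0]),
    ("Melatonin improved subjective sleep.",   [1.0, 0.2, 0.8]),
]
for sentence, dist in rank_evidence(claim, evidence):
    print(f"{dist:.3f}  {sentence}")
```

The top-ranked sentences would then be handed to the NLI model as premises against the claim.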
Automated fact checking has found applications in the domains of scientific evidence and biomedicine. In scientific research and literature, NLI techniques have been employed to assess the compatibility of hypotheses with existing evidence, aiding researchers in formulating novel insights. NLI also plays a crucial role in combating misinformation in biomedicine, assisting in verifying the accuracy of health-related claims and ensuring that medical advice and information disseminated to the public are evidence-based and reliable.
Transformers and large language models
In recent years, the introduction of transformers and large language models has transformed the field of NLP. These advancements have not only reshaped the landscape of NLP tasks but have also enabled Natural Language Inference (NLI) with greater accuracy and efficiency.
Traditional approaches to NLP often relied on hand-crafted features and domain-specific knowledge, which limited their scalability and adaptability. The advent of transformers [19] marked a paradigm shift in NLP. Transformers employ self-attention mechanisms to weigh the significance of different words in a sequence, enabling them to capture contextual relationships regardless of word order. This innovation unlocked the potential to model long-range dependencies and capture nuanced linguistic patterns. One of the well-known breakthroughs in large language models is BERT (Bidirectional Encoder Representations from Transformers) [5]. BERT’s bidirectional training approach, in which the model learns from both left and right context, established new benchmarks across a range of NLP tasks. BERT’s success inspired subsequent models, including RoBERTa [20], a robustly optimized BERT variant, and others like GPT-3 [6], which focused on generative tasks. Foundation LLMs are usually general purpose, but domain adaptation has been shown to yield better in-domain performance in biomedicine and other knowledge-intensive tasks (such as BioBERT [21], PubMedBERT [22]; DAPT [23]).
In the context of NLI, which involves determining the relationship between two sentences (premise and hypothesis), both generative and discriminative models can be explored. Generative models aim to generate the label (entails, contradicts, or neutral) when fine-tuned, while discriminative models focus on predicting the label of the relationship between the hypothesis and the premise. We formulate the fact-checking problem as an NLI task, as shown in Fig. 2. More details are shared in the following sections.
Fig. 2.
Problem formulation
Semantic similarity and vector search
Computing semantic similarity is an important task in numerous NLP applications, including NLI. In recent years, with advances in the vectorization of text, cosine distance metrics have been used to quantify the semantic similarity of text. Computing semantic similarity involves transforming textual data into a suitable vector representation (or embedding) that captures the underlying semantic meaning. Sentence-BERT [24], based on the BERT architecture, is one of the most widely used models for computing vector representations of sentences. In this process, each sentence s is encoded into a dense vector v_s using the BERT model. The vectors are then normalized to unit length so that the magnitude of the vector does not affect the similarity calculation. Given two sentences s1 and s2, their cosine similarity is computed using the dot product of their normalized vectors:
$$\mathrm{cos\_sim}(s_1, s_2) = \frac{v_{s_1} \cdot v_{s_2}}{\lVert v_{s_1} \rVert \, \lVert v_{s_2} \rVert}$$
The model is trained with a contrastive loss function, encouraging similar sentences to be pulled together and dissimilar ones to be pushed apart in the vector space. This training strategy enables Sentence-BERT to capture intricate semantic nuances, thereby enabling more accurate semantic similarity comparisons. Once sentences are represented as vectors, the next step involves quantifying the semantic similarity between them. Cosine similarity is calculated as the cosine of the angle between two vectors and measures their similarity irrespective of their magnitude. The cosine similarity can be converted into a cosine distance by subtracting it from 1:
$$\mathrm{cos\_dist}(s_1, s_2) = 1 - \mathrm{cos\_sim}(s_1, s_2)$$
A smaller cosine distance implies a higher degree of semantic similarity between vectors, while a larger distance indicates greater dissimilarity. As such, cosine distance serves as an effective measure for ranking and identifying the most semantically similar sentences in NLI tasks. By generating semantically rich embeddings, Sentence-BERT enables the model to grasp subtle nuances in sentence meaning. This, combined with the cosine distance metric, empowers the system to quantify the semantic gap between sentences with precision. For vector search, the computed semantic similarity scores can be utilized to identify related sentences. The vector search efficiently retrieves relevant sentences from a database, which can then be used with NLI algorithms for the verification of textual claims.
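A minimal numeric check of the two formulas above: after normalizing to unit length, cosine similarity reduces to the dot product, and cosine distance is one minus that value. The 2-D vectors are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# After normalization, cosine similarity is just the dot product,
# and cosine distance is 1 - similarity.
v1 = normalize([3.0, 4.0])  # -> [0.6, 0.8]
v2 = normalize([4.0, 3.0])  # -> [0.8, 0.6]
sim = sum(a * b for a, b in zip(v1, v2))
dist = 1.0 - sim
print(round(sim, 4), round(dist, 4))  # 0.96 0.04
```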
Melatonin and its importance
Melatonin is primarily known for its role in regulating the sleep-wake cycle and was first discovered in the pineal gland [25–29]. However, melatonin has since been detected in multiple extrapineal tissues and has garnered substantial attention in recent years for its potential health benefits beyond sleep management [30]. Melatonin is widely used as a dietary supplement to address various sleep disorders, jet lag, and even certain neurological conditions. However, its popularity has also led to a surge in marketing claims that often go beyond the established scientific understanding [31].
This is where fact checking comes in handy. With the help of a fact checking framework such as MelAnalyze, it becomes possible to scrutinize the information presented in marketing materials, product labels, and online content against reputable scientific sources. For instance, claims such as “Melatonin prevents all sleep disorders” or “Melatonin is completely risk-free” can be objectively assessed by fact checking algorithms against the existing body of biomedical literature. Moreover, the fact checking process can in principle also check details of melatonin’s mechanism of action, dosing recommendations, potential side effects, and interactions with other substances. This comprehensive evaluation ensures that consumers are provided with accurate and well-substantiated information, enabling them to make informed decisions about melatonin usage based on the best available evidence.
Data set preparation for training and validation
Data set for model training and testing
The accuracy of our melatonin NLI models relies on the quality and diversity of the training data. To construct a robust and specialized NLI model for melatonin-related claims, a detailed data preparation process was undertaken. This section outlines the steps involved in curating the data sets. Details of the whole process are shown in Fig. 3.
Fig. 3.
Overview of the data sets used for training the best NLI model for melatonin. The best NLI model was finally used on Amazon product reviews of melatonin
Expert-guided claim selection
Claims were curated to target testable statements about melatonin. Our inclusion criteria required declarative claims linking melatonin to a defined outcome (e.g., sleep onset latency, jet lag, blood pressure), phrased so that they can be checked against the biomedical literature. Dosing-only tips, anecdotal narratives without a verifiable endpoint, statements about non-melatonin substances, and near-duplicate paraphrases were excluded. A melatonin domain expert labeled each included claim as True or False and attached supporting PMIDs; these labels form the melatonin-specific supervision used for fine-tuning and evaluation. For data splits, expert-tagged melatonin claims were used for training/validation together with general NLI corpora (SNLI, SciNLI), while web claims from Amazon and Google were held out for external testing (counts in Table 2).
Table 2.
Overview of data sources and counts
| Data Source | Count |
|---|---|
| Synthetic Dataset | 9789 |
| Expert Tagged Dataset | 972 |
| Amazon product description | ~250 |
| Google Search | ~1000 |
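A rough sketch of how inclusion criteria like those above might be automated as a pre-filter. The cue phrases and dosing pattern here are our own illustrative assumptions, not the experts’ actual rules; in the study, the final labels came from a human domain expert.

```python
import re

# Hypothetical outcome cues and a dosing-only pattern (assumptions for
# illustration; the real criteria were applied by a domain expert).
OUTCOME_CUES = ("sleep onset latency", "jet lag", "blood pressure")
DOSING_ONLY = re.compile(r"^\s*take \d+ ?mg\b", re.IGNORECASE)

def include_claim(text):
    """Keep declarative melatonin claims with a checkable outcome."""
    t = text.lower()
    if "melatonin" not in t or DOSING_ONLY.search(t):
        return False
    return any(cue in t for cue in OUTCOME_CUES)

print(include_claim("Melatonin reduces sleep onset latency."))  # True
print(include_claim("Take 5 mg before bed."))                   # False
```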
Abstract-based statement extraction
Building on the expert-verified claims, melatonin-related abstracts were extensively analyzed. Employing a pattern-matching strategy, statements corresponding to the abstracts’ conclusions were identified [32]. These statements served as the foundation for subsequent data augmentation and paraphrasing steps. To enhance diversity, a pretrained NLP paraphrasing model was employed to generate paraphrased statements [33] and negation statements [32].
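The pattern-matching idea can be sketched as a cue-phrase filter over an abstract’s sentences. The cue list and sample abstract below are illustrative assumptions; the study’s actual patterns follow [32].

```python
import re

# Hypothetical conclusion cue phrases (assumptions for illustration)
CONCLUSION_CUES = re.compile(
    r"\b(in conclusion|we conclude|our (results|findings) (suggest|indicate)|"
    r"these (results|data) (suggest|indicate))\b",
    re.IGNORECASE,
)

def extract_conclusions(abstract):
    """Return sentences that match a conclusion cue phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [s for s in sentences if CONCLUSION_CUES.search(s)]

abstract = (
    "Forty adults received 3 mg melatonin nightly. "
    "Sleep onset latency decreased versus placebo. "
    "In conclusion, melatonin shortened sleep onset latency in this cohort."
)
print(extract_conclusions(abstract))
```

Extracted conclusion sentences would then feed the paraphrasing and negation steps.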
Existing NLI data sets
To bolster the data set’s comprehensiveness, existing NLI data sets were incorporated. Specifically, the Stanford Natural Language Inference (SNLI) data set [34] and the SciNLI data set [35] were integrated. These data sets contributed a diverse range of general NLI instances, enriching the model’s ability to handle a wider spectrum of language structures and inferences. It is important to note that this data set was not melatonin specific.
Combining data sets and splitting for training and validation
The combination of these three distinct data sets resulted in a comprehensive training data set specifically tailored to melatonin-related NLI tasks. The resulting training data encompassed specialized domain knowledge while also embracing the nuances of general NLI data sets. This combination aimed to strike a balance between domain-specific expertise and broader language understanding, fostering a versatile and well-informed melatonin-related NLI model. To facilitate effective model training and performance evaluation, an 80:20 split was employed. To limit overfitting, we kept all web claims data as a strictly held-out external test set (never used for training, validation, or model selection). We also report zero-shot variants of the same model families (Table 1, rows marked w/o.retraining) to contrast fine-tuned against base capability. Finally, the retrieval step presents source abstracts with PMIDs, externalizing knowledge rather than relying solely on parametric memory.
Table 1.
Performance metrics of various fine-tuned and NLI models. Models fine-tuned from LLaMA1 and RoBERTa performed the best. w/o.Retraining indicates zero-shot baselines; fine-tuned indicates models fine-tuned on the melatonin NLI training set
| Models | Approach | Training | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| bioBERT_v1.1_PubMed_nli_sts-w/o.retraining | NLI Specific | Base | 0.25 | 0.3 | 0.19 |
| bioBERT-nli-w/o.retraining | NLI Specific | Base | 0.22 | 0.31 | 0.15 |
| amoux/sciBERT_nli_squad.w/o.retraining | NLI Specific | Base | 0.3 | 0.46 | 0.3 |
| LLaMA1 | Generative | Fine tuned | 0.92 | 0.92 | 0.92 |
| LLaMA2 | Generative | Fine tuned | 0.91 | 0.91 | 0.91 |
| RoBERTa | Base | Finetuned | 0.91 | 0.91 | 0.91 |
| bioBERT_v1.1_PubMed_nli_sts | NLI Specific | Finetuned | 0.9 | 0.9 | 0.9 |
| bioBERT-nli | NLI Specific | Finetuned | 0.9 | 0.9 | 0.9 |
| amoux/sciBERT_nli_squad | NLI Specific | Fine tuned | 0.89 | 0.89 | 0.89 |
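The 80:20 train/validation split described above can be sketched minimally, with a fixed seed for reproducibility; the claim/label pairs are placeholders.

```python
import random

def split_80_20(examples, seed=42):
    """Shuffle and split examples into 80% train / 20% validation."""
    examples = list(examples)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(examples)
    cut = int(0.8 * len(examples))
    return examples[:cut], examples[cut:]

# Placeholder (text, NLI label id) pairs standing in for the real corpus
data = [(f"claim {i}", i % 3) for i in range(100)]
train, val = split_80_20(data)
print(len(train), len(val))  # 80 20
```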
Real-world claims data set for testing
The internet has become a significant platform for consumers to gather information about various products, including melatonin. Online forums serve as spaces where individuals share their experiences and insights regarding melatonin. We curated a data set from the internet to validate the usefulness of our models. Data for this analysis were obtained by conducting a Google search using the keyword “Melatonin medicine”. We collected information related to melatonin medicine from various sources, including blogs and articles. Next, we utilized the MelAnalyze framework, which identified claims found to be false by the system and provided supporting evidence for their categorization. The Taxila [36] tool was employed to streamline the collection procedure. This data set constituted real-world examples that could be subjected to empirical validation using the developed NLI model.
MelAnalyze framework
Our proposed “MelAnalyze” framework, shown in Fig. 4, processes input claims through a series of steps. It begins with feature generation, computing vector embeddings using the Sentence-BERT [24] algorithm. Similarly, we compute embeddings for every sentence from ~30,000 melatonin-related PubMed abstracts, which together form the evidence database. For each input claim, we retrieve the top-5 sentences by cosine similarity (embedding-based semantic similarity) and run the best NLI model on these retrieved sentences, generating a true/false or unclear assertion for every comparison. For presentation to end users, we display the full abstracts from which the matched sentences originate, with the supporting or refuting spans highlighted. To port the framework to a new domain, we swap the evidence database for a domain corpus, recompute embeddings, and, if needed, fine-tune the NLI model with a small domain-labeled set.
Fig. 4.
Overall framework of MelAnalyze
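The retrieve-then-verify flow above can be sketched end to end. This toy version is an assumption-laden stand-in: a bag-of-words embedder replaces Sentence-BERT, and `nli_stub` replaces the fine-tuned LLaMA NLI model; only the pipeline shape matches the framework.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for Sentence-BERT)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nli_stub(premise, hypothesis):
    # Placeholder: a real system would run the fine-tuned NLI model here.
    overlap = cosine(embed(premise), embed(hypothesis))
    return "entailment" if overlap > 0.5 else "neutral"

def check_claim(claim, evidence_sentences, top_k=2):
    """Retrieve the top_k closest evidence sentences, then run NLI on each."""
    claim_vec = embed(claim)
    ranked = sorted(evidence_sentences,
                    key=lambda s: cosine(claim_vec, embed(s)), reverse=True)
    return [(s, nli_stub(s, claim)) for s in ranked[:top_k]]

evidence = [
    "melatonin reduced sleep onset latency in adults",
    "participants completed a dietary questionnaire",
    "melatonin improved sleep quality scores",
]
for sentence, verdict in check_claim("melatonin reduced sleep onset latency", evidence):
    print(verdict, "|", sentence)
```

In the real framework, the evidence pool is ~30,000 PubMed abstracts embedded sentence by sentence, and the verdicts come from the fine-tuned NLI model rather than a lexical-overlap rule.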
Natural language inference models considered
The process of identifying the optimal NLI model for integration into the MelAnalyze framework involved considering a combination of generative and discriminative models, as well as NLI-specific models. The generative models considered were LLaMA1 and LLaMA2, along with the discriminative model RoBERTa. NLI-specific models, namely bioBERT_v1.1_PubMed_nli_sts, bioBERT-nli, and amoux/sciBERT_nli_squad, were also considered. We fine-tuned LLaMA1/2, RoBERTa, and all three NLI-specific BERT variants; entries marked w/o.retraining in Table 1 denote zero-shot baselines (no task-specific updates).
The process includes several key stages. First, the NLI-specific models are used as-is in a zero-shot setting to assess their out-of-the-box effectiveness. Next, each generative model is fine-tuned through prompt engineering, while RoBERTa is fine-tuned using traditional fine-tuning approaches. Subsequently, fine-tuning is performed on the NLI models as well, yielding refined versions of the various models. A comprehensive evaluation is conducted to determine the efficacy of each fine-tuned model. This assessment leads to the identification of the most suitable NLI model, which is then integrated into the MelAnalyze framework to facilitate accurate and effective claim evaluation. Fig. 5 illustrates the evaluation of the different NLI models. Details of the different models are given below.
Fig. 5.
Overview of the different NLI models considered and the fine tuning strategy applied
LLaMA1
LLaMA1 [37] (Large Language Model Meta AI) is the first generation of state-of-the-art foundational LLMs designed by researchers at Meta and released on Feb 24, 2023. The LLaMA collection of language models ranges from 7 to 65 billion parameters in size, making it one of the most comprehensive language model families. For this study, we fine-tuned LLaMA1 in a LoRA [38] setting, which requires fewer resources and is comparatively faster.
LLaMA2
LLaMA 2 [39] is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. The architecture is very similar to the first LLaMA model, with the addition of Grouped Query Attention (GQA). The model is trained on 2 trillion tokens of data from publicly available sources. The pretraining setting and model architecture are adopted from LLaMA1. The training is done in a QLoRA [40] (Quantization-aware Low-Rank Adapter Tuning) setup. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance.
RoBERTa-base
RoBERTa [41] (Robustly Optimized BERT Approach) is a variant of the BERT model developed by Meta AI. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates for a longer duration. One key difference between RoBERTa and BERT is that RoBERTa was trained on a much larger data set using a more effective training procedure. It achieved state-of-the-art performance on the MNLI, QNLI, RTE, STS-B, and RACE tasks.
BioBERT.V1.1.PubMed.nli.sts
The BioBERT-based bioBERT_v1.1_PubMed_nli_sts model is a BERT-based binary classification model available from HuggingFace. It was trained on a PubMed NLI data set. Since it is trained on a PubMed scientific corpus, we chose this model for our experiments.
BioBERT-nli
This is BioBERT fine-tuned on the SNLI and MultiNLI data sets using the sentence-transformers library to produce universal sentence embeddings. The model uses the original BERT wordpiece vocabulary and was trained using an average pooling strategy and a softmax loss. monologg/bioBERT_v1.1_PubMed is the base model used for fine-tuning via HuggingFace’s AutoModel.
SciBERT.Nli.squad
SciBERT_nli_squad is another NLI BERT base model fine-tuned on a scientific corpus. The model type is BERT with 12 attention heads and 12 hidden layers. The maximum embedding size is 512. Since it is trained on a scientific data set, we chose this model for our experiments.
User interface: web based tool
Our framework incorporates a user-friendly web-based prototype, shown in Fig. 6, that serves as an intuitive platform for users to actively engage with the automated fact-checking process. This interface is designed to offer a seamless experience, allowing users to input their claims and receive real-time fact-checking results. A distinctive feature of our tool is its provision of comprehensive evidence derived from scientific publications. In addition to delivering the results, we furnish users with contextual information concerning the evidence sources. This includes key details such as the impact factor of the journal where the evidence was published, comprehensive journal information, and the unique PubMed ID (PMID) linked to the respective publication. We also display the full abstract corresponding to each retrieved sentence so that users can review the broader context, with the matched spans highlighted. This enriched presentation of evidence not only facilitates the evaluation of claims but also enables users to assess the credibility and reliability of the underlying sources. By presenting this supplementary information, our interface offers users a holistic view, allowing them to make well-informed decisions based on both the fact-checking outcomes and the supporting scientific evidence. This thoughtful design promotes transparency, encourages thorough examination, and contributes to a more responsible approach to navigating and disseminating information. This paper reports the prototype’s design and evidence presentation; no field deployment or formal user study with end users was conducted in this study. We make no additional claims about end-user usability.
Fig. 6.
Overview of the simple user interface for checking the veracity of claims using MelAnalyze. The center pane shows the main UI presented to the user. Once the user types the claim, if the claim is true and backed by science (shown on the left), the pertinent sentences in the abstracts are marked in green. Similarly, if the claim is false (shown on the right), the sentences refuting the claim are marked in red
Experimental results
Table 1 presents a comprehensive overview of the experimental results obtained from the different fine-tuned NLI models built for melatonin within the MelAnalyze framework. These models fall into two main categories: generative models and discriminative NLI-specific models. The performance of each model is assessed using three fundamental metrics: precision, recall, and F1-score, which collectively offer a comprehensive evaluation of their effectiveness.
In the generative models category, LLaMA1 exhibited remarkable performance, achieving a precision of 0.92, recall of 0.92, and an F1-score of 0.92. LLaMA2 also demonstrated competitive results, with a precision of 0.91, recall of 0.91, and an F1-score of 0.91. In the discriminative NLI-specific models category, the RoBERTa model displayed consistent precision, recall, and F1-score values, all at an impressive 0.91. Similarly, both the bioBERT_v1.1_PubMed_nli_sts model and the bioBERT-nli model showcased robust performance, maintaining precision, recall, and F1-score values of 0.9 across the board. The precision-recall curve, depicted in Fig. 7, offers a visual representation of the results.
Fig. 7.
Precision recall curve for the various fine-tuned pre-trained models
Furthermore, the table contrasts the models in their base and fine-tuned iterations. Notably, the base version of bioBERT_v1.1_PubMed_nli_sts yielded low precision, recall, and F1-score values of 0.25, 0.3, and 0.19, respectively; the base version of bioBERT-nli achieved similarly low values of 0.22, 0.31, and 0.15; and the base version of amoux/sciBERT_nli_squad also performed poorly. In contrast, all fine-tuned NLI models demonstrated competitive metrics, highlighting the value of fine-tuning NLI models on new melatonin-specific data. The best-performing models (LLaMAv1 and fine-tuned RoBERTa) are highlighted in bold, underscoring their capability to effectively assess the veracity of claims through the MelAnalyze framework.
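As a refresher on the three reported metrics, precision, recall, and F1-score can be computed directly from prediction counts. The counts below are made up for illustration and are not taken from Table 2.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts (binary, per-class view)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts only (not from Table 2): 92 correct positives,
# 8 false alarms, 8 misses -> precision = recall = F1 = 0.92
p, r, f1 = precision_recall_f1(tp=92, fp=8, fn=8)
```

For multi-class NLI labels (entail/contradict/neutral), these per-class values are typically averaged (macro or weighted) to produce the single figures reported.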
Experimental parameters
We fine-tuned all evaluated model families in Table 1; runs labeled w/o. retraining report zero-shot performance. For the LLaMA1 models, fine-tuning was performed using the instruction-tuning approach with a LoRA setup. The tuning parameters were as follows: learning_rate =
, batch_size = 128, micro_batch_size = 4, warm_iters = 100. The LoRA setup was kept at defaults, with lora_r = 8, lora_alpha = 16, and lora_dropout = 0.05. The maximum sequence length was set to 512. For LLaMA instruction tuning, the Lit-LLaMA repository was employed. The LLaMA2 model was fine-tuned using the QLoRA approach, which reduces the memory footprint of large language models during fine-tuning without compromising performance. The LoRA configuration, based on the QLoRA paper, was as follows: lora_alpha = 16, lora_dropout = 0.1, r = 64, bias = “none”, task_type = “CAUSAL_LM”. Other parameters included num_epochs = 10, batch_size = 32, gradient_accumulation_steps = 2, optim = “paged_adamw_32bit”, learning_rate =
, bf16 = True, tf32 = True, max_grad_norm = 0.3, and warmup_ratio = 0.03. Both models were trained on an NVIDIA A40 GPU instance. LLaMA1 took 1.5 days to train, while LLaMA2-13B completed in 5 days. Both models utilized around 31 GB of GPU memory during training.
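The QLoRA configuration reported above can be expressed with the Hugging Face `peft` library, shown here as a hedged sketch (it assumes `peft` is installed; the authors’ actual training script is not published in this section):

```python
# Sketch of the reported QLoRA adapter configuration using the
# Hugging Face `peft` API. Parameter values mirror the text above;
# the surrounding training code is not reproduced here.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # LoRA rank
    lora_alpha=16,      # scaling factor
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
```

This config object would then be passed to `get_peft_model` (or a trainer that accepts it) along with a 4-bit-quantized base model, which is what distinguishes QLoRA from plain LoRA.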
For fine-tuning the BERT-based RoBERTa model and the other three SCI-NLI models, the following parameters were used: num_epochs = 5, train_batch_size = 8, and learning_rate =
. All BERT models were trained on an NVIDIA GeForce RTX 2080 Ti with 12GB of memory. Training each model took approximately 5–6 hours and used around 5 GB of GPU memory.
With current tooling, the pipeline is straightforward to operationalize: a pretrained encoder for embeddings, a vector store for sentence-level retrieval, and the NLI model behind the web prototype. Fine-tuning uses LoRA/QLoRA and ran on single-GPU hardware in this study; routine updates are incremental (embed new texts and append to the evidence database) rather than full retraining.
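The retrieval step of this pipeline can be sketched in miniature. In the sketch below, a toy bag-of-words embedding stands in for the pretrained sentence encoder, and the corpus and claim are invented examples; in the real system, a model such as Sentence-BERT and a vector store would take their place.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding. A pretrained sentence encoder
    (e.g. Sentence-BERT) fills this role in the actual pipeline."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(claim, sentences, k=5):
    """Return the top-k evidence sentences most similar to the claim;
    these would then be passed to the NLI model for entailment checking."""
    cv = embed(claim)
    return sorted(sentences, key=lambda s: cosine(cv, embed(s)), reverse=True)[:k]

# Hypothetical mini evidence corpus (invented sentences, not from PubMed)
corpus = [
    "Melatonin improved sleep onset latency in adults with insomnia.",
    "The pineal gland secretes melatonin at night.",
    "No effect of melatonin on blood pressure was observed.",
]
top = retrieve("melatonin improved sleep in adults", corpus, k=2)
```

Incremental updates follow the same path: new abstracts are split into sentences, embedded once, and appended to the store, with no retraining required.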
Data collection
The evidence knowledge base used for retrieval in this study consists of PubMed abstracts (last update October 2023). The evidence database is a pluggable component into which additional text corpora (e.g., PubMed Central full texts, clinical guidelines, or non-English sources with suitable embedding models) can be integrated using the same pipeline. The present evaluation reports only the PubMed abstracts configuration. We also restrict evidence to PubMed-indexed journal abstracts and treat them as sufficiently vetted for this feasibility study. We do not perform secondary critical appraisal or compute study-quality scores; instead, we expose the PMIDs and render full abstracts so readers can inspect the original sources. Detailed data collection, evidence database, and training data statistics are provided in Tables 2, 3 and 4.
Table 3.
Evidence database used for the MelAnalyze framework
| Evidence database (retrieval) | Details (this study) |
|---|---|
| Source | PubMed abstracts (PMID-linked) |
| Last update | October 2023 |
| Abstracts indexed | 30,000 |
| Sentences embedded | All sentences from the above abstracts |
| Component design | Pluggable, source-agnostic |
Table 4.
Dataset split for training, validation, and testing
| Data Split | Count |
|---|---|
| Train | 277,139 |
| Validation | 2,484 |
| Test | 5,943 |
Empirical evaluation of the MelAnalyze framework on claims from the internet
Below are several claims gathered from the internet and assessed by the MelAnalyze system; the list includes claims the system categorized as “False” as well as claims categorized as “True.”
Conclusion, limitations and future directions
In summary, the combination of NLI models, semantic similarity, and automated fact-checking enables an effective approach for evaluating scientific claims, specifically melatonin-related information. Using advanced NLI models, we have demonstrated the feasibility of this method for assessing claim accuracy and combating misinformation. The evaluation targets model performance and evidence presentation; public-facing usability was not assessed as part of this study.
Limitations of the study
MelAnalyze is scoped as a sentence-level, abstract-based fact-checker to enable transparent, fast retrieval at scale. We embed every sentence from 30,000 PubMed abstracts and run NLI on the top-5 retrieved sentences per claim, while displaying the full abstracts for context. This design may miss details present only in full texts or guidelines, and its output can be sensitive to errors in embedding, retrieval, and NLI. We mitigate this by training with expert-tagged claims, highlighting evidence spans, and versioning the corpus (last update October 2023). Residual memorization cannot be fully excluded; however, the external web-claims test, the inclusion of zero-shot baselines, and retrieval with source display are intended to mitigate over-fitting. The framework can incorporate full texts, clinical guidelines, registries, and preprints via the same pipeline; the present study scopes evidence to PubMed abstracts. MelAnalyze is intended as decision support to triage claims and surface evidence efficiently, complementing expert review. We did not benchmark outside melatonin in this study; cross-domain evaluations using domain-specific corpora and labels are potential future work.
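The text does not spell out how the per-sentence NLI labels for the top-5 retrieved sentences are combined into a single verdict. One simple aggregation rule, shown here purely as an assumption rather than the paper’s stated method, is a majority vote in which neutral labels abstain:

```python
from collections import Counter

def aggregate_verdict(nli_labels):
    """Combine per-sentence NLI labels ('entail', 'contradict', 'neutral')
    into one verdict by majority vote. Neutral labels abstain; ties or
    all-neutral inputs yield 'unverified'.
    NOTE: this aggregation rule is an assumption, not the paper's method."""
    votes = Counter(label for label in nli_labels if label != "neutral")
    if not votes:
        return "unverified"
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return "unverified"  # entail/contradict tie
    return "true" if top_label == "entail" else "false"

# Hypothetical labels for the top-5 retrieved evidence sentences
verdict = aggregate_verdict(["entail", "neutral", "entail", "contradict", "entail"])
```

Abstaining on neutral and tied evidence keeps the system from forcing a verdict when retrieval surfaces ambiguous sentences, which fits the decision-support framing above.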
Looking forward, integrating more refined NLI-specific models and tailored training data can enhance the precision of fact-checking. Additionally, incorporating contextual features and linguistic nuances could further improve MelAnalyze’s capabilities for nuanced claim analysis. Beyond melatonin, our approach has broad applications across diverse domains. Because the components are modular, MelAnalyze can be extended to additional health domains by swapping the evidence database and reusing the same pipeline with light domain-specific fine-tuning.
Finally, we note that MelAnalyze does not attempt to establish causal relationships between interventions and clinical outcomes. The current framework verifies claims at the level of textual entailment, while causal inference requires dedicated study designs and statistical methods which are beyond the scope of our study [42, 43]. Incorporating causal inference approaches in future extensions could complement fact checking by enabling stronger conclusions about intervention effects.
Acknowledgements
The authors acknowledge the funding provided by Vectura Fertin Pharma Inc. for a portion of this study. The authors thank Dr. Julia Hoeng and Gordon Dawson from Vectura Fertin for their strategic advice and guidance throughout the course of this study. This work was originally done in October 2023.
Author contribution
SKP, SG, and GE conceived the study. SKP designed the solution strategy. GE prepared the expert-curated dataset. NK conducted the experiments and generated the results. All authors wrote and reviewed the paper.
Funding
The authors acknowledge the funding provided by Vectura Fertin Pharma Inc. for a portion of this study.
Data availability
All data and tool information is available at the supplementary companion website https://bit.ly/melanalyze_tool.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Polyzou M, Kiefer D, Baraliakos X, Sewerin P. Addressing the spread of health-related misinformation on social networks: an opinion article. Front Med. 2023;10:1167033
- 2. Smith GD, Ng F, Li WHC. COVID-19: emerging compassion, courage and resilience in the face of misinformation and adversity. J Clin Nurs. 2020;29:1425
- 3. DeLorme DE, Huh J, Reid LN, An S. Dietary supplement advertising in the US: a review and research agenda. Int J Advertising. 2012;31:547–77
- 4. Lewandowsky S, Ecker UK, Cook J. Beyond misinformation: understanding and coping with the “post-truth” era. J Appl Res Memory Cognition. 2017;6:353–69
- 5. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018
- 6. Brown T, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901
- 7. Pandi-Perumal SR, et al. Physiological effects of melatonin: role of melatonin receptors and signal transduction pathways. Prog Neurobiol. 2008;85:335–53
- 8. Claustrat B, Geoffriau M, Brun J, Chazot G. Melatonin in humans: a biochemical marker of the circadian clock and an endogenous synchronizer. Neurophysiol Clin. 1995;25:351–59
- 9. Hardeland R. Melatonin and inflammation—story of a double-edged blade. J Pineal Res. 2018;65:e12525
- 10. Arendt J. Melatonin: characteristics, concerns, and prospects. J Biol Rhythms. 2005;20:291–303
- 11. Costello RB, et al. The effectiveness of melatonin for promoting healthy sleep: a rapid evidence assessment of the literature. Nutr J. 2014;13:1–17
- 12. Anderson G, Reiter RJ. Melatonin: roles in influenza, COVID-19, and other viral infections. Rev Med Virol. 2020;30:e2109
- 13. Wadden D, et al. Fact or fiction: verifying scientific claims. arXiv preprint arXiv:2004.14974. 2020
- 14. Li J, Somers VK, Xu H, Lopez-Jimenez F, Covassin N. Trends in use of melatonin supplements among US adults, 1999–2018. JAMA. 2022;327:483–85
- 15. Cohen PA, Avula B, Wang Y-H, Katragunta K, Khan I. Quantity of melatonin and CBD in melatonin gummies sold in the US. JAMA. 2023;329:1401–02
- 16. Konstan JA, Riedl J. Recommender systems: from algorithms to user experience. User Model User-Adapt Interact. 2012;22:101–23
- 17. Banker S, Khetani S. Algorithm overdependence: how the use of algorithmic recommendation systems can increase risks to consumer well-being. J Public Policy Mark. 2019;38:500–15
- 18. Fogg BJ. Persuasive technology: using computers to change what we think and do. Ubiquity. 2002;2002:2
- 19. Vaswani A, et al. Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc.; 2017
- 20. Liu Y, et al. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. 2019
- 21. Lee J, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–40
- 22. Gu Y, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc. 2021;3:1–23
- 23. Gururangan S, et al. Don’t stop pretraining: adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964. 2020
- 24. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084. 2019
- 25. Montague J. Melatonin mystery. New Scientist. 2022;256:41–45
- 26. Acuña-Castroviejo D, et al. Melatonin-mitochondria interplay in health and disease. Curr Top Med Chem. 2011;11:221–40
- 27. Acuña-Castroviejo D, et al. Extrapineal melatonin: sources, regulation, and potential functions. Cell Mol Life Sci. 2014;71:2997–3025
- 28. Brzezinski A. Melatonin in humans. N Engl J Med. 1997;336:186–95
- 29. Reiter RJ. Melatonin: clinical relevance. Best Pract Res Clin Endocrinol Metab. 2003;17:273–85
- 30. Hardeland R, Pandi-Perumal SR, Cardinali DP. Melatonin. Int J Biochem Cell Biol. 2006;38:313–16
- 31. Reppert SM, Weaver DR. Melatonin madness. Cell. 1995;83:1059–62. 10.1016/0092-8674(95)90131-0
- 32. Bastan M, Surdeanu M, Balasubramanian N. BioNLI: generating a biomedical NLI dataset using lexico-semantic constraints for adversarial examples. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022:5093–5104. 10.18653/v1/2022.findings-emnlp.374
- 33. Damodaran P. Parrot: paraphrase generation for NLU. 2021
- 34. Bowman SR, Angeli G, Potts C, Manning CD. A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2015
- 35. Sadat M, Caragea C. SciNLI: a corpus for natural language inference on scientific text. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 7399–409. 10.18653/v1/2022.acl-long.511
- 36. Ahmed SAJA, et al. Large scale text mining for deriving useful insights: a case study focused on microbiome. Front Physiol. 2022;13:933069
- 37. Touvron H, et al. LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023
- 38. Hu EJ, et al. LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. 2021
- 39. Touvron H, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023
- 40. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314. 2023
- 41. Zhuang L, Wayne L, Ya S, Jun Z. A robustly optimized BERT pre-training approach with post-training. In: Proceedings of the 20th Chinese National Conference on Computational Linguistics; 2021; Huhhot, China. p. 1218–1227. Chinese Information Processing Society of China
- 42. Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183:758–64
- 43. Hernán MA, Robins JM. Causal inference: what if. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press; 2025
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
All data and tool information is available at the supplementary companion website https://bit.ly/melanalyze_tool.