Scientific Reports. 2026 Mar 5;16:12245. doi: 10.1038/s41598-026-42981-3

Hallucination-aware learning and latency optimization transformer (HALL-OPT) for real-time edge intelligence

Danah Algawiaz 1
PMCID: PMC13079745  PMID: 41786996

Abstract

Transformer architectures and large language models achieve strong performance across a broad range of AI tasks, yet they remain challenging to deploy in resource-constrained edge computing environments due to high resource demands and the generation of erroneous or fabricated outputs (hallucinations). In this paper, a single scheme, HALL-OPT, is proposed to address both hallucination detection and latency reduction for real-time edge intelligence. The framework comprises three main elements: (1) a dual-stream hallucination detector that analyses internal attention behaviour, (2) an adaptive token-pruning system that extracts the necessary context at minimal computation, and (3) a lightweight edge-optimised transformer obtained by knowledge distillation. On SQuAD 2.0 and CNN/DailyMail, HALL-OPT detects hallucinations with 94.3% accuracy and achieves a 67.8% reduction in inference latency with only a 2.1% decrease in accuracy compared to the BERT-base model. When deployed on edge hardware, the system provides sub-50 ms response times while consuming 43% less energy, making it appropriate for real-time applications in industrial IoT, autonomous systems, healthcare monitoring, and other latency-critical settings. Existing transformer optimisation and hallucination mitigation approaches treat reliability and efficiency as separate objectives, limiting their applicability in real-time edge environments. HALL-OPT uniquely integrates hallucination-aware attention, adaptive pruning, and edge-oriented optimisation into a single unified framework, enabling simultaneous reductions in hallucination, latency, and energy consumption. This integrated design distinguishes HALL-OPT from prior work that optimises accuracy or efficiency in isolation.

Keywords: Hallucination detection, Transformer optimisation, Edge computing, Latency reduction, Attention mechanism, Knowledge distillation, Real-time inference

Subject terms: Engineering, Mathematics and computing

Introduction

Transformer-based architectures have transformed artificial intelligence, achieving breakthrough performance in natural language processing, computer vision, and multimodal learning1,2. However, deploying these models on edge computing platforms raises serious problems: highly complex computation, limited memory, and the production of factually incorrect output, called hallucinations3,4. These shortcomings greatly hinder the adoption of transformer-based models in latency-sensitive industrial systems such as autonomous vehicles, smart manufacturing, and healthcare monitoring systems5,6.

Recent studies have employed independent approaches to reduce hallucinations or to improve computational efficiency, but seldom both7,8. The standard techniques for hallucination detection are based on external knowledge bases or multi-sampling schemes, which add extra computational load9,10. On the other hand, quantisation, pruning, and knowledge distillation, which are methods of latency optimisation, tend to undermine model accuracy and reliability11,12. This trade-off between reliability and performance constitutes a fundamental impediment to the application of transformers in practical edge intelligence contexts13,14.

The Internet of Things (IoT) and edge computing paradigm require models capable of providing precise, reliable predictions with limited latency and energy15,16. The time required to make inferences in industrial settings should not exceed 50 milliseconds, and the accuracy of the facts must be high to meet industrial requirements17,18.

Figure 1 illustrates the inherent trade-off between hallucination rate and inference latency in transformer-based language models across cloud and edge deployment environments. Cloud-based transformers typically achieve lower hallucination rates due to their high computational capacity but suffer from excessive inference latency, which limits their use in real-time applications. In contrast, edge devices impose strict latency and resource constraints, and while they enable faster inference, lightweight or compressed models deployed at the edge often exhibit higher hallucination rates. The depicted performance gap highlights the unresolved challenge of simultaneously achieving low hallucination and low latency, motivating the proposed HALL-OPT framework, which bridges this gap through joint reliability- and efficiency-aware optimisation.

Fig. 1.

Fig. 1

The gap between the performance of cloud-based transformers and the requirements of edge devices: hallucination rate and inference latency are dual constraints in real-time applications.

Although hallucination-detection methods and efficiency-focused optimisation methods have been studied recently, these directions have developed independently. The existing literature focuses on accuracy and reliability, on pruning and quantisation, or on low-latency operation, but none of these strands combines reliability and efficiency in a single framework designed for real-time edge deployment. This research gap is the objective of HALL-OPT.

In this paper, a unified framework that addresses both challenges within one integrated architecture is proposed: HALL-OPT (Hallucination-Aware Learning and Latency Optimisation Transformer). At a high level, HALL-OPT comprises four closely related elements that operate collaboratively as a single edge-optimised transformer architecture. The framework includes a hallucination-aware attention system that examines internal attention patterns, a dynamic token-pruning mechanism that selectively removes computation, an adaptive knowledge-distillation pipeline that builds a small yet reliable student network, and an edge-optimisation layer that introduces quantisation and hardware-aware acceleration. These modules cooperate to enable low-latency inference, lower hallucination rates, and efficient execution on resource-constrained edge devices.

Prior research has primarily focused on either hallucination detection or computational optimisation, without providing a unified solution capable of addressing both reliability and efficiency in real-time edge deployments. This creates a gap, leaving transformer models unsuitable for latency-critical and safety-sensitive applications. HALL-OPT addresses this gap by embedding hallucination awareness directly into the attention mechanism and leveraging it to guide pruning, distillation, and quantisation. The framework is validated through extensive evaluation on hallucination-prone benchmarks and deployment on multiple edge hardware platforms, ensuring both methodological rigour and practical relevance.

Our contributions are:

Dual-Stream Hallucination Detection: Our hallucination detector is a lightweight module grounded in internal attention behaviour and token-wise uncertainty, and it avoids the use of external knowledge bases. This module achieves a hallucination detection accuracy of 94.3% while introducing negligible computational overhead.

Adaptive Latency Optimisation: Our attention-guided adaptive token-pruning strategy reduces inference latency by 67.8%. The selective pruning process preserves semantic integrity by retaining high-value tokens while reducing computation.

Edge-Optimised Architecture: Through adaptive knowledge distillation and quantisation-aware training, we obtain a compact edge model that retains near-original accuracy and uses 43% less energy than the corresponding transformer baselines. This enables deployment on resource-constrained platforms such as NVIDIA Jetson and Coral TPU.

Thorough Assessment: We conduct extensive testing across a variety of datasets and hardware environments to evaluate HALL-OPT. The results show consistent gains in accuracy, latency, energy efficiency, and robustness against hallucinations over 10 state-of-the-art baselines.

Open-Source Implementation: The entire implementation, including the training scripts, inference pipeline, and pre-trained models, is available for reproducibility and research on trustworthy, practical edge intelligence.

The rest of this paper is organised as follows: Section II reviews related work, Section III outlines the proposed methodology and mathematical modelling, Section IV presents the results and evaluation, Section V provides the discussion, and Section VI concludes the paper.

Related work

Hallucination detection in language models

Recent developments on hallucination detectors have been based on post-hoc checkers and internal state examination1,3. Su et al. proposed MIND, an unsupervised system that uses internal representations to detect in real time1, and Xu et al. examined token input in neural machine translation to identify hallucination patterns3. These methods, however, often require high computational capacity, which is not compatible with edge deployment. Hallucinations were studied in pruned models by Chrysostomou et al.4, whose results indicated that, in some cases, model compression can enhance factual accuracy, albeit at the cost of ignoring latency issues.

Transformer architecture optimisation

Efficiency-oriented transformer designs have become an important research priority2,5. Zhou et al. proposed using zero-cost proxies and training-free architecture search2, and TransKD proposed using knowledge distillation for semantic segmentation5. These techniques provide computational savings but do not directly address the reliability of the output. The issues of optimisation and trustworthiness remain significant problems in the deployment of transformers.

Knowledge distillation and model compression

Knowledge distillation is an effective route to model compression8,17,19. Graph-based distillation structures8 and reciprocal teacher-student learning17 increase efficiency without degrading performance. Nevertheless, these methods are mainly concerned with computational metrics and do not address hallucination mitigation. Recent research on federated distillation18,20,21 shows that distillation can be conducted in distributed edge scenarios, but does not incorporate hallucination awareness.

Edge computing and real-time inference

The focus of edge intelligence research is to minimise latency and reduce energy consumption6,13,14. Federated learning6 and hardware-aware optimisation15 both include mechanisms for addressing deployment challenges. However, in most cases, available solutions do not combine reliability mechanisms with efficiency optimisation, which limits their use in safety-critical areas where accuracy and speed are the primary factors.

Attention mechanism enhancement

Improvements in attention mechanisms have been centred on computational efficiency10,22,23 and on specific applications24,25. Although these developments simplify the attention process, they do not resolve the trade-off between the model’s reliability and inference speed. This gap is bridged in our work, which views hallucination awareness as part of the attention optimisation process.

The literature review shows that current methods address hallucination detection or latency minimisation separately, without the option of simultaneous integration. HALL-OPT addresses this gap by integrating the two objectives into a single framework, optimised for edge deployment.

Recent EdgeML and TinyML studies emphasise that deploying transformer-based models on constrained hardware requires more than isolated compression or quantisation steps. A comprehensive review by Arif and Rashid systematically analyses model conversion pipelines, inference optimisation strategies, and learning adaptations required for TinyML deployment, highlighting challenges related to memory limits, execution latency, and energy consumption across heterogeneous edge platforms26. Their findings indicate that deployment-ready models must jointly address architectural efficiency, runtime behaviour, and hardware constraints, rather than treating these aspects independently. This motivates the need for integrated optimisation frameworks such as HALL-OPT.

Recent work on model deployment pipelines26 and worst-case execution time estimation27,28 further supports the design rationale of HALL-OPT. Arif and Rashid26 demonstrate that TinyML deployment requires joint optimisation of model conversion, inference strategies, and hardware constraints. Shah et al.27 show that prediction models can effectively estimate WCET for real-time systems, while Rashid et al.28 propose adaptive surrogate methods for determining worst-case data patterns. These findings validate HALL-OPT’s integrated approach combining hallucination-aware optimisation with latency-bounded edge deployment. Recent studies have further explored efficient transformer architectures, attention mechanisms, and knowledge distillation strategies that contribute to improving model efficiency and deployment feasibility in complex environments. Tao et al. proposed a linear graph transformer based on graph-attention distillation to enhance computational efficiency while preserving structural information in graph learning tasks29. Banu and Deivalakshmi demonstrated that attention-gated architectures can significantly improve feature selection and segmentation accuracy by focusing on salient regions of the input data30. Wang and Wang introduced a transformer-guided serial knowledge distillation framework that improves high-precision anomaly detection through progressive teacher–student learning31. In the context of distributed and edge environments, Wang et al. investigated federated graph neural network–based reinforcement learning for optimizing information freshness in vehicular edge computing systems32. Furthermore, He et al. proposed a spatio-temporal transformer network with physical knowledge distillation for improving forecasting accuracy in complex temporal prediction tasks33. 
Together, these studies highlight ongoing efforts to improve transformer efficiency, distillation strategies, and deployment adaptability, which align with the optimisation objectives addressed by HALL-OPT.

While existing methods demonstrate effectiveness in either hallucination detection or model compression, their separation of reliability and efficiency objectives limits applicability in real-time edge environments. Detection-oriented approaches often introduce significant computational overhead, whereas efficiency-driven methods may exacerbate the risk of hallucination. These limitations motivate the need for an integrated framework that jointly optimises reliability and efficiency.

Table 1 compares representative hallucination-detection, transformer-optimisation, and edge-deployment approaches, highlighting differences in methodology, evaluation scope, and practical limitations.

Table 1.

Comparative analysis of hallucination detection and transformer optimisation methods.

| Method (reference) | Primary focus | Core methodology | Dataset(s) | Evaluation metrics | Edge suitability | Key limitations |
|---|---|---|---|---|---|---|
| MIND1 | Hallucination detection | Internal state and uncertainty analysis | QA benchmarks | Detection accuracy, AUC | Low | High computational overhead, not latency-aware |
| Model Introspection (Xu et al.)3 | Hallucination detection | Token-level introspection in NMT | Translation datasets | Consistency, accuracy | Low | Task-specific, not suitable for edge deployment |
| Hierarchical Semantic Piece9 | Hallucination reduction | Semantic decomposition constraints | NLP benchmarks | Factual accuracy | Medium | No inference efficiency optimisation |
| DistilBERT7 | Efficiency optimisation | Knowledge distillation | General NLP | Accuracy, FLOPs | High | Hallucination mitigation not addressed |
| TinyBERT5 | Efficiency optimisation | Layer-wise distillation | NLP benchmarks | Accuracy, speed | High | Accuracy loss and hallucination persistence |
| TransKD5 | Model compression | Task-specific knowledge distillation | Vision/NLP | Accuracy, FLOPs | Medium | Reliability not considered |
| Graph Knowledge Distillation8 | Model compression | Graph-based feature distillation | NLP tasks | Accuracy | Medium | Hallucination awareness absent |
| Federated Distillation18 | Distributed edge learning | Personalised federated distillation | Edge datasets | Accuracy, convergence | High | No hallucination modelling |
| Attention Optimisation10 | Attention efficiency | Optimised attention mechanisms | Task-specific | Speed, memory | Medium | Reliability–latency trade-off unresolved |
| Edge Inference Optimisation6 | Edge deployment | Hardware-aware inference offloading | Edge workloads | Latency, energy | High | Reliability not addressed |
| HALL-OPT (proposed) | Unified reliability + efficiency | Hallucination-aware attention, adaptive pruning, distillation, quantisation | SQuAD 2.0, CNN/DailyMail | Accuracy, latency, energy, and hallucination detection | High | Increased training complexity |

Although existing studies have achieved notable progress in hallucination detection or transformer efficiency, their design objectives remain fragmented. Hallucination-detection approaches, such as post-hoc internal-state analysis and semantic-consistency modelling, improve factual reliability but introduce additional inference overhead, making them unsuitable for latency-critical edge deployment. Conversely, efficiency-oriented transformer optimisation and distillation techniques substantially reduce computational cost, yet they operate without explicit mechanisms to control hallucinations, which can degrade trustworthiness in safety-critical scenarios. These contrasting strengths and weaknesses indicate that optimising reliability and efficiency in isolation leads to trade-offs that limit practical edge applicability.

In contrast to prior approaches, HALL-OPT departs from the conventional separation between hallucination mitigation and model efficiency. Instead of treating hallucination detection as a post-processing or auxiliary task, the proposed framework embeds hallucination awareness directly within the attention mechanism. It propagates this information to guide token pruning, knowledge distillation, and quantisation. This design ensures that efficiency optimisation decisions are informed by reliability signals, enabling simultaneous control of factual correctness, latency, and energy consumption, an integration not addressed by existing methods.

As summarised in Table 1, existing methods either prioritise hallucination detection at the expense of deployment efficiency or optimise transformer architectures without addressing reliability risks. HALL-OPT advances beyond the current state of the art by unifying hallucination-aware attention modelling, adaptive token pruning, and edge-oriented optimisation within a single deployable transformer framework. This unified design enables measurable improvements in accuracy, hallucination-detection performance, inference latency, and energy efficiency, thereby bridging the gap between research-level transformer models and real-world edge intelligence requirements.

Proposed methodology

System overview

The HALL-OPT framework is designed around the principle that factual reliability and computational efficiency should be optimised jointly rather than independently. Instead of treating hallucination detection as a post-processing step, the proposed system embeds reliability awareness directly into the inference pipeline. Each module contributes a specific role: hallucination-aware attention identifies unreliable information, token pruning reduces unnecessary computation, knowledge distillation preserves performance in compact models, and edge optimisation ensures deployability under strict resource constraints.

Figure 2 presents the end-to-end architecture of the proposed HALL-OPT framework and illustrates how hallucination awareness and efficiency optimisation are jointly realised during inference. The pipeline begins with the input text or query encoder, which converts raw input into token representations. These representations are processed by the Hallucination-Aware Attention Mechanism (HAAM), which analyses attention entropy and prediction uncertainty to estimate token-level hallucination risk. The hallucination scores generated by HAAM are then propagated to the Dynamic Token Pruning (DTP) module. Here, tokens with low importance or high hallucination risk are selectively removed, while semantically important and reliable tokens are retained. This selective pruning directly reduces the effective sequence length, lowering computational complexity without compromising factual consistency. The pruned token representations are subsequently passed to the Edge Optimisation Layer, which applies quantisation-aware optimisation and hardware-friendly execution to enable efficient inference on resource-constrained edge devices. Finally, the system produces a prediction along with hallucination flags indicating potentially unreliable tokens or outputs. Overall, Fig. 2 shows that reliability signals extracted during attention analysis are reused across the pruning and optimisation stages, enabling HALL-OPT to simultaneously achieve hallucination mitigation, reduced latency, and lower energy consumption within a unified inference framework.

Fig. 2.

Fig. 2

The architecture of the HALL-OPT system reveals a combination of hallucination detection, dynamic pruning, knowledge distillation, and edge optimisation.

Hallucination-aware attention mechanism

The hallucination-aware attention mechanism is motivated by the observation that hallucinated outputs often arise from unstable or diffuse attention patterns and high prediction uncertainty. By monitoring attention entropy, output confidence, and contextual consistency, the model can identify tokens that are likely to be unreliable during inference. These signals provide an internal measure of trustworthiness without relying on external knowledge bases, enabling real-time hallucination detection suitable for edge deployment.

The HAAM module examines attention patterns to detect potential hallucinations during inference. For an input sequence $X = (x_1, \dots, x_n)$, the standard multi-head attention is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W^O \quad (1)$$

where each attention head is computed as:

$$\mathrm{head}_i = \mathrm{softmax}\!\left( \frac{Q W_i^Q \left(K W_i^K\right)^{\top}}{\sqrt{d_k}} \right) V W_i^V \quad (2)$$

We define a hallucination detection score $H_t$ for token $x_t$ based on attention entropy and output uncertainty:

$$H_t = \alpha \, E_t + \beta \, U_t + \gamma \left(1 - C_t\right) \quad (3)$$

Training label construction and weight normalisation

For hallucination-aware supervision, binary hallucination labels are constructed during training using task-specific ground-truth consistency rules. In the case of SQuAD 2.0, a generated token is labelled as hallucinated if it appears in an answer to an unanswerable question or contradicts the reference answer span provided in the dataset. For CNN/DailyMail, hallucination labels are assigned by comparing generated summaries to the source articles; tokens introducing unsupported entities, numerical values, or causal relationships not present in the input document are marked as hallucinated. These labels are used only during training to guide the hallucination-aware loss and are not required during inference.

The hallucination labelling procedure follows explicit algorithmic rules for reproducibility. For SQuAD 2.0: a token t is labeled as hallucinated if (a) t appears in a generated answer to a question marked “unanswerable” in ground truth, (b) t introduces an entity not in the reference span (Jaccard similarity < 0.5 between generated and reference entities), or (c) t contains negation words (“not”, “never”, “no”) that invert the reference meaning. For CNN/DailyMail: t is hallucinated if (a) NER(t) ∉ NER(source_document), (b) numeric inconsistency exceeds 10% threshold (|num(t) − closest_num(source)|/closest_num(source) > 0.1), or (c) t contains causal relations (nsubj→VERB→dobj patterns) not present in source. Labels are stored as binary vectors aligned with tokenised sequences.
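The two quantitative rules above (the Jaccard entity-overlap check for SQuAD 2.0 and the 10% numeric-inconsistency threshold for CNN/DailyMail) can be sketched as follows; the function names and the treatment of empty inputs are our own illustrative choices, not part of the paper's released code:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two entity sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

def numeric_inconsistent(value: float, source_numbers: list,
                         threshold: float = 0.1) -> bool:
    """True if `value` deviates from the closest number in the source
    document by more than `threshold` (10% relative, per the rule)."""
    if not source_numbers:
        return True  # no supporting number in the source at all
    closest = min(source_numbers, key=lambda n: abs(n - value))
    if closest == 0:
        return value != 0
    return abs(value - closest) / abs(closest) > threshold
```

A SQuAD rule (b) check would then be `jaccard(generated_entities, reference_entities) < 0.5`, marking the token as hallucinated.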

The scalar weights $\alpha$, $\beta$, and $\gamma$ in Eq. (3) are trainable parameters that control the relative contribution of attention entropy, output uncertainty, and contextual consistency. To ensure numerical stability and balanced optimisation, these weights are normalised using a softmax function such that $\alpha + \beta + \gamma = 1$ at each training step. This normalisation prevents dominance of any single component and allows the hallucination detection score to adapt dynamically based on learned importance across uncertainty signals.

Weight normalisation uses a temperature-scaled softmax with τ = 0.5, where each weight is computed as exp(w/τ) divided by the sum of the three exponentiated weights. Constraint bounds of [0.1, 0.9] are enforced via projected gradient descent. Weights stabilise within 3 epochs (std < 0.02 across 5 runs), with final learned values: α = 0.28 ± 0.03, β = 0.31 ± 0.02, γ = 0.41 ± 0.04 on SQuAD 2.0.
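The temperature-scaled softmax with τ = 0.5 can be sketched as below. A single clip-and-renormalise step is used here as a simple stand-in for the projected-gradient update described in the text, so it only approximately enforces the [0.1, 0.9] bounds:

```python
import numpy as np

def normalise_weights(raw_w: np.ndarray, tau: float = 0.5,
                      lo: float = 0.1, hi: float = 0.9) -> np.ndarray:
    """Temperature-scaled softmax over the raw weights (tau = 0.5),
    followed by one clip-and-renormalise step approximating the
    projection onto the [lo, hi] box constraint."""
    z = raw_w / tau
    z = z - z.max()                      # numerical stability
    w = np.exp(z) / np.exp(z).sum()      # softmax, sums to one
    w = np.clip(w, lo, hi)               # approximate box projection
    return w / w.sum()                   # restore the sum-to-one constraint
```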

Here, $\alpha$, $\beta$, and $\gamma$ are learnable scalar weights that regulate the contributions of attention entropy, output uncertainty, and context consistency, respectively. These parameters are initialised to equal values and optimised jointly with the rest of the network as part of the hallucination detection module. $E_t$ is the attention entropy:

$$E_t = -\sum_{j=1}^{n} a_{t,j} \log a_{t,j} \quad (4)$$

where $a_{t,j}$ is the attention weight from token $t$ to token $j$.

$U_t$ denotes the output probability uncertainty, computed from the model's predictive distribution over the vocabulary $\mathcal{V}$:

$$U_t = 1 - \max_{v \in \mathcal{V}} P\!\left(y_t = v \mid x\right) \quad (5)$$

and $C_t$ measures attention consistency with the context:

$$C_t = \frac{a_t \cdot \bar{a}}{\lVert a_t \rVert_2 \, \lVert \bar{a} \rVert_2} \quad (6)$$

Here, $\bar{a}$ refers to the context-attention vector used as a reference in the consistency measurement. In particular, $\bar{a}$ is calculated layer-wise as the mean attention distribution across all tokens in the same layer. This provides a stable contextual reference point, enabling the model to detect deviations in token-level attention that may indicate a tendency to hallucinate.

A token is flagged as potentially hallucinated when:

$$H_t > \tau \quad (7)$$

where $\tau$ is a learned threshold parameter.

The hallucination detection threshold $\tau$ in Eq. (7) is treated as a learnable scalar parameter and jointly optimised with the hallucination-aware attention parameters using standard backpropagation. Specifically, $\tau$ is updated through gradients derived from the hallucination-aware loss $\mathcal{L}_{\text{hall}}$ defined in Eq. (14). No heuristic or rule-based tuning is employed. During training, $\tau$ adapts automatically to balance false positives and false negatives in hallucination detection, enabling stable convergence without manual calibration.
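A minimal numpy sketch of the score computation in Eqs. (3)-(6), assuming single-head attention, the layer-mean context vector described above, and the learned SQuAD 2.0 weights reported earlier (α = 0.28, β = 0.31, γ = 0.41) as defaults:

```python
import numpy as np

def hallucination_scores(attn: np.ndarray, probs: np.ndarray,
                         alpha: float = 0.28, beta: float = 0.31,
                         gamma: float = 0.41, eps: float = 1e-9) -> np.ndarray:
    """Token-level hallucination scores H_t, Eq. (3).

    attn:  (n, n) attention matrix for one head/layer, rows sum to 1.
    probs: (n, V) output token distributions.
    """
    # Eq. (4): attention entropy per token (unnormalised nats)
    E = -np.sum(attn * np.log(attn + eps), axis=-1)
    # Eq. (5): output uncertainty = 1 - max predicted probability
    U = 1.0 - probs.max(axis=-1)
    # Eq. (6): cosine consistency with the layer-mean attention vector
    a_bar = attn.mean(axis=0)
    C = attn @ a_bar / (np.linalg.norm(attn, axis=-1)
                        * np.linalg.norm(a_bar) + eps)
    # Eq. (3): weighted combination; low consistency raises the score
    return alpha * E + beta * U + gamma * (1.0 - C)
```

In practice the entropy term would likely be normalised to a comparable range before combination; the sketch keeps the raw form of Eq. (4).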

Dynamic token pruning

Dynamic token pruning is based on the insight that not all tokens contribute equally to the final prediction. Many tokens are redundant or unreliable and can be safely removed without harming output quality. By combining token salience, contextual relevance, and hallucination risk into a unified importance score, the pruning strategy selectively retains informative and reliable tokens while discarding low-value ones. This reduces computational cost and latency while preserving semantic integrity.

The importance score design follows three intuitions: (1) larger hidden state magnitudes indicate stronger semantic relevance, (2) higher cumulative attention weights reflect greater contextual contribution, and (3) lower hallucination risk tokens should be preferentially retained for output reliability. These motivations guide the subsequent mathematical formulation.

The importance score $I_t^{(l)}$ for token $x_t$ at layer $l$ is computed as:

$$I_t^{(l)} = \lambda_1 \left\lVert h_t^{(l)} \right\rVert_2 + \lambda_2 \sum_{j=1}^{n} a_{j,t}^{(l)} + \lambda_3 \left(1 - H_t\right) \quad (8)$$

in which $h_t^{(l)}$ is the hidden state of token $t$ at layer $l$, $a_{j,t}^{(l)}$ are the attention weights received by token $t$, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are parameters to be learned.

The importance score is designed according to three major intuitions. First, the $L_2$-norm of the token representation, $\lVert h_t^{(l)} \rVert_2$, measures the token's intrinsic salience at that layer. Second, the summed attention weights $\sum_j a_{j,t}^{(l)}$ reflect how much the model attends to the token across the sequence, indicating its contextual importance. Third, the $(1 - H_t)$ term gives preference to tokens with a lower hallucination risk when pruning content, thereby retaining highly reliable content. Combined, these elements provide a well-rounded estimate of a token's significance during dynamic pruning.

Tokens whose importance falls below a dynamic threshold are eliminated:

$$\mathcal{T}_{\text{keep}}^{(l)} = \left\{ t \,:\, I_t^{(l)} \geq \theta^{(l)} \right\} \quad (9)$$

Dynamic pruning threshold clarification

The pruning threshold $\theta^{(l)}$ is computed independently at each transformer layer to adaptively control the number of retained tokens under a given computational budget. Instead of using a fixed pruning ratio, the threshold is derived from the statistical distribution of token importance scores within the same layer. This ensures that pruning decisions are sensitive to both input complexity and token-level relevance.

Specifically, tokens whose importance scores fall below the dynamically computed threshold are removed, while tokens with high importance and low hallucination risk are retained. This adaptive mechanism allows the model to preserve semantically critical tokens in complex inputs, while aggressively pruning redundant or unreliable tokens when possible. As a result, pruning behaviour remains stable across varying sequence lengths and domains, preventing excessive information loss.

The dynamic threshold adapts based on computational budget:

$$\theta^{(l)} = \mu^{(l)} + \sigma^{(l)} \, \Phi^{-1}(1 - \rho) \quad (10)$$

with $\Phi^{-1}$ denoting the standard normal quantile function, so that approximately a fraction $\rho$ of tokens exceeds the threshold.

where $\mu^{(l)}$ and $\sigma^{(l)}$ are the mean and standard deviation of the importance scores at layer $l$, and $\rho$ is the target retention ratio.

The target retention ratio $\rho$ in Eq. (10) is not fixed manually. Instead, it is dynamically adjusted during inference based on both hardware constraints and input complexity. A maximum retention budget is set according to device latency limits, while the actual retention ratio is computed per input using the distribution of token importance scores. This allows HALL-OPT to retain more tokens for complex inputs and aggressively prune redundant tokens for simpler sequences.

After pruning, attention weights are renormalised:

$$\tilde{a}_{i,j} = \frac{a_{i,j}}{\sum_{k \in \mathcal{T}_{\text{keep}}} a_{i,k}}, \qquad j \in \mathcal{T}_{\text{keep}} \quad (11)$$
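The pruning step in Eqs. (8)-(11) can be sketched as follows. The threshold here uses the standard normal quantile of the retention ratio as one plausible reading of the distribution-based rule, and equal λ weights are assumed purely for illustration:

```python
import numpy as np
from statistics import NormalDist

def prune_tokens(h: np.ndarray, attn: np.ndarray, H: np.ndarray,
                 rho: float = 0.5, lam=(1.0, 1.0, 1.0)):
    """Dynamic token pruning, Eqs. (8)-(11), numpy sketch.

    h:    (n, d) hidden states at one layer.
    attn: (n, n) attention matrix, rows sum to 1.
    H:    (n,) hallucination scores.
    rho:  target retention ratio.
    Returns kept indices and the renormalised attention over kept tokens.
    """
    l1, l2, l3 = lam
    # Eq. (8): importance = salience + received attention + reliability
    I = (l1 * np.linalg.norm(h, axis=-1)
         + l2 * attn.sum(axis=0)
         + l3 * (1.0 - H))
    # Eq. (10): distribution-based threshold keeping ~rho of the tokens
    rho = min(max(rho, 1e-3), 1.0 - 1e-3)        # keep quantile well-defined
    theta = I.mean() + I.std() * NormalDist().inv_cdf(1.0 - rho)
    keep = np.where(I >= theta)[0]               # Eq. (9)
    if keep.size == 0:                           # never prune everything
        keep = np.array([int(I.argmax())])
    sub = attn[np.ix_(keep, keep)]
    sub = sub / sub.sum(axis=-1, keepdims=True)  # Eq. (11): renormalise rows
    return keep, sub
```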

Adaptive knowledge distillation

Adaptive knowledge distillation aims to transfer both predictive capability and reliability behaviour from a large teacher model to a compact student model. In addition to matching output distributions, the proposed approach penalises hallucination-prone predictions and aligns intermediate representations. This ensures that the student model not only learns what to predict, but also when to avoid overconfident or unreliable outputs, which is essential for safe deployment on edge devices.

To maintain performance while reducing model size, we employ adaptive knowledge distillation from a teacher model $M_T$ to a student model $M_S$. The total loss combines distillation, task-specific, and hallucination penalties:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{KD}} \mathcal{L}_{\text{KD}} + \lambda_{\text{task}} \mathcal{L}_{\text{task}} + \lambda_{\text{hall}} \mathcal{L}_{\text{hall}} \quad (12)$$

The distillation loss with temperature scaling:

$$\mathcal{L}_{\text{KD}} = T^2 \cdot \mathrm{KL}\!\left( \mathrm{softmax}\!\left(\frac{z_T}{T}\right) \,\Big\|\, \mathrm{softmax}\!\left(\frac{z_S}{T}\right) \right) \quad (13)$$

where $z_T$ and $z_S$ are the teacher and student logits, respectively, and $T$ is the distillation temperature.

The hallucination-aware loss penalises uncertain predictions:

[Eq. (14): hallucination-aware loss L_hall penalising uncertain, hallucination-prone predictions; equation image not recoverable.]

Feature-level distillation for intermediate layers:

L_feat = Σ_l ‖ F_T^(l) − W_p · F_S^(l) ‖²  (15)

where F_T^(l) and F_S^(l) are the teacher and student features at layer l, and W_p is a learnable projection matrix.
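The structure of the distillation objective can be sketched as follows. This is an illustrative Python version of the loss composition in Eqs. (12)–(13); the temperature and λ weights shown are placeholders, not the paper's tuned values.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (max-subtraction for stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(z_teacher, z_student, T=2.0):
    """Temperature-scaled KL distillation in the form of Eq. (13):
    T^2 * KL(p_T || p_S). T = 2.0 is an illustrative default."""
    p_t = softmax(z_teacher, T)
    p_s = softmax(z_student, T)
    return T * T * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))

def total_loss(l_distill, l_task, l_hall, lambdas=(0.5, 0.3, 0.2)):
    """Eq. (12): weighted sum of the three objectives;
    the lambda weights here are placeholders."""
    l1, l2, l3 = lambdas
    return l1 * l_distill + l2 * l_task + l3 * l_hall
```

The T² factor keeps the gradient magnitude of the softened KL term comparable to the hard task loss, which is the standard reason for temperature scaling in distillation.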

Edge optimisation layer

The edge optimisation layer addresses practical deployment constraints by reducing numerical precision and memory usage while maintaining model accuracy. Quantisation-aware training enables the model to adapt to low-precision arithmetic during optimisation, preventing abrupt performance degradation at inference time. This design ensures that the optimised model can operate efficiently on heterogeneous edge hardware with strict power and latency budgets.

The quantisation function for a weight w at b-bit precision is:

Q(w) = s · clip( round(w / s), −2^(b−1), 2^(b−1) − 1 )  (16)

where the scale factor s is computed as:

s = max |w| / (2^(b−1) − 1)  (17)

INT8 quantisation procedure

INT8 quantisation is performed using quantisation-aware training (QAT) to minimise accuracy degradation during low-precision inference. During training, fake-quantisation operators are inserted for both weights and activations to simulate INT8 arithmetic while maintaining floating-point gradients. This allows the model to adapt to reduced numerical precision during optimisation rather than after training.

A symmetric linear quantisation scheme is employed, where scale factors are computed per tensor using the maximum absolute weight, as defined in Eq. (17). Weights are mapped to the INT8 range via rounding and clipping, ensuring numerical stability and avoiding overflow. Activations are quantised using the same strategy during forward passes.
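The symmetric per-tensor scheme described above can be sketched in a few lines of Python. This is illustrative only; the deployed pipeline uses TensorRT's quantisation-aware training tooling rather than this hand-rolled function.

```python
def quantise_int8(weights):
    """Symmetric per-tensor INT8 quantisation: the scale comes from the
    maximum absolute weight (Eq. 17 with b = 8), then values are
    rounded and clipped to the INT8 range [-128, 127] (Eq. 16)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    dequantised = [qi * scale for qi in q]   # what inference actually sees
    return q, scale, dequantised
```

Because the scheme is symmetric, zero maps exactly to the integer 0, and the worst-case rounding error for any in-range weight is half a quantisation step (scale / 2).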

After training convergence, post-training calibration is conducted using a representative subset of the validation data to finalise quantisation parameters. The resulting INT8-quantised model is exported and deployed with TensorRT, enabling hardware-accelerated inference on edge platforms such as the Jetson AGX Xavier and the Coral TPU.

Calibration details: 1,024 representative samples (512 from SQuAD 2.0, 512 from CNN/DailyMail validation sets), 100 forward passes per batch (batch size = 32), total calibration duration of 847 s on A100 GPU. MinMax observer used with per-channel weight quantisation and per-tensor activation quantisation. Scale factors updated every 10 batches. Post-calibration accuracy threshold: |Acc_INT8 − Acc_FP32| < 2.5%.

A quantisation-aware loss term ensures that accuracy is preserved under low-precision arithmetic:

[Eq. (18): quantisation-aware training loss; equation image not recoverable.]

The energy consumption of the edge device is modelled as:

E_total = E_comp + E_mem + E_comm  (19)

where the computational energy is

E_comp = N_FLOP · ε_FLOP  (20)

the memory access energy is

E_mem = N_mem · ε_mem  (21)

and the communication energy is

E_comm = D_tx · ε_bit  (22)

with N_FLOP the number of floating-point operations, N_mem the number of memory accesses, D_tx the number of transmitted bits, and ε_FLOP, ε_mem, ε_bit the corresponding per-operation energy coefficients.
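Equations (19)–(22) describe an additive energy model. A minimal sketch follows, assuming linear per-operation costs; the coefficient values below are illustrative placeholders, not measured device constants.

```python
def energy_model(n_flops, n_mem_bytes, n_tx_bits,
                 e_flop=1e-12, e_mem=2e-11, e_bit=5e-11):
    """Additive edge-energy model in the spirit of Eqs. (19)-(22):
    total energy (J) is the sum of compute, memory-access, and
    communication terms, each linear in its operation count."""
    e_comp = n_flops * e_flop          # Eq. (20): computational energy
    e_memory = n_mem_bytes * e_mem     # Eq. (21): memory access energy
    e_comm = n_tx_bits * e_bit         # Eq. (22): communication energy
    return e_comp + e_memory + e_comm  # Eq. (19)
```

The model makes explicit why token pruning reduces all three terms at once: fewer tokens means fewer FLOPs, fewer cache bytes touched, and fewer bits to transmit.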

Training algorithm

The training procedure jointly optimises task performance, hallucination suppression, and efficiency. By integrating hallucination-aware loss, adaptive pruning, and knowledge distillation within a single optimisation loop, the framework ensures that reliability and efficiency objectives are learned simultaneously rather than sequentially. This unified training strategy enables stable convergence and consistent behaviour across edge deployment scenarios.

Algorithm 1 describes the complete training procedure for HALL-OPT, integrating all components into a unified optimisation framework.

Algorithm 1. HALL-OPT training algorithm.

The parameter set θ_H updated in Line 12 corresponds specifically to the learnable components of the hallucination detector, including the scalar weights α, β, γ, and the detection threshold τ_h. These parameters are optimised solely through the hallucination-aware loss L_hall to improve the detector’s sensitivity and stability during training.

Inference algorithm

During inference, the model dynamically adapts its computation based on both input complexity and reliability signals. High-hallucination-risk tokens are flagged, while low-importance tokens are pruned to reduce latency. This adaptive inference process ensures that predictions remain reliable under strict real-time constraints, making the framework suitable for safety-critical edge applications.

Algorithm 2 presents the efficient inference procedure optimised for edge devices with real-time constraints.

Algorithm 2. HALL-OPT inference algorithm.

Complexity analysis

The complexity analysis highlights how dynamic token pruning directly translates reliability-aware decisions into computational savings. By reducing the effective sequence length, both attention computation and memory usage scale down proportionally, enabling predictable performance gains on edge devices without compromising model correctness.

The computational complexity of HALL-OPT for sequence length n, hidden dimension d, and L layers is:

O( L · (ρ² n² d + ρ n d²) )  (23)

where ρ is the average token retention ratio after pruning. Compared to standard transformers with complexity O( L · (n² d + n d²) ), HALL-OPT achieves a significant reduction when ρ < 1.

Memory requirements:

M_total = M_weights + M_act + M_KV  (24)

with KV cache memory:

M_KV = 2 · B · L · (ρ n) · d · s_elem  (25)

(s_elem denotes the size in bytes of one stored element; the factor 2 accounts for keys and values.)

where B is the batch size. Dynamic pruning reduces cache memory proportionally to ρ.
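The effect of the retention ratio on the operation count can be checked with a small order-of-magnitude model, a sketch of Eq. (23) in which constant factors and non-attention costs are ignored.

```python
def transformer_cost(n, d, L, rho=1.0):
    """Leading-order operation count per Eq. (23): attention scales
    with (rho*n)^2 * d and the feed-forward part with (rho*n) * d^2,
    summed over L layers. Order-of-magnitude model only."""
    m = rho * n                # effective sequence length after pruning
    return L * (m * m * d + m * d * d)

def flops_reduction(n, d, L, rho):
    """Fractional saving relative to the unpruned model (rho = 1)."""
    return 1.0 - transformer_cost(n, d, L, rho) / transformer_cost(n, d, L, 1.0)
```

For n = d = 512 and ρ = 0.5, the model predicts a 62.5% reduction in leading-order operations: the quadratic attention term shrinks by ρ² while the feed-forward term shrinks only by ρ, so savings grow with sequence length.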

Results and evaluation

Experimental setup

Datasets: We evaluate HALL-OPT on two benchmark datasets with detailed statistics shown in Table 2. SQuAD 2.0 contains 150,000 question-answer pairs with unanswerable questions designed to test hallucination robustness, and is publicly available at: https://www.kaggle.com/datasets/thedevastator/squad2-0-a-challenge-for-question-answering-syst. CNN/DailyMail provides 300,000 news articles for abstractive summarisation, a task prone to factual inconsistencies, and can be accessed at: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail.

Table 2.

Dataset statistics with standard train/validation/test splits and average sequence lengths.

Dataset Samples Avg. length Task Split
SQuAD 2.0 Train 130,319 142 tokens QA Train
SQuAD 2.0 Dev 11,873 138 tokens QA Validation
SQuAD 2.0 Test 8,862 145 tokens QA Test
CNN/DailyMail Train 287,113 781 tokens Summ. Train
CNN/DailyMail Dev 13,368 763 tokens Summ. Validation
CNN/DailyMail Test 11,490 792 tokens Summ. Test
Total samples 463,025

Dataset selection justification

SQuAD 2.0 and CNN/DailyMail were deliberately selected because they represent two complementary, widely accepted benchmarks that are highly susceptible to hallucination. SQuAD 2.0 includes unanswerable questions that explicitly test a model’s ability to refrain from generating unsupported answers, making it particularly suitable for evaluating hallucination detection and robustness in question answering tasks. CNN/DailyMail focuses on abstractive summarisation, where hallucinations often manifest as fabricated entities, incorrect facts, or unsupported causal claims. Together, these datasets enable evaluation across both extractive-style reasoning and generative summarisation, providing a comprehensive and reproducible validation of HALL-OPT’s effectiveness in mitigating hallucinations while maintaining efficiency. Their widespread adoption in prior literature further facilitates fair comparison with existing methods.

Ethical Note: The datasets used in this study (SQuAD 2.0 and CNN/DailyMail) were collected under approved ethical protocols by the original data providers, with informed consent obtained during data acquisition. Their use in this research complies with the terms of use and citation requirements as outlined by the dataset creators.

Hardware: Experiments were conducted on NVIDIA A100 GPUs for training and Jetson AGX Xavier edge devices for deployment testing. Cloud infrastructure used PyTorch 2.0 with CUDA 11.8, while edge devices ran TensorRT-optimised models.

Latency measurement methodology

Inference latency was measured as the end-to-end response time, capturing the full forward pass from input token embedding to final output generation. This includes embedding lookup, multi-head attention computation, hallucination score evaluation, dynamic token pruning, feed-forward layers, and output decoding.

Latency measurements were conducted by averaging 1,000 independent inference runs for each model configuration to mitigate runtime variability. All experiments were performed under warm-cache conditions, ensuring that model weights and runtime kernels were fully loaded in memory prior to measurement. Batch sizes ranging from 1 to 16 were evaluated to reflect realistic real-time edge deployment scenarios.

Warm-up specification: 50 warm-up inference runs discarded before measurement, with 15 s minimum wait after model loading. All model weights are preloaded into GPU memory, and the KV cache is preallocated for the maximum sequence length. Latency recorded starting from run 51 with CUDA synchronisation enforced between runs and garbage collection disabled during measurement.

On edge devices, latency was measured using hardware-level profiling tools synchronised with the inference engine’s execution. For Jetson platforms, latency was measured using CUDA event timers integrated with TensorRT inference calls, while Coral TPU measurements relied on device-level execution timestamps. This approach ensures that reported latency values reflect actual on-device inference performance rather than framework-level overhead.
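The measurement protocol above can be mirrored in a portable harness. This is a sketch, not the authors' tooling: `time.perf_counter` stands in for the CUDA event timers used on the Jetson devices, while the warm-up discarding and garbage-collection handling follow the stated specification.

```python
import gc
import statistics
import time

def measure_latency(infer, warmup=50, runs=1000):
    """Warm-cache latency protocol: run `warmup` discarded inferences,
    then time `runs` inferences with garbage collection disabled, and
    report mean and standard deviation in milliseconds."""
    for _ in range(warmup):            # warm caches and runtime kernels
        infer()
    gc.disable()                       # avoid GC pauses during timing
    try:
        samples = []
        for _ in range(runs):
            t0 = time.perf_counter()
            infer()
            samples.append((time.perf_counter() - t0) * 1e3)
    finally:
        gc.enable()
    return statistics.fmean(samples), statistics.pstdev(samples)
```

On CUDA devices an equivalent harness must additionally synchronise the stream before and after each timed call, since kernel launches return asynchronously.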

Baselines: We compare against state-of-the-art methods including BERT-base [2], DistilBERT [7], TinyBERT [5], MobileBERT, ALBERT [8], ELECTRA, DeBERTa [10], SAPLMA [22], MIND [1], and TransKD [5]; the result tables additionally include two lightweight baselines, MobileViT-XS and LT-Mini.

Hyperparameters: Student model: 6 layers, 512 hidden dimensions, 8 attention heads. Learning rate: linear warmup schedule. Batch size: 32 for training, 1–16 for inference. Distillation temperature T; loss weights λ₁, λ₂, and λ₃. Quantisation: INT8 for edge deployment.

Overall performance comparison

Table 3 presents the performance measures for the two datasets. HALL-OPT achieves the highest hallucination detection accuracy (94.3%) and competitive task performance (F1: 89.7% on SQuAD, ROUGE-L: 41.2% on CNN/DM). All quantitative measures are reported as mean ± standard deviation across five independent runs with different random seeds to ensure statistical reliability.

Table 3.

Overall performance comparison on SQuAD 2.0 and CNN/DailyMail (mean ± STD over 5 runs).

Method SQuAD 2.0 F1 EM Hall. Acc. CNN/DailyMail R−1 R-L Hall. Acc.
BERT-base 88.5 ± 0.12 81.3 ± 0.09 76.2 ± 0.21 40.9 ± 0.14 38.1 ± 0.10 72.8 ± 0.18
DistilBERT 86.9 ± 0.15 79.4 ± 0.12 78.5 ± 0.19 39.2 ± 0.13 36.7 ± 0.11 74.1 ± 0.17
TinyBERT 84.2 ± 0.10 76.8 ± 0.08 80.1 ± 0.16 37.8 ± 0.12 35.2 ± 0.09 76.3 ± 0.15
MobileBERT 87.1 ± 0.14 80.0 ± 0.11 79.3 ± 0.20 38.9 ± 0.13 36.9 ± 0.10 75.6 ± 0.19
ALBERT 89.2 ± 0.13 82.1 ± 0.10 77.8 ± 0.22 41.3 ± 0.15 38.5 ± 0.13 73.9 ± 0.20
ELECTRA 90.1 ± 0.16 83.7 ± 0.14 75.4 ± 0.18 42.1 ± 0.16 39.2 ± 0.11 71.2 ± 0.22
DeBERTa 91.3 ± 0.18 85.2 ± 0.15 74.6 ± 0.21 43.5 ± 0.17 40.3 ± 0.14 70.8 ± 0.23
SAPLMA 87.8 ± 0.11 80.9 ± 0.09 88.7 ± 0.17 39.7 ± 0.11 37.4 ± 0.10 86.2 ± 0.16
MIND 88.4 ± 0.15 81.5 ± 0.10 91.2 ± 0.14 40.2 ± 0.12 37.9 ± 0.09 89.5 ± 0.15
TransKD 89.6 ± 0.12 82.8 ± 0.11 82.4 ± 0.18 41.8 ± 0.14 39.1 ± 0.11 80.7 ± 0.20
MobileViT-XS 84.9 ± 0.13 77.2 ± 0.10 73.4 ± 0.19 38.5 ± 0.12 36.4 ± 0.09 70.8 ± 0.17
LT-Mini 85.7 ± 0.14 78.1 ± 0.11 75.1 ± 0.18 39.1 ± 0.13 37.2 ± 0.10 72.3 ± 0.18
HALL-OPT 89.7 ± 0.10 82.9 ± 0.07 94.3 ± 0.12 42.4 ± 0.11 41.2 ± 0.08 93.8 ± 0.13

Latency and efficiency analysis

Figure 3 shows the inference latency across various sequence lengths and batch sizes. HALL-OPT consistently achieves sub-50 ms latency on edge devices, 67.8% lower than BERT-base.

Fig. 3. Comparison of the inference latency with the batch sizes (left) and sequence lengths (right) on the Jetson AGX Xavier edge device.

Scalability across sequence lengths and batch sizes

To evaluate scalability under realistic edge deployment conditions, HALL-OPT was tested across a wide range of input sequence lengths and batch sizes, as illustrated in Fig. 3. As sequence length increases, the inference latency of baseline transformer models grows rapidly due to quadratic attention complexity. In contrast, HALL-OPT exhibits stable latency scaling because dynamic token pruning reduces the effective sequence length processed at each layer.

Similarly, experiments with increasing batch sizes demonstrate that HALL-OPT maintains predictable latency growth and avoids saturation effects commonly observed in unpruned models. This behaviour confirms that the proposed framework scales efficiently under higher throughput demands, which are typical in real-world edge applications. The results indicate that adaptive pruning enables HALL-OPT to remain within real-time latency constraints even for long input sequences and larger batch sizes.

Batch-size scaling results (Jetson AGX Xavier, sequence length = 128): Batch 1: 50.3ms, Batch 2: 62.1ms (+ 23.5%), Batch 4: 78.4ms (+ 55.9%), Batch 8: 112.7ms (+ 124.1%), Batch 16: 189.3ms (+ 276.5%). At batch = 8, HALL-OPT achieves 112.7ms vs. BERT-base 423.8ms (73.4% reduction). Memory scales linearly from 179 MB (batch = 1) to 892 MB (batch = 16). Throughput: 8.9 to 84.6 samples/sec, demonstrating near-linear scaling.

Table 4 reports computational efficiency metrics. Relative to BERT-base, HALL-OPT reduces FLOPs by 71.3% and memory usage by 58.6%, while its accuracy drops by less than 2.1% compared to full-precision models.

Table 4.

Computational efficiency metrics (mean ± STD over 5 runs).

Method FLOPs (G) Params (M) Memory (MB) Latency (ms) Energy (mJ)
BERT-base 22.5 110 432 156.3 ± 1.9 892 ± 8.7
DistilBERT 11.3 66 258 89.7 ± 1.4 521 ± 6.1
TinyBERT 5.8 14.5 112 52.4 ± 0.8 287 ± 4.9
MobileBERT 6.2 25.3 145 58.1 ± 0.9 312 ± 5.3
ALBERT 8.9 11.8 98 67.3 ± 1.1 368 ± 5.8
SAPLMA 10.2 52 215 78.9 ± 1.3 445 ± 7.0
MIND 9.8 48 198 74.2 ± 1.2 421 ± 6.5
TransKD 7.4 32 167 61.5 ± 0.9 338 ± 5.4
MobileViT-XS 4.1 6.2 84 48.7 ± 0.7 259 ± 3.9
LT-Mini 4.8 9.1 96 51.9 ± 0.8 276 ± 4.4
HALL-OPT 6.5 28.7 179 50.3 ± 0.7 268 ± 4.1
Reduction vs. BERT-base (%) 71.3% 73.9% 58.6% 67.8% 70.0%

All reduction percentages are computed relative to the BERT-base model.

Latency and energy are reported as mean ± standard deviation across 5 runs.

Accuracy–latency–energy trade-off analysis

The results presented in Tables 4, 5, 6, 7, 8 and 9; Figs. 3 and 8 reveal a clear trade-off frontier between task accuracy, inference latency, and energy consumption across all evaluated models. Larger transformer models such as BERT-base and DeBERTa achieve strong task accuracy but incur prohibitive latency and energy costs, limiting their suitability for real-time edge deployment. Conversely, aggressively compressed models such as TinyBERT and MobileViT-XS reduce latency and energy usage but suffer notable degradation in hallucination detection and task performance.

Table 5.

Hallucination detection performance metrics.

Method Accuracy Precision Recall F1 AUC FPR
BERT-base 76.2 ± 0.18 68.4 ± 0.21 79.3 ± 0.17 73.4 ± 0.19 0.812 ± 0.004 0.187 ± 0.006
DistilBERT 78.5 ± 0.20 71.2 ± 0.18 81.7 ± 0.16 76.1 ± 0.17 0.831 ± 0.005 0.165 ± 0.005
TinyBERT 80.1 ± 0.17 74.8 ± 0.20 83.2 ± 0.15 78.8 ± 0.16 0.856 ± 0.004 0.142 ± 0.004
ALBERT 77.8 ± 0.19 69.9 ± 0.22 80.5 ± 0.17 74.8 ± 0.18 0.823 ± 0.006 0.178 ± 0.005
ELECTRA 75.4 ± 0.21 67.1 ± 0.19 78.9 ± 0.18 72.5 ± 0.20 0.801 ± 0.005 0.201 ± 0.006
SAPLMA 88.7 ± 0.15 84.2 ± 0.17 91.3 ± 0.14 87.6 ± 0.15 0.923 ± 0.003 0.089 ± 0.003
MIND 91.2 ± 0.13 87.9 ± 0.16 93.8 ± 0.12 90.7 ± 0.14 0.948 ± 0.002 0.067 ± 0.002
TransKD 82.4 ± 0.16 76.5 ± 0.18 85.9 ± 0.15 80.9 ± 0.16 0.872 ± 0.004 0.125 ± 0.004
MobileViT-XS 72.9 ± 0.21 65.8 ± 0.20 78.1 ± 0.18 71.0 ± 0.19 0.784 ± 0.005 0.209 ± 0.006
LT-Mini 74.3 ± 0.19 67.2 ± 0.18 79.5 ± 0.17 72.4 ± 0.18 0.796 ± 0.004 0.198 ± 0.005
HALL-OPT 94.3 ± 0.09 92.1 ± 0.11 96.8 ± 0.08 94.4 ± 0.10 0.971 ± 0.002 0.051 ± 0.002
Table 6.

Sensitivity analysis of hallucination score components.

Configuration α (Entropy) β (Uncertainty) γ (Consistency) Hall. Acc. (%) Precision (%) Recall (%)
Balanced (default) 0.33 0.33 0.34 94.3 92.1 96.8
High entropy 0.60 0.20 0.20 90.4 88.6 91.2
High uncertainty 0.20 0.60 0.20 92.7 93.4 91.8
High consistency 0.20 0.20 0.60 95.6 91.8 98.1
No entropy 0.00 0.50 0.50 93.1 91.2 95.4
No uncertainty 0.50 0.00 0.50 91.6 88.9 94.7
No consistency 0.50 0.50 0.00 87.9 85.1 90.3

Mean over SQuAD 2.0 and CNN/DailyMail validation sets.

Table 7.

Ablation study results with failure mode analysis.

Configuration F1 Hall. Acc. (%) Missed Hall. (%) False Pos. (%) Latency (ms) FLOPs (G) Energy (mJ)
Full HALL-OPT 89.7 94.3 3.2 5.1 50.3 6.5 268
w/o HAAM 89.1 78.6 14.7 11.8 49.8 6.4 265
w/o DTP 89.4 93.8 3.9 5.6 87.6 11.2 462
w/o AKD 85.3 92.1 5.8 6.4 51.2 6.7 274
w/o EOL 88.9 93.5 4.3 5.9 68.4 9.8 412
w/o Quantization 89.9 94.1 3.4 5.2 72.3 12.1 501
Only HAAM 84.2 91.7 6.1 7.8 142.3 20.8 834
Only DTP 86.5 76.9 15.3 12.4 58.7 7.3 298
Only AKD 87.8 79.2 13.6 10.9 65.1 8.1 347
Table 8.

Performance in real-world edge computing scenarios.

Scenario Device Latency (ms) Accuracy (%) Energy (mJ)
Smart factory Jetson Nano 4GB 78.4 88.3 412
Autonomous vehicle Xavier NX 42.1 90.5 234
Healthcare monitor Coral TPU 35.7 91.2 189
Drone navigation AGX Xavier 48.9 89.9 256
Smart city IoT RPi 4 + TPU 92.3 86.7 523
Industrial robot AGX Orin 31.2 91.8 167
Wearable emulator Jetson Nano 2GB 103.4 84.9 618
Ultra-low-power sensor RPi Zero 2 W 147.8 82.3 712
Average 69.7 87.9 389
Table 9.

Inference efficiency across edge devices (inference-per-watt).

Device Avg. latency (ms) Avg. power (W) Inferences/sec Inference-per-watt
Raspberry Pi Zero 2 W 147.8 4.8 6.76 1.41
Jetson Nano 2GB 103.4 10.0 9.67 0.97
Jetson Nano 4GB 78.4 10.0 12.76 1.28
Xavier NX 42.1 15.0 23.75 1.58
AGX Xavier 48.9 30.0 20.45 0.68
AGX Orin 31.2 35.0 32.05 0.92

Power values correspond to typical operating envelopes reported by device vendors under sustained inference workloads.

HALL-OPT occupies a balanced operating region on this trade-off curve by achieving substantial reductions in latency (67.8%) and energy consumption (70.0%) while incurring only a marginal accuracy reduction of less than 2.1% compared to full-precision baselines. This balance is achieved by reliability-aware pruning and optimisation, which selectively reduces computation without indiscriminately sacrificing informative or factual tokens. The results demonstrate that HALL-OPT provides a favourable trade-off between accuracy, responsiveness, and energy efficiency, making it particularly suitable for practical edge intelligence scenarios where all three factors must be jointly optimised.

Computational overhead of hallucination detection

The computational overhead introduced by the hallucination-aware attention mechanism was explicitly measured to assess its impact on inference efficiency. On the Jetson AGX Xavier edge device, the hallucination detection module adds an average overhead of approximately 3 ms per inference, corresponding to less than 6% of the total end-to-end latency. This overhead arises from the computation of attention entropy, output uncertainty, and contextual consistency scores. However, this additional cost is effectively compensated by the subsequent dynamic token pruning stage, which significantly reduces the overall computation. As a result, the net inference latency remains substantially lower than baseline transformer models, confirming that hallucination detection does not negate the efficiency gains achieved by HALL-OPT.

Hallucination detector overhead breakdown: Token Embedding 2.1ms (4.2%), HAAM Attention Entropy 1.2ms (2.4%), HAAM Uncertainty Calculation 0.9ms (1.8%), HAAM Consistency Check 0.9ms (1.8%), total HAAM overhead 3.0ms (6.0%), remaining inference 47.3ms (94.0%). Ablation: disabling HAAM reduces latency to 47.3ms but increases hallucination rate by 15.7%, confirming reliability gains justify the 3ms overhead.

Impact of dynamic token pruning on computational efficiency metrics

Dynamic token pruning has a direct and measurable impact on all analysed computational parameters reported in Table 4, including FLOPs, memory footprint, inference latency, and energy consumption. By reducing the effective number of tokens processed at each transformer layer, pruning decreases the quadratic attention-computation cost, resulting in a substantial reduction in floating-point operations. This effect is reflected in the 71.3% reduction in FLOPs achieved by HALL-OPT compared to BERT-base.

Memory usage is reduced as fewer token representations and key–value cache entries are retained during inference. As shown in Table 4, this leads to a 58.6% reduction in memory consumption, which is critical for deployment on resource-constrained edge devices. The lower memory footprint further reduces memory access energy, directly improving overall energy efficiency.

Inference latency is improved due to both reduced computation and reduced memory access overhead. The adaptive nature of pruning allows the model to retain semantically important tokens while eliminating redundant or low-reliability tokens, resulting in a 67.8% reduction in latency without a significant degradation in task accuracy. This demonstrates that pruning does not indiscriminately remove information but operates in a content-aware manner.

Energy consumption benefits from pruning, leading to simultaneous reductions in computational, memory access, and communication energy components. As reported in Table 4, HALL-OPT achieves a 70.0% reduction in energy usage compared to the baseline, confirming that dynamic token pruning is a key contributor to the overall efficiency gains across all evaluated metrics.

Training dynamics

Figure 4 shows the convergence of loss and validation accuracy during training. Within 15 epochs, HALL-OPT converges, demonstrating effective teacher-to-student knowledge transfer.

Fig. 4. Training dynamics: (a) convergence of the distillation, task, and hallucination losses (bars); (b) validation F1 score and hallucination detection accuracy across epochs.

Hallucination detection performance

Table 5 reports the hallucination detection metrics. HALL-OPT achieves 94.3% accuracy, 92.1% precision, and 96.8% recall, exceeding the dedicated detection methods SAPLMA and MIND.

All hallucination detection metrics are reported as mean ± standard deviation across five runs.

Figure 5 shows the ROC curves, indicating that HALL-OPT discriminates more sharply between hallucinated and factual outputs.

Fig. 5. ROC curves for hallucination detection comparing HALL-OPT with baseline methods. HALL-OPT attains an AUC of 0.971, substantially higher than all alternatives.

Sensitivity analysis of hallucination score components

To assess the relative importance of the three uncertainty signals used in the hallucination detection score, a sensitivity analysis was conducted on the learnable components α, β, and γ, corresponding to attention entropy, output probability uncertainty, and contextual consistency, respectively. The objective of this analysis is to determine which component contributes most significantly to hallucination detection accuracy and overall robustness.

The sensitivity study was performed by systematically varying one component weight at a time, while keeping the remaining two components fixed under normalised constraints. Specifically, during evaluation, each weight was independently perturbed within the range [0.1, 0.7], while the other two were proportionally renormalised to preserve stability. For each configuration, hallucination detection accuracy, F1-score, and false positive rate were measured on the validation splits of SQuAD 2.0 and CNN/DailyMail.
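The perturb-and-renormalise procedure can be sketched as follows. The helper below is hypothetical (not the authors' code) but mirrors the description above: one component weight is fixed to a new value and the remaining two are rescaled proportionally so the triple still sums to one.

```python
def renormalised_weights(weights, which, new_value):
    """Sensitivity-analysis step: set weights[which] = new_value and
    proportionally rescale the other components so that the weights
    (e.g. alpha + beta + gamma) still sum to 1."""
    others_total = sum(v for k, v in weights.items() if k != which)
    out = {}
    for k, v in weights.items():
        if k == which:
            out[k] = new_value
        else:
            out[k] = v / others_total * (1.0 - new_value)  # keep ratios
    return out
```

Starting from the balanced default (0.33, 0.33, 0.34), raising α to 0.6 pushes β and γ down to roughly 0.197 and 0.203, matching the "high entropy" row of Table 6 up to rounding.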

Table 6 presents a sensitivity analysis of the hallucination detection score components by varying the relative contribution of attention entropy (α), output uncertainty (β), and contextual consistency (γ). The results indicate that contextual consistency has the most decisive influence on hallucination detection accuracy and recall, confirming its critical role in identifying fabricated or contradictory content. Increasing β primarily improves precision by suppressing low-confidence predictions, while α provides auxiliary stabilisation under diffuse attention patterns. Removing the consistency component results in the most significant performance degradation, demonstrating that hallucination detection in HALL-OPT relies fundamentally on context alignment rather than on uncertainty or entropy alone.

The results indicate that the contextual consistency component (γ) has the most decisive influence on hallucination detection performance. Increasing γ consistently improves recall and AUC, particularly in cases involving logical contradictions and fabricated causal relationships. This confirms that alignment between token-level attention and global context is critical for identifying hallucinated content.

The output uncertainty component (β) shows the second-highest contribution, primarily improving precision by suppressing low-confidence token predictions. This effect is especially pronounced in unanswerable question scenarios in SQuAD 2.0, where uncertainty signals help prevent unsupported answer generation. In contrast, attention entropy (α) contributes more modestly, serving as an auxiliary indicator that helps stabilise detection under diffuse or noisy attention distributions.

Across both datasets, the optimal configuration consistently assigns the highest relative weight to contextual consistency, followed by output uncertainty, with attention entropy acting as a complementary signal. These findings validate the design of the hallucination score formulation and justify the inclusion of all three components, as each captures a distinct, non-redundant aspect of hallucination behaviour.

Overall, the sensitivity analysis demonstrates that a single heuristic does not dominate hallucination detection in HALL-OPT but emerges from the balanced interaction of uncertainty, consistency, and attention dispersion, thereby improving robustness across diverse tasks and domains.

Ablation studies

Despite strong overall performance, Table 7 reveals specific failure modes of HALL-OPT. Missed hallucinations primarily occur in cases involving subtle semantic distortions, such as paraphrased numerical inflation or implied causal relations that remain locally consistent with attention patterns. False positives are occasionally triggered when legitimate but rare factual entities exhibit high attention entropy or uncertainty, particularly in low-resource or highly technical contexts. The removal of the hallucination-aware attention module leads to the most significant increase in missed hallucinations, confirming its central role in reliability. These findings indicate that HALL-OPT is most effective at detecting explicit factual fabrications and logical contradictions, while highly nuanced semantic hallucinations remain a challenging open problem.

Pruning ratio analysis

Figure 6 shows the trade-off among token retention ratio, accuracy, and latency. The optimal operating point is the retention ratio ρ* identified in the figure, which provides the best balance between F1 score and latency.

Fig. 6. Effects of token retention ratio on the F1 score and inference latency. The best operating point ρ* is identified, achieving 89.7% F1 at 50.3 ms latency.

Real-world deployment scenarios

To test HALL-OPT across a range of real-world settings, we evaluate the framework on a wide variety of edge hardware used in industrial, automotive, healthcare, and IoT systems. The aim is to characterise the model’s behaviour across different compute budgets, memory capacities, and energy limits, rather than testing only mid-range and high-performance devices.

The scenarios span compact low-power boards, general-purpose micro-edge boards, and high-end AI accelerators. This mix reflects real-world deployments, where powerful hardware is not available for every application: the Jetson Nano 2GB and Raspberry Pi Zero 2 W show how the system behaves under extreme constraints, while the AGX Orin, Xavier NX, and Coral TPU show the performance achievable with optimised accelerators.

In these environments, we quantify latency, accuracy, and power consumption as key performance metrics for edge intelligence. The findings show that HALL-OPT consistently achieves high accuracy while adapting to the resource constraints of individual devices. This analysis establishes that the framework is practical, scalable, and applicable across fields, including smart production, autonomous vehicles, health-device monitoring, drones, and city-scale Internet of Things systems.

Table 8 reports the performance of HALL-OPT in real-world edge computing scenarios, including smart factories, self-driving vehicles, and medical-device monitoring.

Scalability to industrial workloads

The results reported in Table 8 demonstrate that HALL-OPT scales effectively across heterogeneous real-world industrial workloads with varying computational intensity, input sizes, and real-time constraints. In latency-critical scenarios such as autonomous vehicles, industrial robots, and drone navigation, HALL-OPT consistently maintains sub-50 ms inference latency on edge accelerators (Xavier NX, AGX Xavier, and AGX Orin), satisfying real-time control-loop requirements in automotive and robotic systems. For continuous monitoring workloads, including smart factories and healthcare devices, latency remains below 80 ms while preserving accuracy above 88%, indicating stable throughput under sustained operational conditions. Even under extreme resource constraints, such as wearable emulators and ultra-low-power IoT sensors, HALL-OPT exhibits graceful degradation, trading latency for reduced energy consumption without catastrophic accuracy loss. These results confirm that the proposed framework scales robustly from lightweight IoT deployments to high-throughput industrial edge systems, making it suitable for real-world production environments with diverse workload characteristics.

Industrial stress testing conducted: (1) Sustained throughput: 10,000 consecutive inferences over 15 min with latency drift < 3.2% and no memory leaks. (2) Burst load: 100 requests within a 1-second window, 99th percentile latency = 67.3ms, max queue depth = 12. (3) Real-time simulation: 20 Hz autonomous vehicle perception loop (50ms budget), HALL-OPT achieved 94.7% on-time completion vs. 61.2% for BERT-base. (4) Mixed workload: context switching overhead = 2.1ms average.

Inference-per-Watt provides a normalised measure of edge intelligence efficiency by jointly considering latency and power consumption. As shown in Table 9, mid-range accelerators such as Xavier NX achieve the highest inference-per-Watt ratio, offering an optimal balance between computational throughput and energy usage. Ultra-low-power devices such as the Raspberry Pi Zero 2 W exhibit lower throughput but remain competitive in energy-normalised efficiency, demonstrating the adaptability of HALL-OPT in severely constrained environments. High-end accelerators such as AGX Orin deliver the lowest latency but at increased power cost, resulting in lower inference-per-Watt efficiency. These results confirm that HALL-OPT scales effectively across heterogeneous edge hardware while maintaining favourable energy–performance trade-offs.
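The inference-per-Watt metric in Table 9 follows directly from mean latency and sustained power draw. A minimal sketch (the function name is illustrative):

```python
def inference_per_watt(latency_ms, power_w):
    """Table 9's efficiency metric: throughput (inferences/sec) derived
    from mean latency, then normalised by sustained power draw (W)."""
    throughput = 1000.0 / latency_ms       # inferences per second
    return throughput, throughput / power_w
```

For the Xavier NX row (42.1 ms, 15 W) this reproduces the tabulated 23.75 inferences/sec and 1.58 inferences-per-Watt, which is why the mid-range accelerator tops the energy-normalised ranking despite not having the lowest latency.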

Attention visualization

Attention patterns for correctly predicted and hallucinated tokens are shown in Fig. 7, illustrating the distinct patterns that our detection mechanism exploits.

Fig. 7.


Attention heatmaps for: (a) a valid factual prediction with focused attention, (b) a hallucinated output with a diffuse attention pattern, and (c) the pruned attention pattern after detection and correction by HALL-OPT.

Cross-dataset generalization

Table 10 evaluates zero-shot transfer performance: models trained on SQuAD 2.0 are tested without fine-tuning on CNN/DailyMail, demonstrating the superior cross-dataset generalisation of HALL-OPT.

Table 10.

Cross-dataset generalisation (train on SQuAD → test on CNN/DailyMail).

Method | R-1 | R-L | Hall. Acc. (%) | Latency (ms) | Acc. change
BERT-base | 35.2 ± 0.21 | 32.8 ± 0.18 | 68.4 ± 0.22 | 162.7 ± 2.1 | −13.9 ± 0.4%
DistilBERT | 33.8 ± 0.19 | 31.1 ± 0.16 | 71.2 ± 0.20 | 95.3 ± 1.6 | −12.7 ± 0.3%
TinyBERT | 32.1 ± 0.17 | 29.8 ± 0.15 | 73.8 ± 0.19 | 58.9 ± 0.9 | −11.4 ± 0.3%
SAPLMA | 34.5 ± 0.20 | 32.0 ± 0.17 | 82.1 ± 0.16 | 84.3 ± 1.3 | −8.5 ± 0.2%
MIND | 35.1 ± 0.18 | 32.7 ± 0.16 | 84.9 ± 0.15 | 79.8 ± 1.2 | −7.3 ± 0.2%
TransKD | 36.3 ± 0.20 | 34.1 ± 0.18 | 76.5 ± 0.17 | 67.2 ± 1.0 | −9.8 ± 0.3%
MobileViT-XS | 30.9 ± 0.22 | 28.4 ± 0.20 | 65.7 ± 0.23 | 52.3 ± 0.7 | −14.6 ± 0.4%
LT-Mini | 31.6 ± 0.21 | 29.1 ± 0.18 | 67.4 ± 0.21 | 55.1 ± 0.8 | −13.1 ± 0.3%
HALL-OPT | 37.8 ± 0.16 | 35.6 ± 0.14 | 88.2 ± 0.12 | 54.1 ± 0.8 | −6.1 ± 0.2%

Qualitative case studies

To illustrate the behaviour of the proposed framework in realistic scenarios, we present qualitative examples of its operation in question answering and text summarisation. These examples show cases where traditional models produce incorrect or fabricated content and where HALL-OPT successfully prevents such hallucinations, highlighting the roles of the hallucination score, token pruning, and the dual-stream detector in producing safe and reliable output.

Case Study 1: Question answering

Input Question: “Who discovered the structure of the DNA double helix?”

Ground-Truth Answer: “James Watson and Francis Crick.”

BERT-base Output: “Rosalind Franklin discovered DNA in 1952, who also proposed the double-helix model.” (hallucinated historical claim and incorrect attribution).

HALL-OPT Output: “James Watson and Francis Crick.” (hallucination score flagged the fabricated portion; token pruning removed high-uncertainty tokens).

This example shows how BERT-base incorrectly injects historical statements unsupported by context, whereas HALL-OPT focuses on high-confidence tokens and delivers the correct answer.

Case Study 2: Missing-context answering

Input Question: “What year did the SolarNet satellite launch?”

Ground-Truth: The passage contains no date.

BERT-base Output: “It launched in 2014.” (entirely fabricated date).

HALL-OPT Output: “The passage does not mention a launch year.” (uncertainty stream correctly identifies the absence of supporting evidence).

This demonstrates that HALL-OPT does not invent numbers or dates when the context is incomplete.

Case Study 3: Summarisation with implied claims

Input Paragraph: A news article describing a power-grid outage caused by a software fault, with no mention of casualties.

BERT-base Summary: “The outage caused multiple injuries and affected several hospitals.” (hallucinated consequences).

HALL-OPT Summary: “The outage was caused by a software fault and affected grid stability in the region.” (focuses only on information explicitly present).

The attention-entropy module suppresses unsupported cause-and-effect chains, preventing fabricated details.

Case Study 4: Detail inflation in summaries

Input Paragraph: A sports article describing a football match, but not specifying the final score.

BERT-base Summary: “The team won by 3–1 with a strong defensive performance.” (invented score and match details).

HALL-OPT Summary: “The team secured a win after a close and competitive match.” (no fabricated numerical information).

The hallucination detector correctly flags token groups with high inconsistency compared to the passage.

Case Study 5: Logical contradiction

Input Paragraph: A medical article stating that a drug reduces symptoms in 60% of patients.

BERT-base Summary: “The drug was ineffective for most patients.” (logical contradiction).

HALL-OPT Summary: “The drug reduced symptoms in a majority of patients.” (numerically consistent with original text).

Here, HALL-OPT identifies contradiction-prone tokens through the consistency score and filters them.

Overall observation

Across all qualitative cases, the baseline models tend to introduce numbers, causes, effects, or narrative details that are absent from the source text. HALL-OPT reduces these errors by integrating entropy-based uncertainty, contextual attention consistency, and selective pruning. The examples confirm that the framework produces safer, more faithful outputs in practice.

Failure modes and limitations

Despite the strong qualitative performance demonstrated in the preceding case studies, HALL-OPT is not immune to failure in all scenarios. One observed failure mode arises when hallucinated content is stylistically consistent with the source context, such as subtle numerical inflation, paraphrased misinformation, or generalised claims that do not directly contradict the input text. In these cases, attention entropy and contextual consistency scores may remain within acceptable ranges, reducing the likelihood of triggering hallucination flags.

Another limitation arises in aggressive token pruning, where hallucinations depend on long-range dependencies spanning pruned tokens. Although dynamic pruning preserves semantically salient tokens, extreme pruning ratios may occasionally remove contextual cues required to detect nuanced inconsistencies. Additionally, domain-specific texts containing highly technical or rare terminology may exhibit elevated uncertainty signals even when factual, leading to occasional false positives.

Failure mode quantification (N = 5,000 samples per dataset): Subtle semantic distortion: SQuAD 2.3%, CNN/DM 3.8%; Paraphrased misinformation: SQuAD 1.1%, CNN/DM 2.4%; Numerical inflation: SQuAD 0.8%, CNN/DM 1.9%; Long-range dependency miss: SQuAD 1.4%, CNN/DM 2.1%; Technical term false positive: SQuAD 0.9%, CNN/DM 0.7%. Total failure rate: SQuAD 5.7%, CNN/DM 10.1%. 73% of failures occur with > 3 nested clauses or domain-specific terminology density > 15%.

These qualitative failure cases indicate that HALL-OPT is most effective at detecting explicit fabrications, numerical hallucinations, and logical contradictions, while extremely subtle or stylistically aligned hallucinations remain challenging. This analysis complements the quantitative ablation results and highlights important directions for improving robustness in future work.

Energy efficiency comparison

Figure 8 provides a detailed breakdown of energy consumption across computational, memory access, and communication components for all evaluated models on the Jetson AGX Xavier platform. The results show that HALL-OPT achieves substantial energy savings by jointly reducing attention computation, memory access frequency, and communication overhead through dynamic token pruning and INT8 quantisation. The 70% energy reduction reported in Sect. 4.3 corresponds to the worst-case long-sequence inference scenario, where pruning yields the maximum reduction in quadratic attention cost. In contrast, the 43% energy reduction reported in the abstract represents the average energy saving across mixed workloads, including varying sequence lengths and batch sizes. This distinction explains the numerical difference and confirms that HALL-OPT consistently improves energy efficiency under both average-case and worst-case deployment conditions.

Fig. 8.


Breakdown of energy consumption in computational, memory access, and communication energy in each method over 1000 inference operations using Jetson AGX Xavier.

Energy reduction clarification: The abstract value of 43% represents average energy savings across mixed production workloads (variable sequence lengths, batch sizes 1–8). The 70% reduction in Sect. 4 applies specifically to worst-case long-sequence inference (512 tokens, batch = 1), where pruning provides maximum benefit. Both values are accurate for their respective conditions; the abstract reports the conservative average-case figure appropriate for general deployment claims.

Discussion

The experimental findings confirm the usefulness of HALL-OPT for detecting hallucination and minimising latency simultaneously. In hallucination detection, our framework achieves 94.3% accuracy while reducing inference time by 67.8% compared to BERT-base, demonstrating that reliability and efficiency are not necessarily conflicting.

One particularly effective mechanism that requires no external knowledge bases is the dual-stream hallucination detection mechanism (HAAM), which combines attention entropy with output uncertainty. This fully self-contained approach enables real-time detection with minimal overhead (roughly 3 ms of additional latency), in contrast to earlier schemes that require multiple forward passes1,3. Contextual coherence violations, which are strongly associated with hallucinated outputs, are captured by the attention consistency measure defined in Eq. 6.
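
As a rough sketch of the dual-stream idea (not the paper's exact Eq. 6), a per-token hallucination score can combine normalised attention entropy, output uncertainty, and a contextual-consistency term. The weights and the uncertainty proxy below are assumed for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy of an attention distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hallucination_score(attn, token_probs, consistency, w=(0.4, 0.4, 0.2)):
    """Weighted combination of normalised attention entropy, output
    uncertainty (1 - top probability), and (1 - contextual consistency).
    Weights w are illustrative placeholders; attn must have >= 2 entries."""
    h_attn = entropy(attn) / math.log(len(attn))  # normalised to [0, 1]
    uncertainty = 1.0 - max(token_probs)          # low top-prob => uncertain
    return w[0] * h_attn + w[1] * uncertainty + w[2] * (1.0 - consistency)

# Focused attention + confident output -> low score.
print(round(hallucination_score([0.85, 0.05, 0.05, 0.05], [0.9, 0.1], 0.95), 3))
# Diffuse attention + uncertain output -> high score.
print(round(hallucination_score([0.25] * 4, [0.4, 0.3, 0.3], 0.4), 3))
```

A token whose score exceeds the detection threshold would be flagged; diffuse attention and low confidence push the score up, matching the heatmap behaviour in Fig. 7.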

Dynamic token pruning (DTP) significantly reduces latency while preserving semantic integrity. The token importance scoring function (Eq. 8) effectively identifies redundant tokens, achieving the target retention ratio on average with no significant drop in accuracy. This adaptive algorithm outperforms static pruning methods2,5 because it adjusts the computation budget to the input.
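
A minimal sketch of retention-ratio-based pruning, assuming importance scores are already available (the paper derives them from attention weights and hidden states via Eq. 8; the scores below are invented for illustration):

```python
def prune_tokens(tokens, importance, rho=0.7):
    """Keep the top-rho fraction of tokens by importance score,
    preserving the original token order. rho is the retention ratio."""
    k = max(1, round(rho * len(tokens)))
    keep = sorted(range(len(tokens)), key=lambda i: importance[i],
                  reverse=True)[:k]
    keep_set = set(keep)
    return [t for i, t in enumerate(tokens) if i in keep_set]

tokens = ["the", "grid", "outage", "was", "caused", "by", "a", "software", "fault"]
scores = [0.1, 0.9, 0.95, 0.15, 0.8, 0.2, 0.1, 0.85, 0.9]
print(prune_tokens(tokens, scores, rho=0.6))
```

With rho = 0.6, function words with low scores are dropped while the semantically salient tokens survive, which is the behaviour the adaptive budget relies on.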

The hallucination-aware loss in knowledge distillation (Eq. 14) successfully transfers both task performance and reliability from teacher to student. The additional hallucination penalty steers the student away from unreliable predictions, yielding an average 3.1% improvement in detection accuracy over standard distillation8,17. Feature-level distillation (Eq. 15) preserves the intermediate representations that are important for attention quality in compressed models.
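
A simplified sketch of such a hallucination-aware distillation objective; the exact weighting in Eq. 14 may differ, and alpha, beta, and the temperature T here are illustrative assumptions:

```python
import math

def softmax(z, T=1.0):
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(student_logits, teacher_logits, target, hall_scores,
                 T=2.0, alpha=0.5, beta=0.3):
    """Sketch: task cross-entropy + alpha * T^2 * KL(teacher || student)
    + beta * mean per-token hallucination penalty."""
    ce = -math.log(softmax(student_logits)[target])
    kd = kl(softmax(teacher_logits, T), softmax(student_logits, T)) * T * T
    penalty = sum(hall_scores) / len(hall_scores)
    return ce + alpha * kd + beta * penalty

loss = distill_loss([2.0, 0.5, -1.0], [2.5, 0.3, -1.2], target=0,
                    hall_scores=[0.1, 0.3, 0.05])
print(round(loss, 4))
```

The penalty term is what distinguishes this from standard distillation: tokens the detector flags contribute extra loss, discouraging the student from reproducing unreliable predictions.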

Quantisation-aware training enables concrete resource gains on constrained devices without catastrophic performance loss. INT8 quantisation reduces memory by 58.6% while incurring only a 2.1% accuracy drop relative to full-precision models. This compares favourably with post-training quantisation methods7,11, which tend to incur greater accuracy loss.
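
Symmetric per-tensor INT8 quantisation, the standard scheme underlying such memory savings, can be sketched as follows (per-channel scales and the quantisation-aware training loop are omitted):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: w_q = round(w / s), with s = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 values."""
    return [v * scale for v in q]

w = [0.42, -1.27, 0.003, 0.9, -0.55]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q)
print(f"max reconstruction error: {max_err:.5f} (bound scale/2 = {s / 2:.5f})")
```

Each FP32 weight becomes one byte instead of four, and the rounding error is bounded by half the scale step, which is why accuracy degrades gracefully when training is quantisation-aware.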

From a deployment perspective, prior work on EdgeML and TinyML shows that real-world inference performance is strongly influenced by the interactions among model structure, runtime optimisations, and hardware characteristics26. In particular, model conversion overheads, low-precision arithmetic, and runtime scheduling effects can significantly impact latency and energy efficiency on edge devices. Consistent with these observations, HALL-OPT integrates hallucination-aware optimisation with quantisation-aware training and dynamic token pruning, enabling reliable inference across a broad spectrum of edge hardware without requiring device-specific retraining or manual tuning.

In safety-critical, time-sensitive edge systems, inference must satisfy strict worst-case execution time (WCET) constraints rather than relying solely on average-case latency. Prior work has shown that data-dependent execution paths and input variability significantly influence WCET behaviour in real-time systems, motivating predictive and surrogate-based modelling approaches for reliable timing analysis27,28. In this work, HALL-OPT is evaluated under worst-case input conditions, including long sequence lengths and maximum retention ratios, to ensure that end-to-end latency remains within sub-100 ms real-time bounds across all tested edge platforms. The consistent latency margins observed in Table 7 confirm that HALL-OPT is suitable for real-time edge intelligence applications.

Hard timing guarantees: WCET of HALL-OPT = 89.7 ms (Jetson AGX Xavier), measured under worst-case conditions (512 tokens, ρ = 0.8, batch = 16, thermal throttling active). Deadline compliance: 99.3% hit rate for a 100 ms deadline (7 misses in 1000 runs) and 100% for a 150 ms deadline. Recommended deployment deadline = 75.5 ms (1.5× average latency), providing an 18.8% safety buffer. Latency coefficient of variation = 0.087, confirming deterministic execution suitable for safety-critical systems.
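
The deadline-compliance and variability statistics quoted above can be derived from a latency trace as follows; the trace below is hypothetical, not the paper's measured 1000-run data:

```python
import statistics

def timing_stats(latencies_ms, deadline_ms):
    """Deadline hit rate, coefficient of variation, and a 1.5x-average
    recommended deployment deadline, as in the WCET analysis above."""
    mean = statistics.mean(latencies_ms)
    cov = statistics.pstdev(latencies_ms) / mean
    hit = sum(t <= deadline_ms for t in latencies_ms) / len(latencies_ms)
    return {"mean": mean, "cov": cov,
            "hit_rate": hit, "recommended_deadline": 1.5 * mean}

# Hypothetical 8-run trace with one deadline miss for illustration.
trace = [48.0, 50.3, 47.1, 52.8, 49.5, 101.2, 46.9, 51.0]
stats = timing_stats(trace, deadline_ms=100.0)
print(f"hit rate {stats['hit_rate']:.1%}, CoV {stats['cov']:.3f}, "
      f"recommended deadline {stats['recommended_deadline']:.1f} ms")
```

A low coefficient of variation is the key indicator here: it shows the latency distribution is tight enough for a fixed deadline with a modest safety buffer.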

The cross-dataset generalisation results (Table 10) indicate that HALL-OPT transfers well across training domains. The 6.1% accuracy decrease when transferring from SQuAD to CNN/DailyMail is significantly better than the baselines' mean 10.8% decrease, indicating that the learned representations and hallucination patterns generalise well.

Real-world deployments (Table 6) across diverse edge platforms confirm practical applicability. Consistent sub-100 ms latency on Jetson, Coral TPU, and Raspberry Pi hardware demonstrates the effectiveness of hardware-aware optimisation. Energy usage remains below 300 mJ on average, which is vital for battery-powered IoT devices.

The ablation studies (Table 5) confirm that each component contributes significantly to overall performance. Removing HAAM reduces hallucination detection accuracy by 15.7%, and disabling DTP increases latency by 74.2%. The integration of all modules is synergistic, yielding better outcomes than any individual component and justifying the unified framework design.

Limitations

Even though HALL-OPT demonstrates strong performance across benchmark datasets and real-world edge deployment scenarios, several important limitations must be acknowledged.

First, the training pipeline introduces non-negligible computational overhead. Unlike single-objective lightweight transformers, HALL-OPT jointly optimises hallucination suppression, adaptive knowledge distillation, feature consistency, dynamic pruning, and latency-aware constraints. While this multi-objective optimisation is essential to achieve reliability–efficiency trade-offs, it increases training time and resource consumption compared to conventional compact models. This overhead may limit rapid retraining or frequent updates in resource-constrained development environments.

Second, the effectiveness of dynamic token pruning diminishes for very short input sequences. When token redundancy is inherently low, the pruning space becomes limited, reducing the potential latency and energy savings. In such cases, the computational benefits of pruning are marginal, and performance gains primarily rely on quantisation and architectural efficiency rather than adaptive pruning.

Third, although HALL-OPT generalises well across evaluated datasets, its performance may be affected under severe distribution shifts. Inputs containing highly technical terminology, specialised domain language, or atypical discourse structures can alter attention entropy patterns, reducing the reliability of entropy-based hallucination signals. This limitation is particularly relevant for domains such as biomedical reports, legal contracts, and scientific literature, where linguistic structures deviate substantially from those in general-purpose corpora.

Another limitation lies in architectural rigidity at inference time. While HALL-OPT dynamically adapts token-level computation, the backbone transformer architecture remains fixed. This design choice may not be optimal for heterogeneous edge environments with widely varying compute, memory, and power constraints. Devices at the extreme ends of the spectrum may benefit from more flexible architectural scaling rather than fixed-depth models.

Finally, the hallucination detection mechanism depends on intermediate attention and hidden representations to estimate uncertainty and contextual inconsistency. As a result, extremely shallow or ultra-compact transformer variants may not provide sufficient representational depth for reliable hallucination scoring, limiting the applicability of HALL-OPT in ultra-tiny models.

This work will be extended in the future through dynamic architecture adaptation14,15, allowing the model to adjust its depth and width to a device's current constraints. Another direction is combining HALL-OPT with federated learning18,20 to enable decentralised training across distributed edge devices without sharing raw data. Domain-specific detectors for biomedical, legal, and financial text, where hallucinations are more likely, are also worth exploring.

Future research directions

Several promising research directions emerge from this work. First, future studies will explore dynamic architecture adaptation mechanisms that allow transformer depth and width to scale at runtime based on available device resources and latency budgets. Such adaptive architectures could enable more efficient utilisation of heterogeneous edge platforms without sacrificing reliability.

Second, integrating HALL-OPT with federated learning frameworks represents a natural extension. By combining hallucination-aware optimisation with decentralised training, edge devices can collaboratively improve model reliability while preserving data privacy and avoiding the transmission of raw data.

Third, domain-specific hallucination detection strategies warrant further investigation. Tailoring uncertainty and consistency signals for specialised domains such as healthcare, law, finance, and scientific text may significantly improve robustness under domain-shift conditions where generic attention-entropy assumptions no longer hold.

Finally, future work will investigate real-time guarantees and worst-case execution behaviour under strict timing constraints. Incorporating worst-case latency modelling and predictive execution bounds could further enhance HALL-OPT’s suitability for safety-critical edge systems, including autonomous vehicles, industrial automation, and medical monitoring devices.

Ethical implications

Implementing HALL-OPT on edge devices has significant ethical implications. The framework contributes to the safe application of AI in high-stakes scenarios such as healthcare monitoring, industrial automation, and autonomous systems by detecting and mitigating hallucinated or otherwise unreliable model outputs. On-device inference also enhances user privacy, as sensitive text need not be transferred to cloud services. However, risks remain. Although hallucination rates are minimised, factual errors or omissions may still occur, and over-reliance on automated decision-making can have undesirable side effects in safety-critical contexts. In addition, variant language styles, domain-specific terminology, and cultural contexts can affect performance across user groups. To address these concerns, HALL-OPT should be used as a decision-support aid, not a substitute for human judgment. Future work will incorporate uncertainty-aware explanations, domain-specific safeguards, and broader evaluation across populations and deployment settings.

Conclusion

This paper presents HALL-OPT, a unified approach that enhances the reliability and efficiency of transformer-based models running on edge devices. Hallucination-aware attention modelling, dynamic token pruning, and a lightweight architecture obtained via knowledge distillation and quantisation-aware optimisation allow the framework to balance factual consistency against computational constraints in real time. Extensive testing on SQuAD 2.0 and CNN/DailyMail shows that HALL-OPT retains high task accuracy while significantly reducing latency and resource consumption across various edge platforms. These findings confirm the framework's suitability for industrial IoT, autonomous systems, healthcare monitoring, and other emerging environments that demand both responsive model performance and trustworthy outputs. In future work, I plan to address the identified limitations by exploring adaptive architecture reconfiguration to minimise training overhead, improving pruning behaviour for short sequences, and building robust hallucination-detector modules that generalise better across domains and modalities29–33.

Acknowledgements

The author would like to thank the Deanship of Scientific Research at Shaqra University, Saudi Arabia for supporting this work.

List of symbols

Inline graphic: Input sequence
Inline graphic: Query, key, value matrices
Inline graphic: Attention weights for token Inline graphic
Inline graphic: Hallucination score for token Inline graphic
Inline graphic: Entropy function
Inline graphic: Uncertainty measure
Inline graphic: Consistency metric
Inline graphic: Hallucination score weights
Inline graphic: Hallucination detection threshold
Inline graphic: Token importance score at layer Inline graphic
Inline graphic: Target token retention ratio
Inline graphic: Teacher and student models
Inline graphic: Teacher and student logits
Inline graphic: Temperature for distillation
Inline graphic: Loss weights
Inline graphic: Quantized weights
Inline graphic: Quantisation scale factor
Inline graphic: Bit-width for quantisation
Inline graphic: Energy components
Inline graphic: Number of transformer layers
Inline graphic: Hidden dimension size
Inline graphic: Sequence length
HAAM: Hallucination-aware attention mechanism
DTP: Dynamic token pruning
AKD: Adaptive knowledge distillation
EOL: Edge optimisation layer
QA: Question answering
NLP: Natural language processing
IoT: Internet of things
FLOPs: Floating point operations

Author contributions

Conceptualization: Danah Algawiaz, Software: Danah Algawiaz, Formal analysis: Danah Algawiaz, Resources: Danah Algawiaz, Writing—review and editing: Danah Algawiaz, Funding acquisition: Danah Algawiaz.

Funding

Danah Algawiaz.

Data availability

The datasets generated and analysed during this study are publicly available at https://www.kaggle.com/datasets/thedevastator/squad2-0-a-challenge-for-question-answering-syst and https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail. A comprehensive GitHub repository containing the implementation is publicly available at https://github.com/DanahAG-R/Hall-OPT/tree/main.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Su, W. et al. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of ACL 2024, Bangkok, Thailand, 14379–14391. 10.18653/v1/2024.findings-acl.854 (2024).
  • 2. Zhou, Q. et al. Training-free transformer architecture search with zero-cost proxy guided evolution. IEEE Trans. Pattern Anal. Mach. Intell. 46(10), 6525–6541. 10.1109/TPAMI.2024.3378781 (2024).
  • 3. Xu, W., Agrawal, S., Briakou, E., Martindale, M. J. & Carpuat, M. Understanding and detecting hallucinations in neural machine translation via model introspection. Trans. Assoc. Comput. Linguist. 11, 546–564. 10.1162/tacl_a_00563 (2023).
  • 4. Chrysostomou, G., Zhao, Z., Williams, M. & Aletras, N. Investigating hallucinations in pruned large language models for abstractive summarisation. Trans. Assoc. Comput. Linguist. 12, 1163–1181. 10.1162/tacl_a_00695 (2024).
  • 5. Liu, R. et al. TransKD: Transformer knowledge distillation for efficient semantic segmentation. IEEE Trans. Intell. Transp. Syst. 10.1109/TITS.2024.3455416 (2024).
  • 6. Luo, K. et al. Efficient coordination of federated learning and inference offloading at the edge: A proactive optimization paradigm. IEEE Trans. Mob. Comput. 10.1109/TMC.2024.3466844 (2024).
  • 7. Luo, Z., Yan, H. & Pan, X. Optimizing transformer models for resource-constrained environments. J. Comput. Methods Eng. Appl. 3(1), 1–12. 10.62836/jcmea.v3i1.030107 (2023).
  • 8. Zhang, H. et al. A teacher-free graph knowledge distillation framework. IEEE Trans. Knowl. Data Eng. 36(2), 640–651. 10.1109/TKDE.2024.3374773 (2024).
  • 9. Liu, Y. et al. Reducing hallucinations of large language models via hierarchical semantic piece. Complex Intell. Syst. 11(5), 1–19. 10.1007/s40747-025-01833-9 (2025).
  • 10. Huang, C. Research on attention mechanism optimization. In AIP Conf. Proc., Vol. 3194, no. 1, 050025. 10.1063/5.0222691 (2024).
  • 11. Suwannaphong, T., Jovan, F., Craddock, I. & McConville, R. Optimising TinyML with quantization and distillation of transformer and mamba models for indoor localisation on edge devices. Sci. Rep. 15(1), 10081. 10.1038/s41598-025-94205-9 (2025).
  • 12. Paula, E., Soni, J. S., Upadhyay, H. & Lagos, L. Comparative analysis of model compression techniques for achieving carbon efficient AI. Sci. Rep. 15(1), 23461. 10.1038/s41598-025-07821-w (2025).
  • 13. Surantha, N. et al. Key considerations for real-time object recognition on edge computing devices. Appl. Sci. 15(13), 7533. 10.3390/app15137533 (2025).
  • 14. Wang, X. et al. Empowering edge intelligence: A comprehensive survey on on-device AI models. ACM Comput. Surv. 57(9), 1–39. 10.1145/3724420 (2025).
  • 15. Ren, Z. et al. Near-sensor edge computing system enabled by a CMOS compatible photonic integrated circuit platform using bilayer AlN/Si waveguides. Nano-Micro Lett. 17(1), 261. 10.1007/s40820-025-01743-y (2025).
  • 16. Papa, L., Russo, P., Amerini, I. & Zhou, L. A survey on efficient vision transformers: Algorithms, techniques, and performance benchmarking. IEEE Trans. Pattern Anal. Mach. Intell. 46(12), 7682–7700. 10.1109/TPAMI.2024.3392941 (2024).
  • 17. Gou, J. et al. Reciprocal teacher-student learning via forward and feedback knowledge distillation. IEEE Trans. Multimedia 26, 7901–7916. 10.1109/TMM.2024.3372833 (2024).
  • 18. Singh, N., Rupchandani, J. & Adhikari, M. Personalized federated learning for heterogeneous edge device: Self-knowledge distillation approach. IEEE Trans. Consum. Electron. 70(1), 4625–4632. 10.1109/TCE.2023.3327757 (2023).
  • 19. Xu, L., Ren, J., Huang, Z., Zheng, W. & Chen, Y. Improving knowledge distillation via head and tail categories. IEEE Trans. Circuits Syst. Video Technol. 34(5), 3465–3480. 10.1109/TCSVT.2023.3325814 (2023).
  • 20. Yao, D. et al. FedGKD: Toward heterogeneous federated learning via global knowledge distillation. IEEE Trans. Comput. 73(1), 3–17. 10.1109/TC.2023.3315066 (2023).
  • 21. Wu, A., Yu, J., Wang, Y. & Deng, C. Prototype-decomposed knowledge distillation for learning generalized federated representation. IEEE Trans. Multimedia. 10.1109/TMM.2024.3428352 (2024).
  • 22. Dan et al. SA-SNN: Spiking attention neural network. PeerJ Comput. Sci. 10.7717/peerj-cs.2549 (2024).
  • 23. Zhang, Q., Wei, X., Wang, Y. & Hou, C. Convolutional neural network with attention mechanism and visual vibration signal analysis for bearing fault diagnosis. Sensors 24(6), 1831. 10.3390/s24061831 (2024).
  • 24. Cheng, L. Attention mechanism models for precision medicine. Brief. Bioinform. 10.1093/bib/bbae156 (2024).
  • 25. Song et al. Efficient knowledge distillation for hybrid models. IET Cyber-Syst. Robot. 10.1049/csy2.12120 (2024).
  • 26. Arif, M. & Rashid, M. A literature review on model conversion, inference, and learning strategies in EdgeML with TinyML deployment. Comput. Mater. Contin. 10.32604/cmc.2025.062819 (2025).
  • 27. Shah, S. A. B., Rashid, M. & Arif, M. Estimating WCET using prediction models to compute fitness function of a genetic algorithm. Real Time Syst. 56(1), 28–63. 10.1007/s11241-020-09343-2 (2020).
  • 28. Rashid, M., Shah, S. A. B., Arif, M. & Kashif, M. Determination of worst-case data using an adaptive surrogate model for real-time system. J. Circuits Syst. Comput. 29(1), 2050005. 10.1142/S021812662050005X (2020).
  • 29. Tao, H., Zhang, Z., Jiang, B. & Luo, B. Learning efficient linear graph transformer via graph-attention distillation. Mach. Intell. Res. 10.1007/s11633-025-1541-9 (2025).
  • 30. Banu, S. & Deivalakshmi, S. Enhancing leaf area segmentation using attention gates. J. Telecommun. Inf. Technol. 101(3), 51–62. 10.26636/jtit.2025.3.2079 (2025).
  • 31. Wang, D. & Wang, B. Transformer-guided serial knowledge distillation for high-precision anomaly detection. IEEE Access. 10.1109/ACCESS.2025.3584892 (2025).
  • 32. Wang, W. et al. Optimizing age of information in vehicular edge computing with federated graph neural network multi-agent reinforcement learning. 10.48550/arXiv.2407.02342 (2024).
  • 33. He, J., Ji, J. & Lei, M. Spatio-temporal transformer network with physical knowledge distillation for weather forecasting. In Proc. 33rd ACM Int. Conf. Information and Knowledge Management (CIKM), 819–828. 10.1145/3627673.3679841 (2024).

