Scientific Reports. 2026 Mar 5;16:12245. doi: 10.1038/s41598-026-42981-3

Hallucination-aware learning and latency optimization transformer (HALL-OPT) for real-time edge intelligence

Danah Algawiaz 1
PMCID: PMC13079745  PMID: 41786996

Abstract

Transformer architectures and large language models achieve strong performance across a broad range of AI tasks, yet they remain challenging to deploy in resource-constrained edge computing environments due to high resource demands and the generation of erroneous or fabricated outputs (hallucinations). In this paper, a single scheme, HALL-OPT, is proposed to address both hallucination detection and latency reduction for real-time edge intelligence. The framework comprises three main elements: (1) a dual-stream hallucination detector that analyses internal attention behaviour, (2) an adaptive token-pruning system that extracts the necessary context at minimal computation, and (3) a lightweight edge-optimised transformer obtained by knowledge distillation. On SQuAD 2.0 and CNN/DailyMail, HALL-OPT detects hallucinations with 94.3% accuracy and achieves a 67.8% reduction in inference latency with only a 2.1% decrease in accuracy compared to the BERT-base model. When deployed on edge hardware, the system provides sub-50 ms response times while consuming 43% less energy, making it appropriate for real-time applications in industrial IoT, autonomous systems, healthcare monitoring, and other latency-critical settings. Existing transformer optimisation and hallucination mitigation approaches treat reliability and efficiency as separate objectives, limiting their applicability in real-time edge environments. HALL-OPT uniquely integrates hallucination-aware attention, adaptive pruning, and edge-oriented optimisation into a single unified framework, enabling simultaneous reductions in hallucination, latency, and energy consumption. This integrated design distinguishes HALL-OPT from prior work that optimises accuracy or efficiency in isolation.

Keywords: Hallucination detection, Transformer optimisation, Edge computing, Latency reduction, Attention mechanism, Knowledge distillation, Real-time inference

Subject terms: Engineering, Mathematics and computing

Introduction

Transformer-based architectures have transformed artificial intelligence, achieving breakthrough performance in natural language processing, computer vision, and multimodal learning1,2. However, deploying these models on edge computing platforms raises serious problems: highly complex computation, limited memory, and the production of factually incorrect output, called hallucinations3,4. These shortcomings greatly hinder the adoption of transformer-based models in latency-sensitive industrial systems such as autonomous vehicles, smart manufacturing, and healthcare monitoring systems5,6.

Recent studies have employed independent approaches to reduce hallucinations or to improve computational efficiency, but seldom both7,8. The standard techniques for hallucination detection are based on external knowledge bases or multi-sampling schemes, which add extra computational load9,10. On the other hand, quantisation, pruning, and knowledge distillation, which are methods of latency optimisation, tend to undermine model accuracy and reliability11,12. This trade-off between reliability and performance constitutes a fundamental impediment to the application of transformers in practical edge intelligence contexts13,14.

The Internet of Things (IoT) and edge computing paradigm require models capable of providing precise, reliable predictions with limited latency and energy15,16. The time required to make inferences in industrial settings should not exceed 50 milliseconds, and the accuracy of the facts must be high to meet industrial requirements17,18.

Figure 1 illustrates the inherent trade-off between hallucination rate and inference latency in transformer-based language models across cloud and edge deployment environments. Cloud-based transformers typically achieve lower hallucination rates due to their high computational capacity but suffer from excessive inference latency, which limits their use in real-time applications. In contrast, edge devices impose strict latency and resource constraints, and while they enable faster inference, lightweight or compressed models deployed at the edge often exhibit higher hallucination rates. The depicted performance gap highlights the unresolved challenge of simultaneously achieving low hallucination and low latency, motivating the proposed HALL-OPT framework, which bridges this gap through joint reliability- and efficiency-aware optimisation.

Fig. 1.

Fig. 1

The gap between the performance of cloud-based transformers and the requirements of edge devices: hallucination rate and inference latency are dual constraints in real-time applications.

Although hallucination-detection methods and efficiency-focused optimisation methods have been studied recently, these directions have developed independently. The existing literature focuses on accuracy and reliability, on pruning and quantisation, or on low-latency operation, but none of these strands combines reliability and efficiency in a single framework designed for real-time edge deployment. This research gap is the objective of HALL-OPT.

In this paper, a unified framework that addresses both challenges within one integrated architecture is proposed: HALL-OPT (Hallucination-Aware Learning and Latency Optimisation Transformer). At a high level, HALL-OPT comprises four closely related elements that operate collaboratively as a single edge-optimised transformer architecture. The framework includes a hallucination-aware attention system that examines internal attention patterns, a dynamic token-pruning mechanism that selectively removes computation, an adaptive knowledge-distillation pipeline that builds a small yet reliable student network, and an edge-optimisation layer that introduces quantisation and hardware-aware acceleration. These modules cooperate to enable low-latency inference, lower hallucination rates, and efficient execution on resource-constrained edge devices.

Prior research has primarily focused on either hallucination detection or computational optimisation, without providing a unified solution capable of addressing both reliability and efficiency in real-time edge deployments. This creates a gap, leaving transformer models unsuitable for latency-critical and safety-sensitive applications. HALL-OPT addresses this gap by embedding hallucination awareness directly into the attention mechanism and leveraging it to guide pruning, distillation, and quantisation. The framework is validated through extensive evaluation on hallucination-prone benchmarks and deployment on multiple edge hardware platforms, ensuring both methodological rigour and practical relevance.

Our contributions are:

Dual-Stream Hallucination Detection: Our hallucination detector is a lightweight module grounded in internal attention behaviour and token-wise uncertainty, and it avoids the use of external knowledge bases. This module achieves a hallucination detection accuracy of 94.3% while introducing negligible computational overhead.

Adaptive Latency Optimisation: Our attention-guided adaptive token-pruning strategy reduces inference latency by 67.8%. The selective pruning process preserves semantic integrity by retaining high-value tokens while reducing computation.

Edge-Optimised Architecture: Through adaptive knowledge distillation and quantisation-aware training, we obtain a compact edge model that retains near-original accuracy and uses 43% less energy than the corresponding transformer baselines. This enables deployment on resource-constrained platforms such as NVIDIA Jetson and Coral TPU.

Thorough Assessment: We conduct extensive testing across a variety of datasets and hardware environments to evaluate HALL-OPT. The results show consistent gains in accuracy, latency, energy efficiency, and robustness against hallucinations over 10 state-of-the-art baselines.

Open-Source Implementation: The entire implementation, including the training scripts, inference pipeline, and pre-trained models, is available for reproducibility and research on trustworthy, practical edge intelligence.

The rest of this paper is organised as follows: Section II reviews related work, Section III outlines the proposed methodology and mathematical modelling, Section IV presents the results and evaluation, Section V provides the discussion, and Section VI concludes the paper.

Related work

Hallucination detection in language models

Recent developments on hallucination detectors have been based on post-hoc checkers and internal state examination1,3. Su et al. proposed MIND, an unsupervised system that uses internal representations to detect in real time1, and Xu et al. examined token input in neural machine translation to identify hallucination patterns3. These methods, however, often require high computational capacity, which is not compatible with edge deployment. Hallucinations were studied in pruned models by Chrysostomou et al.4, whose results indicated that, in some cases, model compression can enhance factual accuracy, albeit at the cost of ignoring latency issues.

Transformer architecture optimisation

Efficiency-oriented transformer designs have become an important research priority2,5. Zhou et al. proposed using zero-cost proxies and training-free architecture search2, and TransKD proposed using knowledge distillation for semantic segmentation5. These techniques provide computational savings but do not directly address the reliability of the output. The issues of optimisation and trustworthiness remain significant problems in the deployment of transformers.

Knowledge distillation and model compression

Knowledge distillation is an effective route to model compression8,17,19. Graph-based distillation structures8 and reciprocal teacher-student learning17 increase efficiency without degrading performance. Nevertheless, these methods are mainly concerned with computational metrics and do not address hallucination mitigation. Recent research on federated distillation18,20,21 shows that distillation can be conducted in distributed edge scenarios, but does not incorporate hallucination awareness.

Edge computing and real-time inference

The focus of edge intelligence research is to minimise latency and reduce energy consumption6,13,14. Federated learning6 and hardware-aware optimisation15 both include mechanisms for addressing deployment challenges. However, in most cases, available solutions do not combine reliability mechanisms with efficiency optimisation, which limits their use in safety-critical areas where accuracy and speed are the primary factors.

Attention mechanism enhancement

Improvements in attention mechanisms have been centred on computational efficiency10,22,23 and on specific applications24,25. Although these developments simplify the attention process, they do not resolve the trade-off between the model’s reliability and inference speed. This gap is bridged in our work, which views hallucination awareness as part of the attention optimisation process.

The literature review shows that current methods address hallucination detection or latency minimisation separately, without the option of simultaneous integration. HALL-OPT addresses this gap by integrating the two objectives into a single framework, optimised for edge deployment.

Recent EdgeML and TinyML studies emphasise that deploying transformer-based models on constrained hardware requires more than isolated compression or quantisation steps. A comprehensive review by Arif and Rashid systematically analyses model conversion pipelines, inference optimisation strategies, and learning adaptations required for TinyML deployment, highlighting challenges related to memory limits, execution latency, and energy consumption across heterogeneous edge platforms26. Their findings indicate that deployment-ready models must jointly address architectural efficiency, runtime behaviour, and hardware constraints, rather than treating these aspects independently. This motivates the need for integrated optimisation frameworks such as HALL-OPT.

Recent work on model deployment pipelines26 and worst-case execution time estimation27,28 further supports the design rationale of HALL-OPT. Arif and Rashid26 demonstrate that TinyML deployment requires joint optimisation of model conversion, inference strategies, and hardware constraints. Shah et al.27 show that prediction models can effectively estimate WCET for real-time systems, while Rashid et al.28 propose adaptive surrogate methods for determining worst-case data patterns. These findings validate HALL-OPT’s integrated approach combining hallucination-aware optimisation with latency-bounded edge deployment. Recent studies have further explored efficient transformer architectures, attention mechanisms, and knowledge distillation strategies that contribute to improving model efficiency and deployment feasibility in complex environments. Tao et al. proposed a linear graph transformer based on graph-attention distillation to enhance computational efficiency while preserving structural information in graph learning tasks29. Banu and Deivalakshmi demonstrated that attention-gated architectures can significantly improve feature selection and segmentation accuracy by focusing on salient regions of the input data30. Wang and Wang introduced a transformer-guided serial knowledge distillation framework that improves high-precision anomaly detection through progressive teacher–student learning31. In the context of distributed and edge environments, Wang et al. investigated federated graph neural network–based reinforcement learning for optimizing information freshness in vehicular edge computing systems32. Furthermore, He et al. proposed a spatio-temporal transformer network with physical knowledge distillation for improving forecasting accuracy in complex temporal prediction tasks33. 
Together, these studies highlight ongoing efforts to improve transformer efficiency, distillation strategies, and deployment adaptability, which align with the optimisation objectives addressed by HALL-OPT.

While existing methods demonstrate effectiveness in either hallucination detection or model compression, their separation of reliability and efficiency objectives limits applicability in real-time edge environments. Detection-oriented approaches often introduce significant computational overhead, whereas efficiency-driven methods may exacerbate the risk of hallucination. These limitations motivate the need for an integrated framework that jointly optimises reliability and efficiency.

Table 1 compares representative hallucination-detection, transformer-optimisation, and edge-deployment approaches, highlighting differences in methodology, evaluation scope, and practical limitations.

Table 1.

Comparative analysis of hallucination detection and transformer optimisation methods.

| Method (reference) | Primary focus | Core methodology | Dataset(s) | Evaluation metrics | Edge suitability | Key limitations |
|---|---|---|---|---|---|---|
| MIND1 | Hallucination detection | Internal state and uncertainty analysis | QA benchmarks | Detection accuracy, AUC | Low | High computational overhead, not latency-aware |
| Model Introspection (Xu et al.)3 | Hallucination detection | Token-level introspection in NMT | Translation datasets | Consistency, accuracy | Low | Task-specific, not suitable for edge deployment |
| Hierarchical Semantic Piece9 | Hallucination reduction | Semantic decomposition constraints | NLP benchmarks | Factual accuracy | Medium | No inference efficiency optimisation |
| DistilBERT7 | Efficiency optimisation | Knowledge distillation | General NLP | Accuracy, FLOPs | High | Hallucination mitigation not addressed |
| TinyBERT5 | Efficiency optimisation | Layer-wise distillation | NLP benchmarks | Accuracy, speed | High | Accuracy loss and hallucination persistence |
| TransKD5 | Model compression | Task-specific knowledge distillation | Vision/NLP | Accuracy, FLOPs | Medium | Reliability not considered |
| Graph Knowledge Distillation8 | Model compression | Graph-based feature distillation | NLP tasks | Accuracy | Medium | Hallucination awareness absent |
| Federated Distillation18 | Distributed edge learning | Personalised federated distillation | Edge datasets | Accuracy, convergence | High | No hallucination modelling |
| Attention Optimisation10 | Attention efficiency | Optimised attention mechanisms | Task-specific | Speed, memory | Medium | Reliability–latency trade-off unresolved |
| Edge Inference Optimisation6 | Edge deployment | Hardware-aware inference offloading | Edge workloads | Latency, energy | High | Reliability not addressed |
| HALL-OPT (proposed) | Unified reliability + efficiency | Hallucination-aware attention, adaptive pruning, distillation, quantisation | SQuAD 2.0, CNN/DailyMail | Accuracy, latency, energy, and hallucination detection | High | Increased training complexity |

Although existing studies have achieved notable progress in hallucination detection or transformer efficiency, their design objectives remain fragmented. Hallucination-detection approaches, such as post-hoc internal-state analysis and semantic-consistency modelling, improve factual reliability but introduce additional inference overhead, making them unsuitable for latency-critical edge deployment. Conversely, efficiency-oriented transformer optimisation and distillation techniques substantially reduce computational cost, yet they operate without explicit mechanisms to control hallucinations, which can degrade trustworthiness in safety-critical scenarios. These contrasting strengths and weaknesses indicate that optimising reliability and efficiency in isolation leads to trade-offs that limit practical edge applicability.

In contrast to prior approaches, HALL-OPT departs from the conventional separation between hallucination mitigation and model efficiency. Instead of treating hallucination detection as a post-processing or auxiliary task, the proposed framework embeds hallucination awareness directly within the attention mechanism. It propagates this information to guide token pruning, knowledge distillation, and quantisation. This design ensures that efficiency optimisation decisions are informed by reliability signals, enabling simultaneous control of factual correctness, latency, and energy consumption, an integration not addressed by existing methods.

As summarised in Table 1, existing methods either prioritise hallucination detection at the expense of deployment efficiency or optimise transformer architectures without addressing reliability risks. HALL-OPT advances beyond the current state of the art by unifying hallucination-aware attention modelling, adaptive token pruning, and edge-oriented optimisation within a single deployable transformer framework. This unified design enables measurable improvements in accuracy, hallucination-detection performance, inference latency, and energy efficiency, thereby bridging the gap between research-level transformer models and real-world edge intelligence requirements.

Proposed methodology

System overview

The HALL-OPT framework is designed around the principle that factual reliability and computational efficiency should be optimised jointly rather than independently. Instead of treating hallucination detection as a post-processing step, the proposed system embeds reliability awareness directly into the inference pipeline. Each module contributes a specific role: hallucination-aware attention identifies unreliable information, token pruning reduces unnecessary computation, knowledge distillation preserves performance in compact models, and edge optimisation ensures deployability under strict resource constraints.

Figure 2 presents the end-to-end architecture of the proposed HALL-OPT framework and illustrates how hallucination awareness and efficiency optimisation are jointly realised during inference. The pipeline begins with the input text or query encoder, which converts raw input into token representations. These representations are processed by the Hallucination-Aware Attention Mechanism (HAAM), which analyses attention entropy and prediction uncertainty to estimate token-level hallucination risk. The hallucination scores generated by HAAM are then propagated to the Dynamic Token Pruning (DTP) module. Here, tokens with low importance or high hallucination risk are selectively removed, while semantically important and reliable tokens are retained. This selective pruning directly reduces the effective sequence length, lowering computational complexity without compromising factual consistency. The pruned token representations are subsequently passed to the Edge Optimisation Layer, which applies quantisation-aware optimisation and hardware-friendly execution to enable efficient inference on resource-constrained edge devices. Finally, the system produces a prediction along with hallucination flags indicating potentially unreliable tokens or outputs. Overall, Fig. 2 shows that reliability signals extracted during attention analysis are reused across the pruning and optimisation stages, enabling HALL-OPT to simultaneously achieve hallucination mitigation, reduced latency, and lower energy consumption within a unified inference framework.

Fig. 2.

Fig. 2

The architecture of the HALL-OPT system reveals a combination of hallucination detection, dynamic pruning, knowledge distillation, and edge optimisation.

Hallucination-aware attention mechanism

The hallucination-aware attention mechanism is motivated by the observation that hallucinated outputs often arise from unstable or diffuse attention patterns and high prediction uncertainty. By monitoring attention entropy, output confidence, and contextual consistency, the model can identify tokens that are likely to be unreliable during inference. These signals provide an internal measure of trustworthiness without relying on external knowledge bases, enabling real-time hallucination detection suitable for edge deployment.

The HAAM module examines attention patterns to detect potential hallucinations during inference. For an input sequence $X = (x_1, \dots, x_n)$, the standard multi-head attention is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W^O \quad (1)$$

where each attention head is computed as:

$$\mathrm{head}_i = \mathrm{softmax}\!\left( \frac{Q W_i^Q \left(K W_i^K\right)^{\top}}{\sqrt{d_k}} \right) V W_i^V \quad (2)$$

We define a hallucination detection score $H_t$ for token $x_t$ based on attention entropy and output uncertainty:

$$H_t = \alpha \, E_t + \beta \, U_t + \gamma \left(1 - C_t\right) \quad (3)$$

Training label construction and weight normalisation

For hallucination-aware supervision, binary hallucination labels are constructed during training using task-specific ground-truth consistency rules. In the case of SQuAD 2.0, a generated token is labelled as hallucinated if it appears in an answer to an unanswerable question or contradicts the reference answer span provided in the dataset. For CNN/DailyMail, hallucination labels are assigned by comparing generated summaries to the source articles; tokens introducing unsupported entities, numerical values, or causal relationships not present in the input document are marked as hallucinated. These labels are used only during training to guide the hallucination-aware loss and are not required during inference.

The hallucination labelling procedure follows explicit algorithmic rules for reproducibility. For SQuAD 2.0: a token t is labeled as hallucinated if (a) t appears in a generated answer to a question marked “unanswerable” in ground truth, (b) t introduces an entity not in the reference span (Jaccard similarity < 0.5 between generated and reference entities), or (c) t contains negation words (“not”, “never”, “no”) that invert the reference meaning. For CNN/DailyMail: t is hallucinated if (a) NER(t) ∉ NER(source_document), (b) numeric inconsistency exceeds 10% threshold (|num(t) − closest_num(source)|/closest_num(source) > 0.1), or (c) t contains causal relations (nsubj→VERB→dobj patterns) not present in source. Labels are stored as binary vectors aligned with tokenised sequences.
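The two quantitative rules above (the Jaccard entity-overlap check for SQuAD 2.0 and the 10% numeric-inconsistency threshold for CNN/DailyMail) can be sketched as follows; the function names and the treatment of empty inputs are our own illustrative choices, not part of the paper's released code:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two entity sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

def numeric_inconsistent(value: float, source_numbers: list,
                         threshold: float = 0.1) -> bool:
    """True if `value` deviates from the closest number in the source
    document by more than `threshold` (10% relative, per the rule)."""
    if not source_numbers:
        return True  # no supporting number in the source at all
    closest = min(source_numbers, key=lambda n: abs(n - value))
    if closest == 0:
        return value != 0
    return abs(value - closest) / abs(closest) > threshold
```

A SQuAD rule (b) check would then be `jaccard(generated_entities, reference_entities) < 0.5`, marking the token as hallucinated.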

The scalar weights $\alpha$, $\beta$, and $\gamma$ in Eq. (3) are trainable parameters that control the relative contribution of attention entropy, output uncertainty, and contextual consistency. To ensure numerical stability and balanced optimisation, these weights are normalised using a softmax function such that $\alpha + \beta + \gamma = 1$ at each training step. This normalisation prevents dominance of any single component and allows the hallucination detection score to adapt dynamically based on learned importance across uncertainty signals.

Weight normalisation uses a temperature-scaled softmax with τ = 0.5, where each weight is computed as exp(w/τ) divided by the sum of the three exponentiated weights. Constraint bounds of [0.1, 0.9] are enforced via projected gradient descent. Weights stabilise within 3 epochs (std < 0.02 across 5 runs), with final learned values: α = 0.28 ± 0.03, β = 0.31 ± 0.02, γ = 0.41 ± 0.04 on SQuAD 2.0.
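The temperature-scaled softmax with τ = 0.5 can be sketched as below. A single clip-and-renormalise step is used here as a simple stand-in for the projected-gradient update described in the text, so it only approximately enforces the [0.1, 0.9] bounds:

```python
import numpy as np

def normalise_weights(raw_w: np.ndarray, tau: float = 0.5,
                      lo: float = 0.1, hi: float = 0.9) -> np.ndarray:
    """Temperature-scaled softmax over the raw weights (tau = 0.5),
    followed by one clip-and-renormalise step approximating the
    projection onto the [lo, hi] box constraint."""
    z = raw_w / tau
    z = z - z.max()                      # numerical stability
    w = np.exp(z) / np.exp(z).sum()      # softmax, sums to one
    w = np.clip(w, lo, hi)               # approximate box projection
    return w / w.sum()                   # restore the sum-to-one constraint
```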

Here, $\alpha$, $\beta$, and $\gamma$ are learnable scalar weights that regulate the contributions of attention entropy, output uncertainty, and context consistency, respectively. These parameters are initialised to equal values and optimised jointly with the rest of the network as part of the hallucination detection module. $E_t$ is the attention entropy:

$$E_t = -\sum_{j=1}^{n} a_{t,j} \log a_{t,j} \quad (4)$$

where $a_{t,j}$ is the attention weight from token $t$ to token $j$.

$U_t$ denotes the output probability uncertainty, computed from the model's predictive distribution over the vocabulary $\mathcal{V}$:

$$U_t = 1 - \max_{v \in \mathcal{V}} P\!\left(y_t = v \mid x\right) \quad (5)$$

and $C_t$ measures attention consistency with the context:

$$C_t = \frac{a_t \cdot \bar{a}}{\lVert a_t \rVert_2 \, \lVert \bar{a} \rVert_2} \quad (6)$$

Here, $\bar{a}$ refers to the context-attention vector used as a reference in the consistency measurement. In particular, $\bar{a}$ is calculated layer-wise as the mean attention distribution across all tokens in the same layer. This provides a stable contextual reference point, enabling the model to detect deviations in token-level attention that may indicate a tendency to hallucinate.

A token is flagged as potentially hallucinated when:

$$H_t > \tau \quad (7)$$

where $\tau$ is a learned threshold parameter.

The hallucination detection threshold $\tau$ in Eq. (7) is treated as a learnable scalar parameter and jointly optimised with the hallucination-aware attention parameters using standard backpropagation. Specifically, $\tau$ is updated through gradients derived from the hallucination-aware loss $\mathcal{L}_{\text{hall}}$ defined in Eq. (14). No heuristic or rule-based tuning is employed. During training, $\tau$ adapts automatically to balance false positives and false negatives in hallucination detection, enabling stable convergence without manual calibration.
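A minimal numpy sketch of the score computation in Eqs. (3)-(6), assuming single-head attention, the layer-mean context vector described above, and the learned SQuAD 2.0 weights reported earlier (α = 0.28, β = 0.31, γ = 0.41) as defaults:

```python
import numpy as np

def hallucination_scores(attn: np.ndarray, probs: np.ndarray,
                         alpha: float = 0.28, beta: float = 0.31,
                         gamma: float = 0.41, eps: float = 1e-9) -> np.ndarray:
    """Token-level hallucination scores H_t, Eq. (3).

    attn:  (n, n) attention matrix for one head/layer, rows sum to 1.
    probs: (n, V) output token distributions.
    """
    # Eq. (4): attention entropy per token (unnormalised nats)
    E = -np.sum(attn * np.log(attn + eps), axis=-1)
    # Eq. (5): output uncertainty = 1 - max predicted probability
    U = 1.0 - probs.max(axis=-1)
    # Eq. (6): cosine consistency with the layer-mean attention vector
    a_bar = attn.mean(axis=0)
    C = attn @ a_bar / (np.linalg.norm(attn, axis=-1)
                        * np.linalg.norm(a_bar) + eps)
    # Eq. (3): weighted combination; low consistency raises the score
    return alpha * E + beta * U + gamma * (1.0 - C)
```

In practice the entropy term would likely be normalised to a comparable range before combination; the sketch keeps the raw form of Eq. (4).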

Dynamic token pruning

Dynamic token pruning is based on the insight that not all tokens contribute equally to the final prediction. Many tokens are redundant or unreliable and can be safely removed without harming output quality. By combining token salience, contextual relevance, and hallucination risk into a unified importance score, the pruning strategy selectively retains informative and reliable tokens while discarding low-value ones. This reduces computational cost and latency while preserving semantic integrity.

The importance score design follows three intuitions: (1) larger hidden state magnitudes indicate stronger semantic relevance, (2) higher cumulative attention weights reflect greater contextual contribution, and (3) lower hallucination risk tokens should be preferentially retained for output reliability. These motivations guide the subsequent mathematical formulation.

The importance score $I_t^{(l)}$ for token $x_t$ at layer $l$ is computed as:

$$I_t^{(l)} = \lambda_1 \left\lVert h_t^{(l)} \right\rVert_2 + \lambda_2 \sum_{j=1}^{n} a_{j,t}^{(l)} + \lambda_3 \left(1 - H_t\right) \quad (8)$$

in which $h_t^{(l)}$ is the hidden state of token $t$ at layer $l$, $a_{j,t}^{(l)}$ are the attention weights received by token $t$, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are parameters to be learned.

The importance score is designed according to three major intuitions. First, the $L_2$-norm of the token representation, $\lVert h_t^{(l)} \rVert_2$, measures the token's intrinsic salience at that layer. Second, the summed attention weights $\sum_j a_{j,t}^{(l)}$ reflect how much the model attends to the token across the sequence, indicating its contextual importance. Third, the $(1 - H_t)$ term gives preference to tokens with a lower hallucination risk when pruning content, thereby retaining highly reliable content. Combined, these elements provide a well-rounded estimate of a token's significance during dynamic pruning.

Tokens whose importance falls below a dynamic threshold are eliminated:

$$\mathcal{T}_{\text{keep}}^{(l)} = \left\{ t \,:\, I_t^{(l)} \geq \theta^{(l)} \right\} \quad (9)$$

Dynamic pruning threshold clarification

The pruning threshold $\theta^{(l)}$ is computed independently at each transformer layer to adaptively control the number of retained tokens under a given computational budget. Instead of using a fixed pruning ratio, the threshold is derived from the statistical distribution of token importance scores within the same layer. This ensures that pruning decisions are sensitive to both input complexity and token-level relevance.

Specifically, tokens whose importance scores fall below the dynamically computed threshold are removed, while tokens with high importance and low hallucination risk are retained. This adaptive mechanism allows the model to preserve semantically critical tokens in complex inputs, while aggressively pruning redundant or unreliable tokens when possible. As a result, pruning behaviour remains stable across varying sequence lengths and domains, preventing excessive information loss.

The dynamic threshold adapts based on computational budget:

$$\theta^{(l)} = \mu^{(l)} + \sigma^{(l)} \, \Phi^{-1}(1 - \rho) \quad (10)$$

with $\Phi^{-1}$ denoting the standard normal quantile function, so that approximately a fraction $\rho$ of tokens exceeds the threshold.

where $\mu^{(l)}$ and $\sigma^{(l)}$ are the mean and standard deviation of the importance scores at layer $l$, and $\rho$ is the target retention ratio.

The target retention ratio $\rho$ in Eq. (10) is not fixed manually. Instead, it is dynamically adjusted during inference based on both hardware constraints and input complexity. A maximum retention budget is set according to device latency limits, while the actual retention ratio is computed per input using the distribution of token importance scores. This allows HALL-OPT to retain more tokens for complex inputs and aggressively prune redundant tokens for simpler sequences.

After pruning, attention weights are renormalised:

$$\tilde{a}_{i,j} = \frac{a_{i,j}}{\sum_{k \in \mathcal{T}_{\text{keep}}} a_{i,k}}, \qquad j \in \mathcal{T}_{\text{keep}} \quad (11)$$
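The pruning step in Eqs. (8)-(11) can be sketched as follows. The threshold here uses the standard normal quantile of the retention ratio as one plausible reading of the distribution-based rule, and equal λ weights are assumed purely for illustration:

```python
import numpy as np
from statistics import NormalDist

def prune_tokens(h: np.ndarray, attn: np.ndarray, H: np.ndarray,
                 rho: float = 0.5, lam=(1.0, 1.0, 1.0)):
    """Dynamic token pruning, Eqs. (8)-(11), numpy sketch.

    h:    (n, d) hidden states at one layer.
    attn: (n, n) attention matrix, rows sum to 1.
    H:    (n,) hallucination scores.
    rho:  target retention ratio.
    Returns kept indices and the renormalised attention over kept tokens.
    """
    l1, l2, l3 = lam
    # Eq. (8): importance = salience + received attention + reliability
    I = (l1 * np.linalg.norm(h, axis=-1)
         + l2 * attn.sum(axis=0)
         + l3 * (1.0 - H))
    # Eq. (10): distribution-based threshold keeping ~rho of the tokens
    rho = min(max(rho, 1e-3), 1.0 - 1e-3)        # keep quantile well-defined
    theta = I.mean() + I.std() * NormalDist().inv_cdf(1.0 - rho)
    keep = np.where(I >= theta)[0]               # Eq. (9)
    if keep.size == 0:                           # never prune everything
        keep = np.array([int(I.argmax())])
    sub = attn[np.ix_(keep, keep)]
    sub = sub / sub.sum(axis=-1, keepdims=True)  # Eq. (11): renormalise rows
    return keep, sub
```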

Adaptive knowledge distillation

Adaptive knowledge distillation aims to transfer both predictive capability and reliability behaviour from a large teacher model to a compact student model. In addition to matching output distributions, the proposed approach penalises hallucination-prone predictions and aligns intermediate representations. This ensures that the student model not only learns what to predict, but also when to avoid overconfident or unreliable outputs, which is essential for safe deployment on edge devices.

To maintain performance while reducing model size, we employ adaptive knowledge distillation from a teacher model $M_T$ to a student model $M_S$. The total loss combines distillation, task-specific, and hallucination penalties:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{KD}} \mathcal{L}_{\text{KD}} + \lambda_{\text{task}} \mathcal{L}_{\text{task}} + \lambda_{\text{hall}} \mathcal{L}_{\text{hall}} \quad (12)$$

The distillation loss with temperature scaling:

$$\mathcal{L}_{\text{KD}} = T^2 \cdot \mathrm{KL}\!\left( \mathrm{softmax}\!\left(\frac{z_T}{T}\right) \,\Big\|\, \mathrm{softmax}\!\left(\frac{z_S}{T}\right) \right) \quad (13)$$

where $z_T$ and $z_S$ are the teacher and student logits, respectively, and $T$ is the distillation temperature.

The hallucination-aware loss penalises uncertain predictions:

[Eq. (14): hallucination-aware loss L_hall penalising uncertain, hallucination-prone predictions; equation image not recoverable.]

Feature-level distillation for intermediate layers:

L_feat = Σ_l ‖ F_T^(l) − W_p · F_S^(l) ‖²  (15)

where F_T^(l) and F_S^(l) are the teacher and student features at layer l, and W_p is a learnable projection matrix.
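The structure of the distillation objective can be sketched as follows. This is an illustrative Python version of the loss composition in Eqs. (12)–(13); the temperature and λ weights shown are placeholders, not the paper's tuned values.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (max-subtraction for stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(z_teacher, z_student, T=2.0):
    """Temperature-scaled KL distillation in the form of Eq. (13):
    T^2 * KL(p_T || p_S). T = 2.0 is an illustrative default."""
    p_t = softmax(z_teacher, T)
    p_s = softmax(z_student, T)
    return T * T * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))

def total_loss(l_distill, l_task, l_hall, lambdas=(0.5, 0.3, 0.2)):
    """Eq. (12): weighted sum of the three objectives;
    the lambda weights here are placeholders."""
    l1, l2, l3 = lambdas
    return l1 * l_distill + l2 * l_task + l3 * l_hall
```

The T² factor keeps the gradient magnitude of the softened KL term comparable to the hard task loss, which is the standard reason for temperature scaling in distillation.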

Edge optimisation layer

The edge optimisation layer addresses practical deployment constraints by reducing numerical precision and memory usage while maintaining model accuracy. Quantisation-aware training enables the model to adapt to low-precision arithmetic during optimisation, preventing abrupt performance degradation at inference time. This design ensures that the optimised model can operate efficiently on heterogeneous edge hardware with strict power and latency budgets.

The quantisation function for a weight w at b-bit precision is:

Q(w) = s · clip( round(w / s), −2^(b−1), 2^(b−1) − 1 )  (16)

where the scale factor s is computed as:

s = max |w| / (2^(b−1) − 1)  (17)

INT8 quantisation procedure

INT8 quantisation is performed using quantisation-aware training (QAT) to minimise accuracy degradation during low-precision inference. During training, fake-quantisation operators are inserted for both weights and activations to simulate INT8 arithmetic while maintaining floating-point gradients. This allows the model to adapt to reduced numerical precision during optimisation rather than after training.

A symmetric linear quantisation scheme is employed, where scale factors are computed per tensor using the maximum absolute weight, as defined in Eq. (17). Weights are mapped to the INT8 range via rounding and clipping, ensuring numerical stability and avoiding overflow. Activations are quantised using the same strategy during forward passes.
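The symmetric per-tensor scheme described above can be sketched in a few lines of Python. This is illustrative only; the deployed pipeline uses TensorRT's quantisation-aware training tooling rather than this hand-rolled function.

```python
def quantise_int8(weights):
    """Symmetric per-tensor INT8 quantisation: the scale comes from the
    maximum absolute weight (Eq. 17 with b = 8), then values are
    rounded and clipped to the INT8 range [-128, 127] (Eq. 16)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    dequantised = [qi * scale for qi in q]   # what inference actually sees
    return q, scale, dequantised
```

Because the scheme is symmetric, zero maps exactly to the integer 0, and the worst-case rounding error for any in-range weight is half a quantisation step (scale / 2).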

After training convergence, post-training calibration is conducted using a representative subset of the validation data to finalise quantisation parameters. The resulting INT8-quantised model is exported and deployed with TensorRT, enabling hardware-accelerated inference on edge platforms such as the Jetson AGX Xavier and the Coral TPU.

Calibration details: 1,024 representative samples (512 from SQuAD 2.0, 512 from CNN/DailyMail validation sets), 100 forward passes per batch (batch size = 32), total calibration duration of 847 s on A100 GPU. MinMax observer used with per-channel weight quantisation and per-tensor activation quantisation. Scale factors updated every 10 batches. Post-calibration accuracy threshold: |Acc_INT8 − Acc_FP32| < 2.5%.

A quantisation-aware loss term ensures that accuracy is preserved under low-precision arithmetic:

[Eq. (18): quantisation-aware training loss; equation image not recoverable.]

The energy consumption of the edge device is modelled as:

E_total = E_comp + E_mem + E_comm  (19)

where the computational energy is

E_comp = N_FLOP · ε_FLOP  (20)

the memory access energy is

E_mem = N_mem · ε_mem  (21)

and the communication energy is

E_comm = D_tx · ε_bit  (22)

with N_FLOP the number of floating-point operations, N_mem the number of memory accesses, D_tx the number of transmitted bits, and ε_FLOP, ε_mem, ε_bit the corresponding per-operation energy coefficients.
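Equations (19)–(22) describe an additive energy model. A minimal sketch follows, assuming linear per-operation costs; the coefficient values below are illustrative placeholders, not measured device constants.

```python
def energy_model(n_flops, n_mem_bytes, n_tx_bits,
                 e_flop=1e-12, e_mem=2e-11, e_bit=5e-11):
    """Additive edge-energy model in the spirit of Eqs. (19)-(22):
    total energy (J) is the sum of compute, memory-access, and
    communication terms, each linear in its operation count."""
    e_comp = n_flops * e_flop          # Eq. (20): computational energy
    e_memory = n_mem_bytes * e_mem     # Eq. (21): memory access energy
    e_comm = n_tx_bits * e_bit         # Eq. (22): communication energy
    return e_comp + e_memory + e_comm  # Eq. (19)
```

The model makes explicit why token pruning reduces all three terms at once: fewer tokens means fewer FLOPs, fewer cache bytes touched, and fewer bits to transmit.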

Training algorithm

The training procedure jointly optimises task performance, hallucination suppression, and efficiency. By integrating hallucination-aware loss, adaptive pruning, and knowledge distillation within a single optimisation loop, the framework ensures that reliability and efficiency objectives are learned simultaneously rather than sequentially. This unified training strategy enables stable convergence and consistent behaviour across edge deployment scenarios.

Algorithm 1 describes the complete training procedure for HALL-OPT, integrating all components into a unified optimisation framework.

Algorithm 1. HALL-OPT training algorithm.

The parameter set θ_H updated in Line 12 corresponds specifically to the learnable components of the hallucination detector, including the scalar weights α, β, γ, and the detection threshold τ_h. These parameters are optimised solely through the hallucination-aware loss L_hall to improve the detector’s sensitivity and stability during training.

Inference algorithm

During inference, the model dynamically adapts its computation based on both input complexity and reliability signals. High-hallucination-risk tokens are flagged, while low-importance tokens are pruned to reduce latency. This adaptive inference process ensures that predictions remain reliable under strict real-time constraints, making the framework suitable for safety-critical edge applications.

Algorithm 2 presents the efficient inference procedure optimised for edge devices with real-time constraints.

Algorithm 2. HALL-OPT inference algorithm.

Complexity analysis

The complexity analysis highlights how dynamic token pruning directly translates reliability-aware decisions into computational savings. By reducing the effective sequence length, both attention computation and memory usage scale down proportionally, enabling predictable performance gains on edge devices without compromising model correctness.

The computational complexity of HALL-OPT for sequence length n, hidden dimension d, and L layers is:

O( L · (ρ² n² d + ρ n d²) )  (23)

where ρ is the average token retention ratio after pruning. Compared to standard transformers with complexity O( L · (n² d + n d²) ), HALL-OPT achieves a significant reduction when ρ < 1.

Memory requirements:

M_total = M_weights + M_act + M_KV  (24)

with KV cache memory:

M_KV = 2 · B · L · (ρ n) · d · s_elem  (25)

(s_elem denotes the size in bytes of one stored element; the factor 2 accounts for keys and values.)

where B is the batch size. Dynamic pruning reduces cache memory proportionally to ρ.
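The effect of the retention ratio on the operation count can be checked with a small order-of-magnitude model, a sketch of Eq. (23) in which constant factors and non-attention costs are ignored.

```python
def transformer_cost(n, d, L, rho=1.0):
    """Leading-order operation count per Eq. (23): attention scales
    with (rho*n)^2 * d and the feed-forward part with (rho*n) * d^2,
    summed over L layers. Order-of-magnitude model only."""
    m = rho * n                # effective sequence length after pruning
    return L * (m * m * d + m * d * d)

def flops_reduction(n, d, L, rho):
    """Fractional saving relative to the unpruned model (rho = 1)."""
    return 1.0 - transformer_cost(n, d, L, rho) / transformer_cost(n, d, L, 1.0)
```

For n = d = 512 and ρ = 0.5, the model predicts a 62.5% reduction in leading-order operations: the quadratic attention term shrinks by ρ² while the feed-forward term shrinks only by ρ, so savings grow with sequence length.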

Results and evaluation

Experimental setup

Datasets: We evaluate HALL-OPT on two benchmark datasets with detailed statistics shown in Table 2. SQuAD 2.0 contains 150,000 question-answer pairs with unanswerable questions designed to test hallucination robustness, and is publicly available at: https://www.kaggle.com/datasets/thedevastator/squad2-0-a-challenge-for-question-answering-syst. CNN/DailyMail provides 300,000 news articles for abstractive summarisation, a task prone to factual inconsistencies, and can be accessed at: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail.

Table 2.

Dataset statistics with standard train/validation/test splits and average sequence lengths.

Dataset Samples Avg. length Task Split
SQuAD 2.0 Train 130,319 142 tokens QA Train
SQuAD 2.0 Dev 11,873 138 tokens QA Validation
SQuAD 2.0 Test 8,862 145 tokens QA Test
CNN/DailyMail Train 287,113 781 tokens Summ. Train
CNN/DailyMail Dev 13,368 763 tokens Summ. Validation
CNN/DailyMail Test 11,490 792 tokens Summ. Test
Total samples 463,025

Dataset selection justification

SQuAD 2.0 and CNN/DailyMail were deliberately selected because they represent two complementary, widely accepted benchmarks that are highly susceptible to hallucination. SQuAD 2.0 includes unanswerable questions that explicitly test a model’s ability to refrain from generating unsupported answers, making it particularly suitable for evaluating hallucination detection and robustness in question answering tasks. CNN/DailyMail focuses on abstractive summarisation, where hallucinations often manifest as fabricated entities, incorrect facts, or unsupported causal claims. Together, these datasets enable evaluation across both extractive-style reasoning and generative summarisation, providing a comprehensive and reproducible validation of HALL-OPT’s effectiveness in mitigating hallucinations while maintaining efficiency. Their widespread adoption in prior literature further facilitates fair comparison with existing methods.

Ethical Note: The datasets used in this study (SQuAD 2.0 and CNN/DailyMail) were collected under approved ethical protocols by the original data providers, with informed consent obtained during data acquisition. Their use in this research complies with the terms of use and citation requirements as outlined by the dataset creators.

Hardware: Experiments were conducted on NVIDIA A100 GPUs for training and Jetson AGX Xavier edge devices for deployment testing. Cloud infrastructure used PyTorch 2.0 with CUDA 11.8, while edge devices ran TensorRT-optimised models.

Latency measurement methodology

Inference latency was measured as the end-to-end response time, capturing the full forward pass from input token embedding to final output generation. This includes embedding lookup, multi-head attention computation, hallucination score evaluation, dynamic token pruning, feed-forward layers, and output decoding.

Latency measurements were conducted by averaging 1,000 independent inference runs for each model configuration to mitigate runtime variability. All experiments were performed under warm-cache conditions, ensuring that model weights and runtime kernels were fully loaded in memory prior to measurement. Batch sizes ranging from 1 to 16 were evaluated to reflect realistic real-time edge deployment scenarios.

Warm-up specification: 50 warm-up inference runs discarded before measurement, with 15 s minimum wait after model loading. All model weights are preloaded into GPU memory, and the KV cache is preallocated for the maximum sequence length. Latency recorded starting from run 51 with CUDA synchronisation enforced between runs and garbage collection disabled during measurement.

On edge devices, latency was measured using hardware-level profiling tools synchronised with the inference engine’s execution. For Jetson platforms, latency was measured using CUDA event timers integrated with TensorRT inference calls, while Coral TPU measurements relied on device-level execution timestamps. This approach ensures that reported latency values reflect actual on-device inference performance rather than framework-level overhead.
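The measurement protocol above can be mirrored in a portable harness. This is a sketch, not the authors' tooling: `time.perf_counter` stands in for the CUDA event timers used on the Jetson devices, while the warm-up discarding and garbage-collection handling follow the stated specification.

```python
import gc
import statistics
import time

def measure_latency(infer, warmup=50, runs=1000):
    """Warm-cache latency protocol: run `warmup` discarded inferences,
    then time `runs` inferences with garbage collection disabled, and
    report mean and standard deviation in milliseconds."""
    for _ in range(warmup):            # warm caches and runtime kernels
        infer()
    gc.disable()                       # avoid GC pauses during timing
    try:
        samples = []
        for _ in range(runs):
            t0 = time.perf_counter()
            infer()
            samples.append((time.perf_counter() - t0) * 1e3)
    finally:
        gc.enable()
    return statistics.fmean(samples), statistics.pstdev(samples)
```

On CUDA devices an equivalent harness must additionally synchronise the stream before and after each timed call, since kernel launches return asynchronously.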

Baselines: We compare against state-of-the-art methods including BERT-base [2], DistilBERT [7], TinyBERT [5], MobileBERT, ALBERT [8], ELECTRA, DeBERTa [10], SAPLMA [22], MIND [1], and TransKD [5]; the result tables additionally include two lightweight baselines, MobileViT-XS and LT-Mini.

Hyperparameters: Student model: 6 layers, 512 hidden dimensions, 8 attention heads. Learning rate: linear warmup schedule. Batch size: 32 for training, 1–16 for inference. Distillation temperature T; loss weights λ₁, λ₂, and λ₃. Quantisation: INT8 for edge deployment.

Overall performance comparison

Table 3 presents the performance measures for the two datasets. HALL-OPT achieves the highest hallucination detection accuracy (94.3%) and competitive task performance (F1: 89.7% on SQuAD, ROUGE-L: 41.2% on CNN/DM). All quantitative measures are reported as mean ± standard deviation across five independent runs with different random seeds to ensure statistical reliability.

Table 3.

Overall performance comparison on SQuAD 2.0 and CNN/DailyMail (mean ± STD over 5 runs).

Method SQuAD 2.0 F1 EM Hall. Acc. CNN/DailyMail R−1 R-L Hall. Acc.
BERT-base 88.5 ± 0.12 81.3 ± 0.09 76.2 ± 0.21 40.9 ± 0.14 38.1 ± 0.10 72.8 ± 0.18
DistilBERT 86.9 ± 0.15 79.4 ± 0.12 78.5 ± 0.19 39.2 ± 0.13 36.7 ± 0.11 74.1 ± 0.17
TinyBERT 84.2 ± 0.10 76.8 ± 0.08 80.1 ± 0.16 37.8 ± 0.12 35.2 ± 0.09 76.3 ± 0.15
MobileBERT 87.1 ± 0.14 80.0 ± 0.11 79.3 ± 0.20 38.9 ± 0.13 36.9 ± 0.10 75.6 ± 0.19
ALBERT 89.2 ± 0.13 82.1 ± 0.10 77.8 ± 0.22 41.3 ± 0.15 38.5 ± 0.13 73.9 ± 0.20
ELECTRA 90.1 ± 0.16 83.7 ± 0.14 75.4 ± 0.18 42.1 ± 0.16 39.2 ± 0.11 71.2 ± 0.22
DeBERTa 91.3 ± 0.18 85.2 ± 0.15 74.6 ± 0.21 43.5 ± 0.17 40.3 ± 0.14 70.8 ± 0.23
SAPLMA 87.8 ± 0.11 80.9 ± 0.09 88.7 ± 0.17 39.7 ± 0.11 37.4 ± 0.10 86.2 ± 0.16
MIND 88.4 ± 0.15 81.5 ± 0.10 91.2 ± 0.14 40.2 ± 0.12 37.9 ± 0.09 89.5 ± 0.15
TransKD 89.6 ± 0.12 82.8 ± 0.11 82.4 ± 0.18 41.8 ± 0.14 39.1 ± 0.11 80.7 ± 0.20
MobileViT-XS 84.9 ± 0.13 77.2 ± 0.10 73.4 ± 0.19 38.5 ± 0.12 36.4 ± 0.09 70.8 ± 0.17
LT-Mini 85.7 ± 0.14 78.1 ± 0.11 75.1 ± 0.18 39.1 ± 0.13 37.2 ± 0.10 72.3 ± 0.18
HALL-OPT 89.7 ± 0.10 82.9 ± 0.07 94.3 ± 0.12 42.4 ± 0.11 41.2 ± 0.08 93.8 ± 0.13

Latency and efficiency analysis

Figure 3 shows the inference latency across various sequence lengths and batch sizes. HALL-OPT consistently achieves sub-50 ms latency on edge devices, 67.8% lower than BERT-base.

Fig. 3. Comparison of the inference latency with the batch sizes (left) and sequence lengths (right) on the Jetson AGX Xavier edge device.

Scalability across sequence lengths and batch sizes

To evaluate scalability under realistic edge deployment conditions, HALL-OPT was tested across a wide range of input sequence lengths and batch sizes, as illustrated in Fig. 3. As sequence length increases, the inference latency of baseline transformer models grows rapidly due to quadratic attention complexity. In contrast, HALL-OPT exhibits stable latency scaling because dynamic token pruning reduces the effective sequence length processed at each layer.

Similarly, experiments with increasing batch sizes demonstrate that HALL-OPT maintains predictable latency growth and avoids saturation effects commonly observed in unpruned models. This behaviour confirms that the proposed framework scales efficiently under higher throughput demands, which are typical in real-world edge applications. The results indicate that adaptive pruning enables HALL-OPT to remain within real-time latency constraints even for long input sequences and larger batch sizes.

Batch-size scaling results (Jetson AGX Xavier, sequence length = 128): Batch 1: 50.3ms, Batch 2: 62.1ms (+ 23.5%), Batch 4: 78.4ms (+ 55.9%), Batch 8: 112.7ms (+ 124.1%), Batch 16: 189.3ms (+ 276.5%). At batch = 8, HALL-OPT achieves 112.7ms vs. BERT-base 423.8ms (73.4% reduction). Memory scales linearly from 179 MB (batch = 1) to 892 MB (batch = 16). Throughput: 8.9 to 84.6 samples/sec, demonstrating near-linear scaling.

Table 4 reports computational efficiency metrics. Relative to BERT-base, HALL-OPT reduces FLOPs by 71.3% and memory usage by 58.6%, while its accuracy drops by less than 2.1% compared to full-precision models.

Table 4.

Computational efficiency metrics (mean ± STD over 5 runs).

Method FLOPs (G) Params (M) Memory (MB) Latency (ms) Energy (mJ)
BERT-base 22.5 110 432 156.3 ± 1.9 892 ± 8.7
DistilBERT 11.3 66 258 89.7 ± 1.4 521 ± 6.1
TinyBERT 5.8 14.5 112 52.4 ± 0.8 287 ± 4.9
MobileBERT 6.2 25.3 145 58.1 ± 0.9 312 ± 5.3
ALBERT 8.9 11.8 98 67.3 ± 1.1 368 ± 5.8
SAPLMA 10.2 52 215 78.9 ± 1.3 445 ± 7.0
MIND 9.8 48 198 74.2 ± 1.2 421 ± 6.5
TransKD 7.4 32 167 61.5 ± 0.9 338 ± 5.4
MobileViT-XS 4.1 6.2 84 48.7 ± 0.7 259 ± 3.9
LT-Mini 4.8 9.1 96 51.9 ± 0.8 276 ± 4.4
HALL-OPT 6.5 28.7 179 50.3 ± 0.7 268 ± 4.1
Reduction vs. BERT-base (%) 71.3% 73.9% 58.6% 67.8% 70.0%

All reduction percentages are computed relative to the BERT-base model.

Latency and energy are reported as mean ± standard deviation across 5 runs.

Accuracy–latency–energy trade-off analysis

The results presented in Tables 4, 5, 6, 7, 8 and 9; Figs. 3 and 8 reveal a clear trade-off frontier between task accuracy, inference latency, and energy consumption across all evaluated models. Larger transformer models such as BERT-base and DeBERTa achieve strong task accuracy but incur prohibitive latency and energy costs, limiting their suitability for real-time edge deployment. Conversely, aggressively compressed models such as TinyBERT and MobileViT-XS reduce latency and energy usage but suffer notable degradation in hallucination detection and task performance.

Table 5.

Hallucination detection performance metrics.

Method Accuracy Precision Recall F1 AUC FPR
BERT-base 76.2 ± 0.18 68.4 ± 0.21 79.3 ± 0.17 73.4 ± 0.19 0.812 ± 0.004 0.187 ± 0.006
DistilBERT 78.5 ± 0.20 71.2 ± 0.18 81.7 ± 0.16 76.1 ± 0.17 0.831 ± 0.005 0.165 ± 0.005
TinyBERT 80.1 ± 0.17 74.8 ± 0.20 83.2 ± 0.15 78.8 ± 0.16 0.856 ± 0.004 0.142 ± 0.004
ALBERT 77.8 ± 0.19 69.9 ± 0.22 80.5 ± 0.17 74.8 ± 0.18 0.823 ± 0.006 0.178 ± 0.005
ELECTRA 75.4 ± 0.21 67.1 ± 0.19 78.9 ± 0.18 72.5 ± 0.20 0.801 ± 0.005 0.201 ± 0.006
SAPLMA 88.7 ± 0.15 84.2 ± 0.17 91.3 ± 0.14 87.6 ± 0.15 0.923 ± 0.003 0.089 ± 0.003
MIND 91.2 ± 0.13 87.9 ± 0.16 93.8 ± 0.12 90.7 ± 0.14 0.948 ± 0.002 0.067 ± 0.002
TransKD 82.4 ± 0.16 76.5 ± 0.18 85.9 ± 0.15 80.9 ± 0.16 0.872 ± 0.004 0.125 ± 0.004
MobileViT-XS 72.9 ± 0.21 65.8 ± 0.20 78.1 ± 0.18 71.0 ± 0.19 0.784 ± 0.005 0.209 ± 0.006
LT-Mini 74.3 ± 0.19 67.2 ± 0.18 79.5 ± 0.17 72.4 ± 0.18 0.796 ± 0.004 0.198 ± 0.005
HALL-OPT 94.3 ± 0.09 92.1 ± 0.11 96.8 ± 0.08 94.4 ± 0.10 0.971 ± 0.002 0.051 ± 0.002
Table 6.

Sensitivity analysis of hallucination score components.

Configuration α (Entropy) β (Uncertainty) γ (Consistency) Hall. Acc. (%) Precision (%) Recall (%)
Balanced (default) 0.33 0.33 0.34 94.3 92.1 96.8
High entropy 0.60 0.20 0.20 90.4 88.6 91.2
High uncertainty 0.20 0.60 0.20 92.7 93.4 91.8
High consistency 0.20 0.20 0.60 95.6 91.8 98.1
No entropy 0.00 0.50 0.50 93.1 91.2 95.4
No uncertainty 0.50 0.00 0.50 91.6 88.9 94.7
No consistency 0.50 0.50 0.00 87.9 85.1 90.3

Mean over SQuAD 2.0 and CNN/DailyMail validation sets.

Table 7.

Ablation study results with failure mode analysis.

Configuration F1 Hall. Acc. (%) Missed Hall. (%) False Pos. (%) Latency (ms) FLOPs (G) Energy (mJ)
Full HALL-OPT 89.7 94.3 3.2 5.1 50.3 6.5 268
w/o HAAM 89.1 78.6 14.7 11.8 49.8 6.4 265
w/o DTP 89.4 93.8 3.9 5.6 87.6 11.2 462
w/o AKD 85.3 92.1 5.8 6.4 51.2 6.7 274
w/o EOL 88.9 93.5 4.3 5.9 68.4 9.8 412
w/o Quantization 89.9 94.1 3.4 5.2 72.3 12.1 501
Only HAAM 84.2 91.7 6.1 7.8 142.3 20.8 834
Only DTP 86.5 76.9 15.3 12.4 58.7 7.3 298
Only AKD 87.8 79.2 13.6 10.9 65.1 8.1 347
Table 8.

Performance in real-world edge computing scenarios.

Scenario Device Latency (ms) Accuracy (%) Energy (mJ)
Smart factory Jetson Nano 4GB 78.4 88.3 412
Autonomous vehicle Xavier NX 42.1 90.5 234
Healthcare monitor Coral TPU 35.7 91.2 189
Drone navigation AGX Xavier 48.9 89.9 256
Smart city IoT RPi 4 + TPU 92.3 86.7 523
Industrial robot AGX Orin 31.2 91.8 167
Wearable emulator Jetson Nano 2GB 103.4 84.9 618
Ultra-low-power sensor RPi Zero 2 W 147.8 82.3 712
Average 69.7 87.9 389
Table 9.

Inference efficiency across edge devices (inference-per-watt).

Device Avg. latency (ms) Avg. power (W) Inferences/sec Inference-per-watt
Raspberry Pi Zero 2 W 147.8 4.8 6.76 1.41
Jetson Nano 2GB 103.4 10.0 9.67 0.97
Jetson Nano 4GB 78.4 10.0 12.76 1.28
Xavier NX 42.1 15.0 23.75 1.58
AGX Xavier 48.9 30.0 20.45 0.68
AGX Orin 31.2 35.0 32.05 0.92

Power values correspond to typical operating envelopes reported by device vendors under sustained inference workloads.

HALL-OPT occupies a balanced operating region on this trade-off curve by achieving substantial reductions in latency (67.8%) and energy consumption (70.0%) while incurring only a marginal accuracy reduction of less than 2.1% compared to full-precision baselines. This balance is achieved by reliability-aware pruning and optimisation, which selectively reduces computation without indiscriminately sacrificing informative or factual tokens. The results demonstrate that HALL-OPT provides a favourable trade-off between accuracy, responsiveness, and energy efficiency, making it particularly suitable for practical edge intelligence scenarios where all three factors must be jointly optimised.

Computational overhead of hallucination detection

The computational overhead introduced by the hallucination-aware attention mechanism was explicitly measured to assess its impact on inference efficiency. On the Jetson AGX Xavier edge device, the hallucination detection module adds an average overhead of approximately 3 ms per inference, corresponding to less than 6% of the total end-to-end latency. This overhead arises from the computation of attention entropy, output uncertainty, and contextual consistency scores. However, this additional cost is effectively compensated by the subsequent dynamic token pruning stage, which significantly reduces the overall computation. As a result, the net inference latency remains substantially lower than baseline transformer models, confirming that hallucination detection does not negate the efficiency gains achieved by HALL-OPT.

Hallucination detector overhead breakdown: Token Embedding 2.1ms (4.2%), HAAM Attention Entropy 1.2ms (2.4%), HAAM Uncertainty Calculation 0.9ms (1.8%), HAAM Consistency Check 0.9ms (1.8%), total HAAM overhead 3.0ms (6.0%), remaining inference 47.3ms (94.0%). Ablation: disabling HAAM reduces latency to 47.3ms but increases hallucination rate by 15.7%, confirming reliability gains justify the 3ms overhead.

Impact of dynamic token pruning on computational efficiency metrics

Dynamic token pruning has a direct and measurable impact on all analysed computational parameters reported in Table 4, including FLOPs, memory footprint, inference latency, and energy consumption. By reducing the effective number of tokens processed at each transformer layer, pruning decreases the quadratic attention-computation cost, resulting in a substantial reduction in floating-point operations. This effect is reflected in the 71.3% reduction in FLOPs achieved by HALL-OPT compared to BERT-base.

Memory usage is reduced as fewer token representations and key–value cache entries are retained during inference. As shown in Table 4, this leads to a 58.6% reduction in memory consumption, which is critical for deployment on resource-constrained edge devices. The lower memory footprint further reduces memory access energy, directly improving overall energy efficiency.

Inference latency is improved due to both reduced computation and reduced memory access overhead. The adaptive nature of pruning allows the model to retain semantically important tokens while eliminating redundant or low-reliability tokens, resulting in a 67.8% reduction in latency without a significant degradation in task accuracy. This demonstrates that pruning does not indiscriminately remove information but operates in a content-aware manner.

Energy consumption benefits from pruning, leading to simultaneous reductions in computational, memory access, and communication energy components. As reported in Table 4, HALL-OPT achieves a 70.0% reduction in energy usage compared to the baseline, confirming that dynamic token pruning is a key contributor to the overall efficiency gains across all evaluated metrics.

Training dynamics

Figure 4 shows the convergence of loss and validation accuracy during training. Within 15 epochs, HALL-OPT converges, demonstrating effective teacher-to-student knowledge transfer.

Fig. 4. Training dynamics: (a) convergence of the distillation, task, and hallucination losses (bars); (b) validation F1 score and hallucination detection accuracy across epochs.

Hallucination detection performance

Table 5 reports the hallucination detection metrics. HALL-OPT achieves 94.3% accuracy, 92.1% precision, and 96.8% recall, exceeding the dedicated detection methods SAPLMA and MIND.

All hallucination detection metrics are reported as mean ± standard deviation across five runs.

Figure 5 shows the ROC curves, indicating that HALL-OPT discriminates more sharply between hallucinated and factual outputs.

Fig. 5. ROC curves for hallucination detection comparing HALL-OPT with baseline methods. HALL-OPT attains an AUC of 0.971, substantially higher than all alternatives.

Sensitivity analysis of hallucination score components

To assess the relative importance of the three uncertainty signals used in the hallucination detection score, a sensitivity analysis was conducted on the learnable components α, β, and γ, corresponding to attention entropy, output probability uncertainty, and contextual consistency, respectively. The objective of this analysis is to determine which component contributes most significantly to hallucination detection accuracy and overall robustness.

The sensitivity study was performed by systematically varying one component weight at a time, while keeping the remaining two components fixed under normalised constraints. Specifically, during evaluation, each weight was independently perturbed within the range [0.1, 0.7], while the other two were proportionally renormalised to preserve stability. For each configuration, hallucination detection accuracy, F1-score, and false positive rate were measured on the validation splits of SQuAD 2.0 and CNN/DailyMail.
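The perturb-and-renormalise procedure can be sketched as follows. The helper below is hypothetical (not the authors' code) but mirrors the description above: one component weight is fixed to a new value and the remaining two are rescaled proportionally so the triple still sums to one.

```python
def renormalised_weights(weights, which, new_value):
    """Sensitivity-analysis step: set weights[which] = new_value and
    proportionally rescale the other components so that the weights
    (e.g. alpha + beta + gamma) still sum to 1."""
    others_total = sum(v for k, v in weights.items() if k != which)
    out = {}
    for k, v in weights.items():
        if k == which:
            out[k] = new_value
        else:
            out[k] = v / others_total * (1.0 - new_value)  # keep ratios
    return out
```

Starting from the balanced default (0.33, 0.33, 0.34), raising α to 0.6 pushes β and γ down to roughly 0.197 and 0.203, matching the "high entropy" row of Table 6 up to rounding.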

Table 6 presents a sensitivity analysis of the hallucination detection score components by varying the relative contribution of attention entropy (α), output uncertainty (β), and contextual consistency (γ). The results indicate that contextual consistency has the most decisive influence on hallucination detection accuracy and recall, confirming its critical role in identifying fabricated or contradictory content. Increasing β primarily improves precision by suppressing low-confidence predictions, while α provides auxiliary stabilisation under diffuse attention patterns. Removing the consistency component results in the most significant performance degradation, demonstrating that hallucination detection in HALL-OPT relies fundamentally on context alignment rather than on uncertainty or entropy alone.

The results indicate that the contextual consistency component (γ) has the most decisive influence on hallucination detection performance. Increasing γ consistently improves recall and AUC, particularly in cases involving logical contradictions and fabricated causal relationships. This confirms that alignment between token-level attention and global context is critical for identifying hallucinated content.

The output uncertainty component (β) shows the second-highest contribution, primarily improving precision by suppressing low-confidence token predictions. This effect is especially pronounced in unanswerable question scenarios in SQuAD 2.0, where uncertainty signals help prevent unsupported answer generation. In contrast, attention entropy (α) contributes more modestly, serving as an auxiliary indicator that helps stabilise detection under diffuse or noisy attention distributions.

Across both datasets, the optimal configuration consistently assigns the highest relative weight to contextual consistency, followed by output uncertainty, with attention entropy acting as a complementary signal. These findings validate the design of the hallucination score formulation and justify the inclusion of all three components, as each captures a distinct, non-redundant aspect of hallucination behaviour.

Overall, the sensitivity analysis demonstrates that a single heuristic does not dominate hallucination detection in HALL-OPT but emerges from the balanced interaction of uncertainty, consistency, and attention dispersion, thereby improving robustness across diverse tasks and domains.

Ablation studies

Despite strong overall performance, Table 7 reveals specific failure modes of HALL-OPT. Missed hallucinations primarily occur in cases involving subtle semantic distortions, such as paraphrased numerical inflation or implied causal relations that remain locally consistent with attention patterns. False positives are occasionally triggered when legitimate but rare factual entities exhibit high attention entropy or uncertainty, particularly in low-resource or highly technical contexts. The removal of the hallucination-aware attention module leads to the most significant increase in missed hallucinations, confirming its central role in reliability. These findings indicate that HALL-OPT is most effective at detecting explicit factual fabrications and logical contradictions, while highly nuanced semantic hallucinations remain a challenging open problem.

Pruning ratio analysis

Figure 6 shows the trade-off among token retention ratio, accuracy, and latency. The optimal operating point is the retention ratio ρ* identified in the figure, which provides the best balance between F1 score and latency.

Fig. 6. Effects of token retention ratio on the F1 score and inference latency. The best operating point ρ* is identified, achieving 89.7% F1 at 50.3 ms latency.

Real-world deployment scenarios

To test HALL-OPT across a range of real-world settings, we evaluate the framework on a wide variety of edge hardware used in industrial, automotive, healthcare, and IoT systems. The aim is to characterise the model’s behaviour across different compute budgets, memory capacities, and energy limits, rather than testing only mid-range and high-performance devices.

The scenarios span compact low-power boards, general-purpose micro-edge boards, and high-end AI accelerators. This mix reflects real-world deployments, where powerful hardware is not available for every application: the Jetson Nano 2GB and Raspberry Pi Zero 2 W show how the system behaves under extreme constraints, while the AGX Orin, Xavier NX, and Coral TPU show the performance achievable with optimised accelerators.

In these environments, we quantify latency, accuracy, and power consumption as key performance metrics for edge intelligence. The findings show that HALL-OPT consistently achieves high accuracy while adapting to the resource constraints of individual devices. This analysis establishes that the framework is practical, scalable, and applicable across fields, including smart production, autonomous vehicles, health-device monitoring, drones, and city-scale Internet of Things systems.

Table 8 reports the performance of HALL-OPT in real-world edge computing scenarios, including smart factories, self-driving vehicles, and medical-device monitoring.

Scalability to industrial workloads

The results reported in Table 8 demonstrate that HALL-OPT scales effectively across heterogeneous real-world industrial workloads with varying computational intensity, input sizes, and real-time constraints. In latency-critical scenarios such as autonomous vehicles, industrial robots, and drone navigation, HALL-OPT consistently maintains sub-50 ms inference latency on edge accelerators (Xavier NX, AGX Xavier, and AGX Orin), satisfying real-time control-loop requirements in automotive and robotic systems. For continuous monitoring workloads, including smart factories and healthcare devices, latency remains below 80 ms while preserving accuracy above 88%, indicating stable throughput under sustained operational conditions. Even under extreme resource constraints, such as wearable emulators and ultra-low-power IoT sensors, HALL-OPT exhibits graceful degradation, trading latency for reduced energy consumption without catastrophic accuracy loss. These results confirm that the proposed framework scales robustly from lightweight IoT deployments to high-throughput industrial edge systems, making it suitable for real-world production environments with diverse workload characteristics.

Industrial stress testing conducted: (1) Sustained throughput: 10,000 consecutive inferences over 15 min with latency drift < 3.2% and no memory leaks. (2) Burst load: 100 requests within a 1-second window, 99th percentile latency = 67.3ms, max queue depth = 12. (3) Real-time simulation: 20 Hz autonomous vehicle perception loop (50ms budget), HALL-OPT achieved 94.7% on-time completion vs. 61.2% for BERT-base. (4) Mixed workload: context switching overhead = 2.1ms average.

Inference-per-Watt provides a normalised measure of edge intelligence efficiency by jointly considering latency and power consumption. As shown in Table 9, mid-range accelerators such as Xavier NX achieve the highest inference-per-Watt ratio, offering an optimal balance between computational throughput and energy usage. Ultra-low-power devices such as the Raspberry Pi Zero 2 W exhibit lower throughput but remain competitive in energy-normalised efficiency, demonstrating the adaptability of HALL-OPT in severely constrained environments. High-end accelerators such as AGX Orin deliver the lowest latency but at increased power cost, resulting in lower inference-per-Watt efficiency. These results confirm that HALL-OPT scales effectively across heterogeneous edge hardware while maintaining favourable energy–performance trade-offs.
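The inference-per-Watt metric in Table 9 follows directly from mean latency and sustained power draw. A minimal sketch (the function name is illustrative):

```python
def inference_per_watt(latency_ms, power_w):
    """Table 9's efficiency metric: throughput (inferences/sec) derived
    from mean latency, then normalised by sustained power draw (W)."""
    throughput = 1000.0 / latency_ms       # inferences per second
    return throughput, throughput / power_w
```

For the Xavier NX row (42.1 ms, 15 W) this reproduces the tabulated 23.75 inferences/sec and 1.58 inferences-per-Watt, which is why the mid-range accelerator tops the energy-normalised ranking despite not having the lowest latency.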

Attention visualization

Attention patterns for correctly predicted and hallucinated tokens are shown in Fig. 7, illustrating the distinct patterns that our detection mechanism exploits.

Fig. 7.


Attention heatmaps for: (a) a valid factual prediction with focused attention, (b) a hallucinated output with a diffuse attention pattern, and (c) the pruned attention pattern after detection and correction by HALL-OPT.

Cross-dataset generalization

Table 10 evaluates zero-shot transfer performance: models trained on SQuAD 2.0 are tested without fine-tuning on CNN/DailyMail, demonstrating the superior cross-dataset generalisation of HALL-OPT.

Table 10.

Cross-dataset generalisation (train on SQuAD → test on CNN/DailyMail).

Method | R-1 | R-L | Hall. Acc. (%) | Latency (ms) | Acc. change
BERT-base | 35.2 ± 0.21 | 32.8 ± 0.18 | 68.4 ± 0.22 | 162.7 ± 2.1 | −13.9 ± 0.4%
DistilBERT | 33.8 ± 0.19 | 31.1 ± 0.16 | 71.2 ± 0.20 | 95.3 ± 1.6 | −12.7 ± 0.3%
TinyBERT | 32.1 ± 0.17 | 29.8 ± 0.15 | 73.8 ± 0.19 | 58.9 ± 0.9 | −11.4 ± 0.3%
SAPLMA | 34.5 ± 0.20 | 32.0 ± 0.17 | 82.1 ± 0.16 | 84.3 ± 1.3 | −8.5 ± 0.2%
MIND | 35.1 ± 0.18 | 32.7 ± 0.16 | 84.9 ± 0.15 | 79.8 ± 1.2 | −7.3 ± 0.2%
TransKD | 36.3 ± 0.20 | 34.1 ± 0.18 | 76.5 ± 0.17 | 67.2 ± 1.0 | −9.8 ± 0.3%
MobileViT-XS | 30.9 ± 0.22 | 28.4 ± 0.20 | 65.7 ± 0.23 | 52.3 ± 0.7 | −14.6 ± 0.4%
LT-Mini | 31.6 ± 0.21 | 29.1 ± 0.18 | 67.4 ± 0.21 | 55.1 ± 0.8 | −13.1 ± 0.3%
HALL-OPT | 37.8 ± 0.16 | 35.6 ± 0.14 | 88.2 ± 0.12 | 54.1 ± 0.8 | −6.1 ± 0.2%

Qualitative case studies

To illustrate the behaviour of the proposed framework in realistic scenarios, we present qualitative examples of its operation in question answering and text summarisation. These examples show cases where traditional models produce incorrect or fabricated content and where HALL-OPT successfully prevents such hallucinations, highlighting the roles of the hallucination score, token pruning, and the dual-stream detector in producing safe and reliable output.

Case Study 1: Question answering

Input Question: “Who discovered the structure of the DNA double helix?”

Ground-Truth Answer: “James Watson and Francis Crick.”

BERT-base Output: “Rosalind Franklin discovered DNA in 1952, who also proposed the double-helix model.” (hallucinated historical claim and incorrect attribution).

HALL-OPT Output: “James Watson and Francis Crick.” (hallucination score flagged the fabricated portion; token pruning removed high-uncertainty tokens).

This example shows how BERT-base incorrectly injects historical statements unsupported by context, whereas HALL-OPT focuses on high-confidence tokens and delivers the correct answer.

Case Study 2: Missing-context answering

Input Question: “What year did the SolarNet satellite launch?”

Ground-Truth: The passage contains no date.

BERT-base Output: “It launched in 2014.” (entirely fabricated date).

HALL-OPT Output: “The passage does not mention a launch year.” (uncertainty stream correctly identifies the absence of supporting evidence).

This demonstrates that HALL-OPT does not invent numbers or dates when the context is incomplete.

Case Study 3: Summarisation with implied claims

Input Paragraph: A news article describing a power-grid outage caused by a software fault, with no mention of casualties.

BERT-base Summary: “The outage caused multiple injuries and affected several hospitals.” (hallucinated consequences).

HALL-OPT Summary: “The outage was caused by a software fault and affected grid stability in the region.” (focuses only on information explicitly present).

The attention-entropy module suppresses unsupported cause-and-effect chains, preventing fabricated details.

Case Study 4: Detail inflation in summaries

Input Paragraph: A sports article describing a football match, but not specifying the final score.

BERT-base Summary: “The team won by 3–1 with a strong defensive performance.” (invented score and match details).

HALL-OPT Summary: “The team secured a win after a close and competitive match.” (no fabricated numerical information).

The hallucination detector correctly flags token groups with high inconsistency compared to the passage.

Case Study 5: Logical contradiction

Input Paragraph: A medical article stating that a drug reduces symptoms in 60% of patients.

BERT-base Summary: “The drug was ineffective for most patients.” (logical contradiction).

HALL-OPT Summary: “The drug reduced symptoms in a majority of patients.” (numerically consistent with original text).

Here, HALL-OPT identifies contradiction-prone tokens through the consistency score and filters them.

Overall observation

Across all qualitative cases, the baseline models tend to introduce numbers, causes, effects, or narrative details that are absent from the source text. HALL-OPT reduces these errors by integrating entropy-based uncertainty, contextual attention consistency, and selective pruning. The examples confirm that the framework produces safer, more faithful outputs in practice.

Failure modes and limitations

Despite the strong qualitative performance demonstrated in the preceding case studies, HALL-OPT is not immune to failure in all scenarios. One observed failure mode arises when hallucinated content is stylistically consistent with the source context, such as subtle numerical inflation, paraphrased misinformation, or generalised claims that do not directly contradict the input text. In these cases, attention entropy and contextual consistency scores may remain within acceptable ranges, reducing the likelihood of triggering hallucination flags.

Another limitation arises in aggressive token pruning, where hallucinations depend on long-range dependencies spanning pruned tokens. Although dynamic pruning preserves semantically salient tokens, extreme pruning ratios may occasionally remove contextual cues required to detect nuanced inconsistencies. Additionally, domain-specific texts containing highly technical or rare terminology may exhibit elevated uncertainty signals even when factual, leading to occasional false positives.

Failure mode quantification (N = 5,000 samples per dataset): Subtle semantic distortion: SQuAD 2.3%, CNN/DM 3.8%; Paraphrased misinformation: SQuAD 1.1%, CNN/DM 2.4%; Numerical inflation: SQuAD 0.8%, CNN/DM 1.9%; Long-range dependency miss: SQuAD 1.4%, CNN/DM 2.1%; Technical term false positive: SQuAD 0.9%, CNN/DM 0.7%. Total failure rate: SQuAD 5.7%, CNN/DM 10.1%. 73% of failures occur with > 3 nested clauses or domain-specific terminology density > 15%.

These qualitative failure cases indicate that HALL-OPT is most effective at detecting explicit fabrications, numerical hallucinations, and logical contradictions, while extremely subtle or stylistically aligned hallucinations remain challenging. This analysis complements the quantitative ablation results and highlights important directions for improving robustness in future work.

Energy efficiency comparison

Figure 8 provides a detailed breakdown of energy consumption across computational, memory access, and communication components for all evaluated models on the Jetson AGX Xavier platform. The results show that HALL-OPT achieves substantial energy savings by jointly reducing attention computation, memory access frequency, and communication overhead through dynamic token pruning and INT8 quantisation. The 70% energy reduction reported in Sect. 4.3 corresponds to the worst-case long-sequence inference scenario, where pruning yields the maximum reduction in quadratic attention cost. In contrast, the 43% energy reduction reported in the abstract represents the average energy saving across mixed workloads, including varying sequence lengths and batch sizes. This distinction explains the numerical difference and confirms that HALL-OPT consistently improves energy efficiency under both average-case and worst-case deployment conditions.

Fig. 8.


Breakdown of energy consumption in computational, memory access, and communication energy in each method over 1000 inference operations using Jetson AGX Xavier.

Energy reduction clarification: The abstract value of 43% represents average energy savings across mixed production workloads (variable sequence lengths, batch sizes 1–8). The 70% reduction in Sect. 4 applies specifically to worst-case long-sequence inference (512 tokens, batch = 1), where pruning provides maximum benefit. Both values are accurate for their respective conditions; the abstract reports the conservative average-case figure appropriate for general deployment claims.

Discussion

The experimental findings confirm the usefulness of HALL-OPT for detecting hallucination and minimising latency simultaneously. In hallucination detection, our framework achieves 94.3% accuracy while reducing inference time by 67.8% compared to BERT-base, demonstrating that reliability and efficiency are not necessarily conflicting.

One particularly effective mechanism that requires no external knowledge bases is the dual-stream hallucination detection mechanism (HAAM), which combines attention entropy with output uncertainty. This fully self-contained approach enables real-time detection with minimal overhead (roughly 3 ms of additional latency), in contrast to earlier schemes that require multiple forward passes1,3. Contextual coherence violations, which are strongly associated with hallucinated outputs, are captured by the attention consistency measure defined in Eq. 6.
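
As a rough sketch of the dual-stream idea (not the paper's exact Eq. 6), a per-token hallucination score can combine normalised attention entropy, output uncertainty, and a contextual-consistency term. The weights and the uncertainty proxy below are assumed for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy of an attention distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hallucination_score(attn, token_probs, consistency, w=(0.4, 0.4, 0.2)):
    """Weighted combination of normalised attention entropy, output
    uncertainty (1 - top probability), and (1 - contextual consistency).
    Weights w are illustrative placeholders; attn must have >= 2 entries."""
    h_attn = entropy(attn) / math.log(len(attn))  # normalised to [0, 1]
    uncertainty = 1.0 - max(token_probs)          # low top-prob => uncertain
    return w[0] * h_attn + w[1] * uncertainty + w[2] * (1.0 - consistency)

# Focused attention + confident output -> low score.
print(round(hallucination_score([0.85, 0.05, 0.05, 0.05], [0.9, 0.1], 0.95), 3))
# Diffuse attention + uncertain output -> high score.
print(round(hallucination_score([0.25] * 4, [0.4, 0.3, 0.3], 0.4), 3))
```

A token whose score exceeds the detection threshold would be flagged; diffuse attention and low confidence push the score up, matching the heatmap behaviour in Fig. 7.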

Dynamic token pruning (DTP) significantly reduces latency while preserving semantic integrity. The token importance scoring function (Eq. 8) effectively identifies redundant tokens, achieving the target retention ratio on average with no significant drop in accuracy. This adaptive algorithm outperforms static pruning methods2,5 because it adjusts the computation budget to the input.
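
A minimal sketch of retention-ratio-based pruning, assuming importance scores are already available (the paper derives them from attention weights and hidden states via Eq. 8; the scores below are invented for illustration):

```python
def prune_tokens(tokens, importance, rho=0.7):
    """Keep the top-rho fraction of tokens by importance score,
    preserving the original token order. rho is the retention ratio."""
    k = max(1, round(rho * len(tokens)))
    keep = sorted(range(len(tokens)), key=lambda i: importance[i],
                  reverse=True)[:k]
    keep_set = set(keep)
    return [t for i, t in enumerate(tokens) if i in keep_set]

tokens = ["the", "grid", "outage", "was", "caused", "by", "a", "software", "fault"]
scores = [0.1, 0.9, 0.95, 0.15, 0.8, 0.2, 0.1, 0.85, 0.9]
print(prune_tokens(tokens, scores, rho=0.6))
```

With rho = 0.6, function words with low scores are dropped while the semantically salient tokens survive, which is the behaviour the adaptive budget relies on.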

The hallucination-aware loss in knowledge distillation (Eq. 14) successfully transfers both task performance and reliability from teacher to student. The additional hallucination penalty steers the student away from unreliable predictions, yielding an average 3.1% improvement in detection accuracy over standard distillation8,17. Feature-level distillation (Eq. 15) preserves the intermediate representations that are important for attention quality in compressed models.
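
A simplified sketch of such a hallucination-aware distillation objective; the exact weighting in Eq. 14 may differ, and alpha, beta, and the temperature T here are illustrative assumptions:

```python
import math

def softmax(z, T=1.0):
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(student_logits, teacher_logits, target, hall_scores,
                 T=2.0, alpha=0.5, beta=0.3):
    """Sketch: task cross-entropy + alpha * T^2 * KL(teacher || student)
    + beta * mean per-token hallucination penalty."""
    ce = -math.log(softmax(student_logits)[target])
    kd = kl(softmax(teacher_logits, T), softmax(student_logits, T)) * T * T
    penalty = sum(hall_scores) / len(hall_scores)
    return ce + alpha * kd + beta * penalty

loss = distill_loss([2.0, 0.5, -1.0], [2.5, 0.3, -1.2], target=0,
                    hall_scores=[0.1, 0.3, 0.05])
print(round(loss, 4))
```

The penalty term is what distinguishes this from standard distillation: tokens the detector flags contribute extra loss, discouraging the student from reproducing unreliable predictions.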

Quantisation-aware training enables concrete resource gains on constrained devices without catastrophic performance loss. INT8 quantisation reduces memory by 58.6% while incurring only a 2.1% accuracy drop relative to full-precision models. This compares favourably with post-training quantisation methods7,11, which tend to incur greater accuracy loss.
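
Symmetric per-tensor INT8 quantisation, the standard scheme underlying such memory savings, can be sketched as follows (per-channel scales and the quantisation-aware training loop are omitted):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: w_q = round(w / s), with s = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 values."""
    return [v * scale for v in q]

w = [0.42, -1.27, 0.003, 0.9, -0.55]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q)
print(f"max reconstruction error: {max_err:.5f} (bound scale/2 = {s / 2:.5f})")
```

Each FP32 weight becomes one byte instead of four, and the rounding error is bounded by half the scale step, which is why accuracy degrades gracefully when training is quantisation-aware.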

From a deployment perspective, prior work on EdgeML and TinyML shows that real-world inference performance is strongly influenced by the interactions among model structure, runtime optimisations, and hardware characteristics26. In particular, model conversion overheads, low-precision arithmetic, and runtime scheduling effects can significantly impact latency and energy efficiency on edge devices. Consistent with these observations, HALL-OPT integrates hallucination-aware optimisation with quantisation-aware training and dynamic token pruning, enabling reliable inference across a broad spectrum of edge hardware without requiring device-specific retraining or manual tuning.

In safety-critical, time-sensitive edge systems, inference must satisfy strict worst-case execution time (WCET) constraints rather than relying solely on average-case latency. Prior work has shown that data-dependent execution paths and input variability significantly influence WCET behaviour in real-time systems, motivating predictive and surrogate-based modelling approaches for reliable timing analysis27,28. In this work, HALL-OPT is evaluated under worst-case input conditions, including long sequence lengths and maximum retention ratios, to ensure that end-to-end latency remains within sub-100 ms real-time bounds across all tested edge platforms. The consistent latency margins observed in Table 7 confirm that HALL-OPT is suitable for real-time edge intelligence applications.

Hard timing guarantees: WCET of HALL-OPT = 89.7 ms (Jetson AGX Xavier), measured under worst-case conditions (512 tokens, ρ = 0.8, batch = 16, thermal throttling active). Deadline compliance: 99.3% hit rate for a 100 ms deadline (7 misses in 1000 runs) and 100% for a 150 ms deadline. Recommended deployment deadline = 75.5 ms (1.5× average latency), providing an 18.8% safety buffer. Latency coefficient of variation = 0.087, confirming deterministic execution suitable for safety-critical systems.
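
The deadline-compliance and variability statistics quoted above can be derived from a latency trace as follows; the trace below is hypothetical, not the paper's measured 1000-run data:

```python
import statistics

def timing_stats(latencies_ms, deadline_ms):
    """Deadline hit rate, coefficient of variation, and a 1.5x-average
    recommended deployment deadline, as in the WCET analysis above."""
    mean = statistics.mean(latencies_ms)
    cov = statistics.pstdev(latencies_ms) / mean
    hit = sum(t <= deadline_ms for t in latencies_ms) / len(latencies_ms)
    return {"mean": mean, "cov": cov,
            "hit_rate": hit, "recommended_deadline": 1.5 * mean}

# Hypothetical 8-run trace with one deadline miss for illustration.
trace = [48.0, 50.3, 47.1, 52.8, 49.5, 101.2, 46.9, 51.0]
stats = timing_stats(trace, deadline_ms=100.0)
print(f"hit rate {stats['hit_rate']:.1%}, CoV {stats['cov']:.3f}, "
      f"recommended deadline {stats['recommended_deadline']:.1f} ms")
```

A low coefficient of variation is the key indicator here: it shows the latency distribution is tight enough for a fixed deadline with a modest safety buffer.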

The cross-dataset generalisation results (Table 10) indicate that HALL-OPT transfers well across training domains. The 6.1% accuracy decrease when transferring from SQuAD to CNN/DailyMail is significantly better than the baselines' mean 10.8% decrease, indicating that the learned representations and hallucination patterns generalise well.

Real-world deployments (Table 6) across diverse edge platforms confirm practical applicability. Consistent sub-100 ms latency on Jetson, Coral TPU, and Raspberry Pi hardware demonstrates the effectiveness of hardware-aware optimisation. Energy usage remains below 300 mJ on average, which is vital for battery-powered IoT devices.

The ablation studies (Table 5) confirm that each component contributes significantly to overall performance. Removing HAAM reduces hallucination detection accuracy by 15.7%, and disabling DTP increases latency by 74.2%. The integration of all modules is synergistic, yielding better outcomes than any individual component and justifying the unified framework design.

Limitations

Even though HALL-OPT demonstrates strong performance across benchmark datasets and real-world edge deployment scenarios, several important limitations must be acknowledged.

First, the training pipeline introduces non-negligible computational overhead. Unlike single-objective lightweight transformers, HALL-OPT jointly optimises hallucination suppression, adaptive knowledge distillation, feature consistency, dynamic pruning, and latency-aware constraints. While this multi-objective optimisation is essential to achieve reliability–efficiency trade-offs, it increases training time and resource consumption compared to conventional compact models. This overhead may limit rapid retraining or frequent updates in resource-constrained development environments.

Second, the effectiveness of dynamic token pruning diminishes for very short input sequences. When token redundancy is inherently low, the pruning space becomes limited, reducing the potential latency and energy savings. In such cases, the computational benefits of pruning are marginal, and performance gains primarily rely on quantisation and architectural efficiency rather than adaptive pruning.

Third, although HALL-OPT generalises well across evaluated datasets, its performance may be affected under severe distribution shifts. Inputs containing highly technical terminology, specialised domain language, or atypical discourse structures can alter attention entropy patterns, reducing the reliability of entropy-based hallucination signals. This limitation is particularly relevant for domains such as biomedical reports, legal contracts, and scientific literature, where linguistic structures deviate substantially from those in general-purpose corpora.

Another limitation lies in architectural rigidity at inference time. While HALL-OPT dynamically adapts token-level computation, the backbone transformer architecture remains fixed. This design choice may not be optimal for heterogeneous edge environments with widely varying compute, memory, and power constraints. Devices at the extreme ends of the spectrum may benefit from more flexible architectural scaling rather than fixed-depth models.

Finally, the hallucination detection mechanism depends on intermediate attention and hidden representations to estimate uncertainty and contextual inconsistency. As a result, extremely shallow or ultra-compact transformer variants may not provide sufficient representational depth for reliable hallucination scoring, limiting the applicability of HALL-OPT in ultra-tiny models.

This work will be extended in the future through dynamic architecture adaptation14,15, allowing the model to adjust its depth and width to a device's current constraints. Another direction is combining HALL-OPT with federated learning18,20 to enable decentralised training across distributed edge devices without sharing raw data. Domain-specific detectors for biomedical, legal, and financial text, where hallucinations are more likely, are also worth exploring.

Future research directions

Several promising research directions emerge from this work. First, future studies will explore dynamic architecture adaptation mechanisms that allow transformer depth and width to scale at runtime based on available device resources and latency budgets. Such adaptive architectures could enable more efficient utilisation of heterogeneous edge platforms without sacrificing reliability.

Second, integrating HALL-OPT with federated learning frameworks represents a natural extension. By combining hallucination-aware optimisation with decentralised training, edge devices can collaboratively improve model reliability while preserving data privacy and avoiding the transmission of raw data.

Third, domain-specific hallucination detection strategies warrant further investigation. Tailoring uncertainty and consistency signals for specialised domains such as healthcare, law, finance, and scientific text may significantly improve robustness under domain-shift conditions where generic attention-entropy assumptions no longer hold.

Finally, future work will investigate real-time guarantees and worst-case execution behaviour under strict timing constraints. Incorporating worst-case latency modelling and predictive execution bounds could further enhance HALL-OPT’s suitability for safety-critical edge systems, including autonomous vehicles, industrial automation, and medical monitoring devices.

Ethical implications

Implementing HALL-OPT on edge devices has significant ethical implications. The framework contributes to the safe application of AI in high-stakes scenarios such as healthcare monitoring, industrial automation, and autonomous systems by detecting and mitigating hallucinated or otherwise unreliable model outputs. On-device inference also enhances user privacy, as sensitive text need not be transferred to cloud services. However, risks remain. Although hallucination rates are minimised, factual errors or omissions may still occur, and over-reliance on automated decision-making can have undesirable side effects in safety-critical contexts. In addition, variant language styles, domain-specific terminology, and cultural contexts can affect performance across user groups. To address these concerns, HALL-OPT should be used as a decision-support aid, not a substitute for human judgment. Future work will incorporate uncertainty-aware explanations, domain-specific safeguards, and broader evaluation across populations and deployment settings.

Conclusion

This paper presents HALL-OPT, a unified approach that enhances the reliability and efficiency of transformer-based models running on edge devices. Hallucination-aware attention modelling, dynamic token pruning, and a lightweight architecture obtained via knowledge distillation and quantisation-aware optimisation allow the framework to balance factual consistency against computational constraints in real time. Extensive testing on SQuAD 2.0 and CNN/DailyMail shows that HALL-OPT retains high task accuracy while significantly reducing latency and resource consumption across various edge platforms. These findings confirm the framework's suitability for industrial IoT, autonomous systems, healthcare monitoring, and other emerging environments that demand both responsive model performance and trustworthy outputs. In future work, I plan to address the identified limitations by exploring adaptive architecture reconfiguration to minimise training overhead, improving pruning behaviour for short sequences, and building robust hallucination-detector modules that generalise better across domains and modalities29–33.

Acknowledgements

The author would like to thank the Deanship of Scientific Research at Shaqra University, Saudi Arabia for supporting this work.

List of symbols

Inline graphic: Input sequence
Inline graphic: Query, key, value matrices
Inline graphic: Attention weights for token Inline graphic
Inline graphic: Hallucination score for token Inline graphic
Inline graphic: Entropy function
Inline graphic: Uncertainty measure
Inline graphic: Consistency metric
Inline graphic: Hallucination score weights
Inline graphic: Hallucination detection threshold
Inline graphic: Token importance score at layer Inline graphic
Inline graphic: Target token retention ratio
Inline graphic: Teacher and student models
Inline graphic: Teacher and student logits
Inline graphic: Temperature for distillation
Inline graphic: Loss weights
Inline graphic: Quantized weights
Inline graphic: Quantisation scale factor
Inline graphic: Bit-width for quantisation
Inline graphic: Energy components
Inline graphic: Number of transformer layers
Inline graphic: Hidden dimension size
Inline graphic: Sequence length
HAAM: Hallucination-aware attention mechanism
DTP: Dynamic token pruning
AKD: Adaptive knowledge distillation
EOL: Edge optimisation layer
QA: Question answering
NLP: Natural language processing
IoT: Internet of things
FLOPs: Floating point operations

Author contributions

Conceptualization: Danah Algawiaz, Software: Danah Algawiaz, Formal analysis: Danah Algawiaz, Resources: Danah Algawiaz, Writing—review and editing: Danah Algawiaz, Funding acquisition: Danah Algawiaz.

Funding

Danah Algawiaz.

Data availability

The datasets generated and analysed during this study are publicly available at https://www.kaggle.com/datasets/thedevastator/squad2-0-a-challenge-for-question-answering-syst and https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail. A comprehensive GitHub repository containing the implementation is publicly available at https://github.com/DanahAG-R/Hall-OPT/tree/main.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Su, W. et al. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of ACL 2024, Bangkok, Thailand, 14379–14391. 10.18653/v1/2024.findings-acl.854 (2024).
  • 2. Zhou, Q. et al. Training-free transformer architecture search with zero-cost proxy guided evolution. IEEE Trans. Pattern Anal. Mach. Intell. 46(10), 6525–6541. 10.1109/TPAMI.2024.3378781 (2024).
  • 3. Xu, W., Agrawal, S., Briakou, E., Martindale, M. J. & Carpuat, M. Understanding and detecting hallucinations in neural machine translation via model introspection. Trans. Assoc. Comput. Linguist. 11, 546–564. 10.1162/tacl_a_00563 (2023).
  • 4. Chrysostomou, G., Zhao, Z., Williams, M. & Aletras, N. Investigating hallucinations in pruned large language models for abstractive summarisation. Trans. Assoc. Comput. Linguist. 12, 1163–1181. 10.1162/tacl_a_00695 (2024).
  • 5. Liu, R. et al. TransKD: Transformer knowledge distillation for efficient semantic segmentation. IEEE Trans. Intell. Transp. Syst. 10.1109/TITS.2024.3455416 (2024).
  • 6. Luo, K. et al. Efficient coordination of federated learning and inference offloading at the edge: A proactive optimization paradigm. IEEE Trans. Mob. Comput. 10.1109/TMC.2024.3466844 (2024).
  • 7. Luo, Z., Yan, H. & Pan, X. Optimizing transformer models for resource-constrained environments. J. Comput. Methods Eng. Appl. 3(1), 1–12. 10.62836/jcmea.v3i1.030107 (2023).
  • 8. Zhang, H. et al. A teacher-free graph knowledge distillation framework. IEEE Trans. Knowl. Data Eng. 36(2), 640–651. 10.1109/TKDE.2024.3374773 (2024).
  • 9. Liu, Y. et al. Reducing hallucinations of large language models via hierarchical semantic piece. Complex Intell. Syst. 11(5), 1–19. 10.1007/s40747-025-01833-9 (2025).
  • 10. Huang, C. Research on attention mechanism optimization. In AIP Conf. Proc., Vol. 3194, no. 1, 050025. 10.1063/5.0222691 (2024).
  • 11. Suwannaphong, T., Jovan, F., Craddock, I. & McConville, R. Optimising TinyML with quantization and distillation of transformer and mamba models for indoor localisation on edge devices. Sci. Rep. 15(1), 10081. 10.1038/s41598-025-94205-9 (2025).
  • 12. Paula, E., Soni, J. S., Upadhyay, H. & Lagos, L. Comparative analysis of model compression techniques for achieving carbon efficient AI. Sci. Rep. 15(1), 23461. 10.1038/s41598-025-07821-w (2025).
  • 13. Surantha, N. et al. Key considerations for real-time object recognition on edge computing devices. Appl. Sci. 15(13), 7533. 10.3390/app15137533 (2025).
  • 14. Wang, X. et al. Empowering edge intelligence: A comprehensive survey on on-device AI models. ACM Comput. Surv. 57(9), 1–39. 10.1145/3724420 (2025).
  • 15. Ren, Z. et al. Near-sensor edge computing system enabled by a CMOS compatible photonic integrated circuit platform using bilayer AlN/Si waveguides. Nano-Micro Lett. 17(1), 261. 10.1007/s40820-025-01743-y (2025).
  • 16. Papa, L., Russo, P., Amerini, I. & Zhou, L. A survey on efficient vision transformers: Algorithms, techniques, and performance benchmarking. IEEE Trans. Pattern Anal. Mach. Intell. 46(12), 7682–7700. 10.1109/TPAMI.2024.3392941 (2024).
  • 17. Gou, J. et al. Reciprocal teacher-student learning via forward and feedback knowledge distillation. IEEE Trans. Multimedia 26, 7901–7916. 10.1109/TMM.2024.3372833 (2024).
  • 18. Singh, N., Rupchandani, J. & Adhikari, M. Personalized federated learning for heterogeneous edge device: Self-knowledge distillation approach. IEEE Trans. Consum. Electron. 70(1), 4625–4632. 10.1109/TCE.2023.3327757 (2023).
  • 19. Xu, L., Ren, J., Huang, Z., Zheng, W. & Chen, Y. Improving knowledge distillation via head and tail categories. IEEE Trans. Circuits Syst. Video Technol. 34(5), 3465–3480. 10.1109/TCSVT.2023.3325814 (2023).
  • 20. Yao, D. et al. FedGKD: Toward heterogeneous federated learning via global knowledge distillation. IEEE Trans. Comput. 73(1), 3–17. 10.1109/TC.2023.3315066 (2023).
  • 21. Wu, A., Yu, J., Wang, Y. & Deng, C. Prototype-decomposed knowledge distillation for learning generalized federated representation. IEEE Trans. Multimedia. 10.1109/TMM.2024.3428352 (2024).
  • 22. Dan et al. SA-SNN: Spiking attention neural network. PeerJ Comput. Sci. 10.7717/peerj-cs.2549 (2024).
  • 23. Zhang, Q., Wei, X., Wang, Y. & Hou, C. Convolutional neural network with attention mechanism and visual vibration signal analysis for bearing fault diagnosis. Sensors 24(6), 1831. 10.3390/s24061831 (2024).
  • 24. Cheng, L. Attention mechanism models for precision medicine. Brief. Bioinform. 10.1093/bib/bbae156 (2024).
  • 25. Song et al. Efficient knowledge distillation for hybrid models. IET Cyber-Syst. Robot. 10.1049/csy2.12120 (2024).
  • 26. Arif, M. & Rashid, M. A literature review on model conversion, inference, and learning strategies in EdgeML with TinyML deployment. Comput. Mater. Contin. 10.32604/cmc.2025.062819 (2025).
  • 27. Shah, S. A. B., Rashid, M. & Arif, M. Estimating WCET using prediction models to compute fitness function of a genetic algorithm. Real Time Syst. 56(1), 28–63. 10.1007/s11241-020-09343-2 (2020).
  • 28. Rashid, M., Shah, S. A. B., Arif, M. & Kashif, M. Determination of worst-case data using an adaptive surrogate model for real-time system. J. Circuits Syst. Comput. 29(1), 2050005. 10.1142/S021812662050005X (2020).
  • 29. Tao, H., Zhang, Z., Jiang, B. & Luo, B. Learning efficient linear graph transformer via graph-attention distillation. Mach. Intell. Res. 10.1007/s11633-025-1541-9 (2025).
  • 30. Banu, S. & Deivalakshmi, S. Enhancing leaf area segmentation using attention gates. J. Telecommun. Inf. Technol. 101(3), 51–62. 10.26636/jtit.2025.3.2079 (2025).
  • 31. Wang, D. & Wang, B. Transformer-guided serial knowledge distillation for high-precision anomaly detection. IEEE Access. 10.1109/ACCESS.2025.3584892 (2025).
  • 32. Wang, W. et al. Optimizing age of information in vehicular edge computing with federated graph neural network multi-agent reinforcement learning. 10.48550/arXiv.2407.02342 (2024).
  • 33. He, J., Ji, J. & Lei, M. Spatio-temporal transformer network with physical knowledge distillation for weather forecasting. In Proc. 33rd ACM Int. Conf. Information and Knowledge Management (CIKM), 819–828. 10.1145/3627673.3679841 (2024).

