Abstract
Transformer architectures and large language models deliver state-of-the-art performance across a broad range of AI tasks, yet they remain challenging to deploy in resource-constrained edge computing environments due to high resource demands and the generation of erroneous or fabricated outputs (hallucinations). In this paper, a single scheme, HALL-OPT, is proposed to address both hallucination detection and latency reduction for real-time edge intelligence. The framework comprises three main elements: (1) a dual-stream hallucination detector that analyses internal attention behaviour, (2) an adaptive token-pruning system that retains the necessary context at minimal computation, and (3) a lightweight edge-optimised transformer obtained by knowledge distillation. On SQuAD 2.0 and CNN/DailyMail, HALL-OPT detects hallucinations with 94.3% accuracy and achieves a 67.8% reduction in inference latency with only a 2.1% decrease in accuracy compared to the BERT-base model. When deployed on edge hardware, the system provides sub-50 ms response times while consuming 43% less energy, making it appropriate for real-time applications in industrial IoT, autonomous systems, healthcare monitoring, and other settings where low latency is critical. Existing transformer optimisation and hallucination mitigation approaches treat reliability and efficiency as separate objectives, limiting their applicability in real-time edge environments. HALL-OPT uniquely integrates hallucination-aware attention, adaptive pruning, and edge-oriented optimisation into a single unified framework, enabling simultaneous reductions in hallucination, latency, and energy consumption. This integrated design distinguishes HALL-OPT from prior work that optimises accuracy or efficiency in isolation.
Keywords: Hallucination detection, Transformer optimisation, Edge computing, Latency reduction, Attention mechanism, Knowledge distillation, Real-time inference
Subject terms: Engineering, Mathematics and computing
Introduction
Transformer-based architectures have transformed artificial intelligence, achieving breakthrough performance in natural language processing, computer vision, and multimodal learning1,2. However, deploying these models on edge computing platforms raises serious problems: highly complex computation, limited memory, and the production of factually incorrect output, called hallucinations3,4. These shortcomings greatly hinder the adoption of transformer-based models in latency-sensitive industrial systems such as autonomous vehicles, smart manufacturing, and healthcare monitoring systems5,6.
Recent studies have addressed hallucination reduction or computational efficiency independently, but seldom both7,8. Standard techniques for hallucination detection rely on external knowledge bases or multi-sampling schemes, which add extra computational load9,10. Conversely, latency-optimisation methods such as quantisation, pruning, and knowledge distillation tend to undermine model accuracy and reliability11,12. This trade-off between reliability and performance constitutes a fundamental impediment to applying transformers in practical edge intelligence contexts13,14.
The Internet of Things (IoT) and edge computing paradigm require models capable of providing precise, reliable predictions within limited latency and energy budgets15,16. Inference times in industrial settings should not exceed 50 milliseconds, and factual accuracy must remain high to meet operational requirements17,18.
Figure 1 illustrates the inherent trade-off between hallucination rate and inference latency in transformer-based language models across cloud and edge deployment environments. Cloud-based transformers typically achieve lower hallucination rates due to their high computational capacity but suffer from excessive inference latency, which limits their use in real-time applications. In contrast, edge devices impose strict latency and resource constraints, and while they enable faster inference, lightweight or compressed models deployed at the edge often exhibit higher hallucination rates. The depicted performance gap highlights the unresolved challenge of simultaneously achieving low hallucination and low latency, motivating the proposed HALL-OPT framework, which bridges this gap through joint reliability- and efficiency-aware optimisation.
Fig. 1.
Performance gap between cloud-based transformers and edge-device requirements, with hallucination rate and inference latency as dual constraints in real-time applications.
Although hallucination-detection methods and efficiency-focused optimisation methods have been studied recently, these directions have developed independently. The existing literature focuses on accuracy and reliability, on pruning and quantisation, or on low-latency operation, but none of these efforts combine reliability and efficiency in a single framework designed for real-time edge deployment. This research gap is the objective of HALL-OPT.
In this paper, we propose HALL-OPT (Hallucination-Aware Learning and Latency Optimisation Transformer), a unified framework that addresses both challenges through an integrated architecture. At a high level, HALL-OPT comprises four closely related elements that operate collaboratively as a single edge-optimised transformer: a hallucination-aware attention mechanism that examines internal attention patterns, a dynamic token-pruning module that selectively cuts out computation, an adaptive knowledge-distillation pipeline that builds a small yet reliable student network, and an edge-optimisation layer that introduces quantisation and hardware-aware acceleration. These modules cooperate to enable low-latency inference, reduced hallucination rates, and efficient execution on resource-constrained edge devices.
Prior research has primarily focused on either hallucination detection or computational optimisation, without providing a unified solution capable of addressing both reliability and efficiency in real-time edge deployments. This gap leaves transformer models unsuitable for latency-critical and safety-sensitive applications. HALL-OPT addresses it by embedding hallucination awareness directly into the attention mechanism and leveraging it to guide pruning, distillation, and quantisation. The framework is validated through extensive evaluation on hallucination-prone benchmarks and deployment on multiple edge hardware platforms, ensuring both methodological rigour and practical relevance.
Our contributions are:
Dual-Stream Hallucination Detection: We propose a lightweight hallucination detector grounded in internal attention behaviour and token-wise uncertainty that avoids external knowledge bases. This module achieves a hallucination detection accuracy of 94.3% while introducing negligible computational overhead.
Adaptive Latency Optimisation: Our attention-guided adaptive token-pruning strategy reduces inference latency by 67.8%. Selective pruning preserves semantic integrity by retaining high-value tokens while cutting computation.
Edge-Optimised Architecture: Through adaptive knowledge distillation and quantisation-aware training, we obtain a compact edge model that retains near-original accuracy and uses 43% less energy than the corresponding transformer baselines, enabling deployment on constrained platforms such as Jetson and Coral TPU.
Thorough Assessment: We evaluate HALL-OPT extensively across diverse datasets and hardware environments. The results show consistent gains in accuracy, latency, energy efficiency, and robustness against hallucinations compared to 10 state-of-the-art baselines.
Open-Source Implementation: The entire implementation, including training scripts, the inference pipeline, and pre-trained models, is available for reproducibility and for research on trustworthy, practical edge intelligence.
The remainder of this paper is organised as follows: Section II reviews related work, Section III outlines the proposed methodology and mathematical modelling, Section IV presents the results and evaluation, Section V provides the discussion, and Section VI concludes the paper.
Related work
Hallucination detection in language models
Recent hallucination detectors have been based on post-hoc checkers and internal state examination1,3. Su et al. proposed MIND, an unsupervised system that uses internal representations to detect hallucinations in real time1, and Xu et al. examined token-level behaviour in neural machine translation to identify hallucination patterns3. These methods, however, often require high computational capacity, which is incompatible with edge deployment. Chrysostomou et al.4 studied hallucinations in pruned models, finding that model compression can in some cases enhance factual accuracy, albeit while ignoring latency issues.
Transformer architecture optimisation
Efficiency-oriented transformer designs have become an important research priority2,5. Zhou et al. proposed zero-cost proxies and training-free architecture search2, and TransKD applied knowledge distillation to semantic segmentation5. These techniques provide computational savings but do not directly address output reliability; jointly achieving optimisation and trustworthiness remains a significant challenge in transformer deployment.
Knowledge distillation and model compression
Knowledge distillation is a widely used approach to model compression8,17,19. Graph-based distillation structures8 and reciprocal teacher-student learning17 increase efficiency without sacrificing performance. Nevertheless, they are mainly concerned with computational metrics and do not account for mitigating hallucinations. The latest research on federated distillation18,20-21 shows that distillation can operate in distributed edge scenarios, but does not incorporate hallucination awareness.
Edge computing and real-time inference
The focus of edge intelligence research is to minimise latency and reduce energy consumption6,13,14. Federated learning6 and hardware-aware optimisation15 both include mechanisms for addressing deployment challenges. However, in most cases, available solutions do not combine reliability mechanisms with efficiency optimisation, which limits their use in safety-critical areas where accuracy and speed are the primary factors.
Attention mechanism enhancement
Improvements in attention mechanisms have centred on computational efficiency10,22,23 and on specific applications24,25. Although these developments simplify the attention process, they do not resolve the trade-off between model reliability and inference speed. Our work bridges this gap by treating hallucination awareness as part of the attention optimisation process.
The literature review shows that current methods address hallucination detection or latency minimisation separately, without simultaneous integration. HALL-OPT addresses this gap by combining the two objectives in a single framework optimised for edge deployment.
Recent EdgeML and TinyML studies emphasise that deploying transformer-based models on constrained hardware requires more than isolated compression or quantisation steps. A comprehensive review by Arif and Rashid systematically analyses model conversion pipelines, inference optimisation strategies, and learning adaptations required for TinyML deployment, highlighting challenges related to memory limits, execution latency, and energy consumption across heterogeneous edge platforms26. Their findings indicate that deployment-ready models must jointly address architectural efficiency, runtime behaviour, and hardware constraints, rather than treating these aspects independently. This motivates the need for integrated optimisation frameworks such as HALL-OPT.
Recent work on model deployment pipelines26 and worst-case execution time estimation27,28 further supports the design rationale of HALL-OPT. Arif and Rashid26 demonstrate that TinyML deployment requires joint optimisation of model conversion, inference strategies, and hardware constraints. Shah et al.27 show that prediction models can effectively estimate WCET for real-time systems, while Rashid et al.28 propose adaptive surrogate methods for determining worst-case data patterns. These findings validate HALL-OPT’s integrated approach combining hallucination-aware optimisation with latency-bounded edge deployment. Recent studies have further explored efficient transformer architectures, attention mechanisms, and knowledge distillation strategies that contribute to improving model efficiency and deployment feasibility in complex environments. Tao et al. proposed a linear graph transformer based on graph-attention distillation to enhance computational efficiency while preserving structural information in graph learning tasks29. Banu and Deivalakshmi demonstrated that attention-gated architectures can significantly improve feature selection and segmentation accuracy by focusing on salient regions of the input data30. Wang and Wang introduced a transformer-guided serial knowledge distillation framework that improves high-precision anomaly detection through progressive teacher–student learning31. In the context of distributed and edge environments, Wang et al. investigated federated graph neural network–based reinforcement learning for optimizing information freshness in vehicular edge computing systems32. Furthermore, He et al. proposed a spatio-temporal transformer network with physical knowledge distillation for improving forecasting accuracy in complex temporal prediction tasks33. Together, these studies highlight ongoing efforts to improve transformer efficiency, distillation strategies, and deployment adaptability, which align with the optimisation objectives addressed by HALL-OPT.
While existing methods demonstrate effectiveness in either hallucination detection or model compression, their separation of reliability and efficiency objectives limits applicability in real-time edge environments. Detection-oriented approaches often introduce significant computational overhead, whereas efficiency-driven methods may exacerbate the risk of hallucination. These limitations motivate the need for an integrated framework that jointly optimises reliability and efficiency.
Table 1 compares representative hallucination-detection, transformer-optimisation, and edge-deployment approaches, highlighting differences in methodology, evaluation scope, and practical limitations.
Table 1.
Comparative analysis of hallucination detection and transformer optimisation methods.
| Method (reference) | Primary focus | Core methodology | Dataset(s) | Evaluation metrics | Edge suitability | Key limitations |
|---|---|---|---|---|---|---|
| MIND1 | Hallucination detection | Internal state and uncertainty analysis | QA benchmarks | Detection accuracy, AUC | Low | High computational overhead, not latency-aware |
| Model Introspection (Xu et al.)3 | Hallucination detection | Token-level introspection in NMT | Translation datasets | Consistency, accuracy | Low | Task-specific, not suitable for edge deployment |
| Hierarchical Semantic Piece9 | Hallucination reduction | Semantic decomposition constraints | NLP benchmarks | Factual accuracy | Medium | No inference efficiency optimisation |
| DistilBERT7 | Efficiency optimisation | Knowledge distillation | General NLP | Accuracy, FLOPs | High | Hallucination mitigation not addressed |
| TinyBERT5 | Efficiency optimisation | Layer-wise distillation | NLP benchmarks | Accuracy, speed | High | Accuracy loss and hallucination persistence |
| TransKD5 | Model compression | Task-specific knowledge distillation | Vision/NLP | Accuracy, FLOPs | Medium | Reliability not considered |
| Graph Knowledge Distillation8 | Model compression | Graph-based feature distillation | NLP tasks | Accuracy | Medium | Hallucination awareness absent |
| Federated Distillation18 | Distributed edge learning | Personalised federated distillation | Edge datasets | Accuracy, convergence | High | No hallucination modelling |
| Attention Optimisation10 | Attention efficiency | Optimised attention mechanisms | Task-specific | Speed, memory | Medium | Reliability–latency trade-off unresolved |
| Edge Inference Optimisation6 | Edge deployment | Hardware-aware inference offloading | Edge workloads | Latency, energy | High | Reliability not addressed |
| HALL-OPT (proposed) | Unified reliability + efficiency | Hallucination-aware attention, adaptive pruning, distillation, quantisation | SQuAD 2.0, CNN/DailyMail | Accuracy, latency, energy, and hallucination detection | High | Increased training complexity |
Although existing studies have achieved notable progress in hallucination detection or transformer efficiency, their design objectives remain fragmented. Hallucination-detection approaches, such as post-hoc internal-state analysis and semantic-consistency modelling, improve factual reliability but introduce additional inference overhead, making them unsuitable for latency-critical edge deployment. Conversely, efficiency-oriented transformer optimisation and distillation techniques substantially reduce computational cost, yet they operate without explicit mechanisms to control hallucinations, which can degrade trustworthiness in safety-critical scenarios. These contrasting strengths and weaknesses indicate that optimising reliability and efficiency in isolation leads to trade-offs that limit practical edge applicability.
In contrast to prior approaches, HALL-OPT departs from the conventional separation between hallucination mitigation and model efficiency. Instead of treating hallucination detection as a post-processing or auxiliary task, the proposed framework embeds hallucination awareness directly within the attention mechanism and propagates this information to guide token pruning, knowledge distillation, and quantisation. This design ensures that efficiency optimisation decisions are informed by reliability signals, enabling simultaneous control of factual correctness, latency, and energy consumption, an integration not addressed by existing methods.
As summarised in Table 1, existing methods either prioritise hallucination detection at the expense of deployment efficiency or optimise transformer architectures without addressing reliability risks. HALL-OPT advances beyond the current state of the art by unifying hallucination-aware attention modelling, adaptive token pruning, and edge-oriented optimisation within a single deployable transformer framework. This unified design enables measurable improvements in accuracy, hallucination-detection performance, inference latency, and energy efficiency, thereby bridging the gap between research-level transformer models and real-world edge intelligence requirements.
Proposed methodology
System overview
The HALL-OPT framework is designed around the principle that factual reliability and computational efficiency should be optimised jointly rather than independently. Instead of treating hallucination detection as a post-processing step, the proposed system embeds reliability awareness directly into the inference pipeline. Each module contributes a specific role: hallucination-aware attention identifies unreliable information, token pruning reduces unnecessary computation, knowledge distillation preserves performance in compact models, and edge optimisation ensures deployability under strict resource constraints.
Figure 2 presents the end-to-end architecture of the proposed HALL-OPT framework and illustrates how hallucination awareness and efficiency optimisation are jointly realised during inference. The pipeline begins with the input text or query encoder, which converts raw input into token representations. These representations are processed by the Hallucination-Aware Attention Mechanism (HAAM), which analyses attention entropy and prediction uncertainty to estimate token-level hallucination risk. The hallucination scores generated by HAAM are then propagated to the Dynamic Token Pruning (DTP) module. Here, tokens with low importance or high hallucination risk are selectively removed, while semantically important and reliable tokens are retained. This selective pruning directly reduces the effective sequence length, lowering computational complexity without compromising factual consistency. The pruned token representations are subsequently passed to the Edge Optimisation Layer, which applies quantisation-aware optimisation and hardware-friendly execution to enable efficient inference on resource-constrained edge devices. Finally, the system produces a prediction along with hallucination flags indicating potentially unreliable tokens or outputs. Overall, Fig. 2 shows that reliability signals extracted during attention analysis are reused across the pruning and optimisation stages, enabling HALL-OPT to simultaneously achieve hallucination mitigation, reduced latency, and lower energy consumption within a unified inference framework.
Fig. 2.
The HALL-OPT system architecture, combining hallucination detection, dynamic pruning, knowledge distillation, and edge optimisation.
Hallucination-aware attention mechanism
The hallucination-aware attention mechanism is motivated by the observation that hallucinated outputs often arise from unstable or diffuse attention patterns and high prediction uncertainty. By monitoring attention entropy, output confidence, and contextual consistency, the model can identify tokens that are likely to be unreliable during inference. These signals provide an internal measure of trustworthiness without relying on external knowledge bases, enabling real-time hallucination detection suitable for edge deployment.
The HAAM module examines attention patterns to detect potential hallucinations during inference. For an input sequence $X = \{x_1, \dots, x_n\}$, the standard multi-head attention is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O \tag{1}$$

where each attention head is given by:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q W_i^Q \left(K W_i^K\right)^{\top}}{\sqrt{d_k}}\right) V W_i^V \tag{2}$$

We define a hallucination detection score $H_i$ for token $x_i$ based on attention entropy, output uncertainty, and context consistency:

$$H_i = \alpha E_i + \beta U_i + \gamma \left(1 - C_i\right) \tag{3}$$
Training label construction and weight normalisation
For hallucination-aware supervision, binary hallucination labels are constructed during training using task-specific ground-truth consistency rules. In the case of SQuAD 2.0, a generated token is labelled as hallucinated if it appears in an answer to an unanswerable question or contradicts the reference answer span provided in the dataset. For CNN/DailyMail, hallucination labels are assigned by comparing generated summaries to the source articles; tokens introducing unsupported entities, numerical values, or causal relationships not present in the input document are marked as hallucinated. These labels are used only during training to guide the hallucination-aware loss and are not required during inference.
The hallucination labelling procedure follows explicit algorithmic rules for reproducibility. For SQuAD 2.0: a token t is labeled as hallucinated if (a) t appears in a generated answer to a question marked “unanswerable” in ground truth, (b) t introduces an entity not in the reference span (Jaccard similarity < 0.5 between generated and reference entities), or (c) t contains negation words (“not”, “never”, “no”) that invert the reference meaning. For CNN/DailyMail: t is hallucinated if (a) NER(t) ∉ NER(source_document), (b) numeric inconsistency exceeds 10% threshold (|num(t) − closest_num(source)|/closest_num(source) > 0.1), or (c) t contains causal relations (nsubj→VERB→dobj patterns) not present in source. Labels are stored as binary vectors aligned with tokenised sequences.
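For illustration, the SQuAD 2.0 rules above can be expressed as a short routine. The following Python sketch is illustrative only: the helper names (`jaccard`, `NEGATION_WORDS`) are our assumptions, and the negation check is simplified relative to the meaning-inversion test described in the text.

```python
# Minimal sketch of the SQuAD 2.0 hallucination-labelling rules described above.
# Helper names and data structures are illustrative assumptions.

NEGATION_WORDS = {"not", "never", "no"}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two entity sets (0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def label_squad_token(token: str,
                      is_unanswerable: bool,
                      generated_entities: set,
                      reference_entities: set) -> int:
    """Return 1 if the token is labelled hallucinated, else 0."""
    # Rule (a): any token answering an unanswerable question is hallucinated.
    if is_unanswerable:
        return 1
    # Rule (b): entity mismatch between generated and reference answer spans.
    if token in generated_entities and jaccard(generated_entities, reference_entities) < 0.5:
        return 1
    # Rule (c): negation words (simplified: meaning-inversion check omitted).
    if token.lower() in NEGATION_WORDS:
        return 1
    return 0
```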
The scalar weights $\alpha$, $\beta$, and $\gamma$ in Eq. (3) are trainable parameters that control the relative contribution of attention entropy, output uncertainty, and contextual consistency. To ensure numerical stability and balanced optimisation, these weights are normalised using a softmax function such that $\alpha + \beta + \gamma = 1$ at each training step. This normalisation prevents dominance of any single component and allows the hallucination detection score to adapt dynamically based on learned importance across uncertainty signals.
Weight normalisation uses a temperature-scaled softmax with τ = 0.5, where each weight is computed as exp(w/τ) divided by the sum of the three exponentiated weights. Constraint bounds of [0.1, 0.9] are enforced via projected gradient descent. Weights stabilise within 3 epochs (std < 0.02 across 5 runs), with final learned values: α = 0.28 ± 0.03, β = 0.31 ± 0.02, γ = 0.41 ± 0.04 on SQuAD 2.0.
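A minimal PyTorch sketch of this normalisation step is given below, assuming the three raw weights are stored in a single learnable tensor; the `clamp` call stands in for the projected-gradient enforcement of the [0.1, 0.9] bounds and is a simplification, since simple clamping does not strictly preserve the simplex constraint.

```python
import torch

tau = 0.5  # softmax temperature stated in the text

raw_w = torch.nn.Parameter(torch.zeros(3))  # raw logits for (alpha, beta, gamma)

def normalised_weights(raw_w: torch.Tensor) -> torch.Tensor:
    """Temperature-scaled softmax so that alpha + beta + gamma = 1."""
    w = torch.softmax(raw_w / tau, dim=0)
    # Stand-in for projected gradient descent onto the [0.1, 0.9] bounds:
    return w.clamp(0.1, 0.9)
```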
Here, $\alpha$, $\beta$, and $\gamma$ are learnable scalar weights that regulate the contributions of attention entropy, output uncertainty, and context consistency, respectively. These parameters are initialised to $\alpha = \beta = \gamma = 1/3$ and optimised jointly with the rest of the model as part of the hallucination detection module. The term $E_i$ is the attention entropy:

$$E_i = -\sum_{j=1}^{n} a_{ij} \log a_{ij} \tag{4}$$

$U_i$ denotes output probability uncertainty:

$$U_i = 1 - \max_{v \in \mathcal{V}} p\left(v \mid x_i\right) \tag{5}$$

and $C_i$ measures attention consistency with context:

$$C_i = \frac{a_i \cdot \bar{a}}{\left\lVert a_i \right\rVert_2 \left\lVert \bar{a} \right\rVert_2} \tag{6}$$

Here, $\bar{a}$ refers to the context-attention vector used as a reference in the consistency measurement. In particular, $\bar{a}$ is calculated layer-wise as the average of all attention distributions, i.e., the mean attention vector across all tokens in the same layer. This provides a stable contextual reference point, enabling the model to detect deviations in token-level attention that may indicate a tendency to hallucinate.
A token is flagged as potentially hallucinated when:

$$H_i > \tau \tag{7}$$

where $\tau$ is a learned threshold parameter.
The hallucination detection threshold $\tau$ in Eq. (7) is treated as a learnable scalar parameter and jointly optimised with the hallucination-aware attention parameters using standard backpropagation. Specifically, $\tau$ is updated through gradients derived from the hallucination-aware loss $\mathcal{L}_{\text{hall}}$ defined in Eq. (14). No heuristic or rule-based tuning is employed. During training, $\tau$ adapts automatically to balance false positives and false negatives in hallucination detection, enabling stable convergence without manual calibration.
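Equations (3)–(7) can be combined into a single token-level scoring routine. The sketch below is a minimal PyTorch realisation, assuming per-token attention rows `attn` of shape [n, n] and output distributions `probs` of shape [n, V]; the exact tensor layout in the reference implementation may differ.

```python
import torch
import torch.nn.functional as F

def hallucination_scores(attn, probs, weights, eps=1e-9):
    """
    attn:    [n, n] attention distribution per token (rows sum to 1).
    probs:   [n, V] output probability distribution per token.
    weights: normalised (alpha, beta, gamma) from Eq. (3).
    Returns H: [n] hallucination scores.
    """
    alpha, beta, gamma = weights
    # Eq. (4): attention entropy per token
    # (optionally normalised by log n to match the scale of U and C).
    E = -(attn * (attn + eps).log()).sum(dim=-1)
    # Eq. (5): output uncertainty as one minus the top probability.
    U = 1.0 - probs.max(dim=-1).values
    # Eq. (6): cosine similarity to the mean (context) attention vector.
    context = attn.mean(dim=0, keepdim=True)
    C = F.cosine_similarity(attn, context, dim=-1)
    # Eq. (3): weighted combination; low consistency raises the score.
    return alpha * E + beta * U + gamma * (1.0 - C)

# Eq. (7): flag tokens whose score exceeds the learned threshold tau, e.g.
# flags = hallucination_scores(attn, probs, w) > tau
```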
Dynamic token pruning
Dynamic token pruning is based on the insight that not all tokens contribute equally to the final prediction. Many tokens are redundant or unreliable and can be safely removed without harming output quality. By combining token salience, contextual relevance, and hallucination risk into a unified importance score, the pruning strategy selectively retains informative and reliable tokens while discarding low-value ones. This reduces computational cost and latency while preserving semantic integrity.
The importance score design follows three intuitions: (1) larger hidden state magnitudes indicate stronger semantic relevance, (2) higher cumulative attention weights reflect greater contextual contribution, and (3) lower hallucination risk tokens should be preferentially retained for output reliability. These motivations guide the subsequent mathematical formulation.
The importance score $I_i^{(l)}$ for token $x_i$ at layer $l$ is computed as:

$$I_i^{(l)} = \lambda_1 \left\lVert h_i^{(l)} \right\rVert_2 + \lambda_2 \sum_{j=1}^{n} a_{ji}^{(l)} + \lambda_3 \left(1 - H_i\right) \tag{8}$$

where $h_i^{(l)}$ is the hidden state, $a_{ji}^{(l)}$ are the attention weights, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are learned parameters.
The importance score is designed around three major intuitions. First, the $L_2$-norm of the token representation, $\lVert h_i^{(l)} \rVert_2$, measures the token's innate salience at the layer. Second, the summed attention weights $\sum_j a_{ji}^{(l)}$ capture how much the model attends to the token across the sequence, indicating its contextual importance. Third, the $(1 - H_i)$ term gives preference to tokens with a lower risk of hallucination when pruning, thereby retaining highly reliable content. Combined, these elements provide a well-rounded estimate of token significance during dynamic pruning.
Tokens whose importance falls below a dynamic threshold are eliminated:

$$\mathcal{P}^{(l)} = \left\{ i \;:\; I_i^{(l)} < \theta^{(l)} \right\} \tag{9}$$
Dynamic pruning threshold clarification
The pruning threshold $\theta^{(l)}$ is computed independently at each transformer layer to adaptively control the number of retained tokens under a given computational budget. Instead of using a fixed pruning ratio, the threshold is derived from the statistical distribution of token importance scores within the same layer. This ensures that pruning decisions are sensitive to both input complexity and token-level relevance.
Specifically, tokens whose importance scores fall below the dynamically computed threshold are removed, while tokens with high importance and low hallucination risk are retained. This adaptive mechanism allows the model to preserve semantically critical tokens in complex inputs, while aggressively pruning redundant or unreliable tokens when possible. As a result, pruning behaviour remains stable across varying sequence lengths and domains, preventing excessive information loss.
The dynamic threshold adapts to the computational budget:

$$\theta^{(l)} = \mu^{(l)} + \Phi^{-1}(1 - \rho)\,\sigma^{(l)} \tag{10}$$

where $\mu^{(l)}$ and $\sigma^{(l)}$ are the mean and standard deviation of the importance scores at layer $l$, $\rho$ is the target retention ratio, and $\Phi^{-1}$ denotes the standard-normal quantile function.
The target retention ratio $\rho$ in Eq. (10) is not fixed manually. Instead, it is dynamically adjusted during inference based on both hardware constraints and input complexity. A maximum retention budget is set according to device latency limits, while the actual retention ratio is computed per input using the distribution of token importance scores. This allows HALL-OPT to retain more tokens for complex inputs and aggressively prune redundant tokens for simpler sequences.
After pruning, attention weights are renormalised over the retained tokens:

$$\tilde{a}_{ij} = \frac{a_{ij}}{\sum_{k \notin \mathcal{P}^{(l)}} a_{ik}}, \qquad j \notin \mathcal{P}^{(l)} \tag{11}$$
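Putting Eqs. (8)–(11) together, pruning amounts to scoring tokens, thresholding against the layer statistics, and renormalising the surviving attention mass. The following is a minimal sketch under these equations, using the Gaussian-quantile reading of Eq. (10); tensor shapes and helper names are assumptions, not the released implementation.

```python
import torch

def prune_tokens(h, attn, H, lambdas, rho):
    """
    h:       [n, d] hidden states at the current layer.
    attn:    [n, n] attention weights (rows sum to 1).
    H:       [n] hallucination scores from Eq. (3).
    lambdas: (l1, l2, l3) learned mixing weights.
    rho:     target retention ratio in (0, 1).
    Returns (keep_idx, attn_renorm).
    """
    l1, l2, l3 = lambdas
    # Eq. (8): salience + contextual contribution + reliability preference.
    importance = (l1 * h.norm(dim=-1)        # ||h_i||_2
                  + l2 * attn.sum(dim=0)     # total attention received
                  + l3 * (1.0 - H))          # prefer low-risk tokens
    # Eq. (10): threshold from layer statistics at the (1 - rho) quantile.
    z = torch.distributions.Normal(0.0, 1.0).icdf(torch.tensor(1.0 - rho))
    theta = importance.mean() + z * importance.std()
    # Eq. (9): keep tokens at or above the threshold.
    keep_idx = (importance >= theta).nonzero(as_tuple=True)[0]
    # Eq. (11): renormalise attention over the retained tokens.
    kept = attn[keep_idx][:, keep_idx]
    attn_renorm = kept / kept.sum(dim=-1, keepdim=True)
    return keep_idx, attn_renorm
```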
Adaptive knowledge distillation
Adaptive knowledge distillation aims to transfer both predictive capability and reliability behaviour from a large teacher model to a compact student model. In addition to matching output distributions, the proposed approach penalises hallucination-prone predictions and aligns intermediate representations. This ensures that the student model not only learns what to predict, but also when to avoid overconfident or unreliable outputs, which is essential for safe deployment on edge devices.
To maintain performance while reducing model size, we employ adaptive knowledge distillation from a teacher model $M_T$ to a student model $M_S$. The total loss combines distillation, task-specific, and hallucination penalties:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{KD}} \mathcal{L}_{\text{KD}} + \lambda_{\text{task}} \mathcal{L}_{\text{task}} + \lambda_{\text{hall}} \mathcal{L}_{\text{hall}} \tag{12}$$
The distillation loss uses temperature scaling:

$$\mathcal{L}_{\text{KD}} = T^2\,\mathrm{KL}\!\left(\mathrm{softmax}\!\left(\frac{z_T}{T}\right)\,\Big\Vert\,\mathrm{softmax}\!\left(\frac{z_S}{T}\right)\right) \tag{13}$$

where $z_T$ and $z_S$ are the teacher and student logits, respectively, and $T$ is the temperature.
The hallucination-aware loss penalises uncertain predictions:

$$\mathcal{L}_{\text{hall}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log H_i + \left(1 - y_i\right) \log\!\left(1 - H_i\right) \right] \tag{14}$$

where $y_i$ is the binary hallucination label of token $x_i$.
Feature-level distillation aligns intermediate layers:

$$\mathcal{L}_{\text{feat}} = \sum_{l} \left\lVert F_T^{(l)} - W_p F_S^{(l)} \right\rVert_2^2 \tag{15}$$

where $F_T^{(l)}$ and $F_S^{(l)}$ are the teacher and student features at layer $l$, and $W_p$ is a projection matrix.
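For clarity, the combined objective of Eqs. (12)–(15) can be written as a single function. The PyTorch sketch below assumes the binary cross-entropy reading of Eq. (14) and uses placeholder loss-weight names; it is illustrative rather than the authors' training code.

```python
import torch
import torch.nn.functional as F

def hallopt_loss(z_t, z_s, task_loss, H, y_hall,
                 feats_t, feats_s, proj, T=4.0,
                 w_kd=1.0, w_task=1.0, w_hall=1.0, w_feat=1.0):
    """
    z_t, z_s:         teacher / student logits, [n, V].
    H:                predicted hallucination scores in (0, 1), [n].
    y_hall:           binary hallucination labels, [n].
    feats_t, feats_s: per-layer teacher / student feature tensors.
    proj:             list of nn.Linear projections (W_p in Eq. 15).
    """
    # Eq. (13): temperature-scaled distillation loss, KL(teacher || student).
    kd = F.kl_div(F.log_softmax(z_s / T, dim=-1),
                  F.softmax(z_t / T, dim=-1),
                  reduction="batchmean") * T * T
    # Eq. (14): hallucination-aware loss as binary cross-entropy.
    hall = F.binary_cross_entropy(H.clamp(1e-6, 1 - 1e-6), y_hall.float())
    # Eq. (15): feature-level distillation across matched layers.
    feat = sum(F.mse_loss(p(s), t) for p, s, t in zip(proj, feats_s, feats_t))
    # Eq. (12): weighted combination with the task loss.
    return w_kd * kd + w_task * task_loss + w_hall * hall + w_feat * feat
```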
Edge optimisation layer
The edge optimisation layer addresses practical deployment constraints by reducing numerical precision and memory usage while maintaining model accuracy. Quantisation-aware training enables the model to adapt to low-precision arithmetic during optimisation, preventing abrupt performance degradation at inference time. This design ensures that the optimised model can operate efficiently on heterogeneous edge hardware with strict power and latency budgets.
The weight quantisation function for $b$ bits is:

$$Q(w) = s \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{w}{s}\right),\,-2^{b-1},\,2^{b-1}-1\right) \tag{16}$$

where the scale factor $s$ is computed as:

$$s = \frac{\max\lvert w\rvert}{2^{b-1}-1} \tag{17}$$
INT8 quantisation procedure
INT8 quantisation is performed using quantisation-aware training (QAT) to minimise accuracy degradation during low-precision inference. During training, fake-quantisation operators are inserted for both weights and activations to simulate INT8 arithmetic while maintaining floating-point gradients. This allows the model to adapt to reduced numerical precision during optimisation rather than after training.
A symmetric linear quantisation scheme is employed, where scale factors are computed per tensor using the maximum absolute weight, as defined in Eq. (17). Weights are mapped to the INT8 range via rounding and clipping, ensuring numerical stability and avoiding overflow. Activations are quantised using the same strategy during forward passes.
After training convergence, post-training calibration is conducted using a representative subset of the validation data to finalise quantisation parameters. The resulting INT8-quantised model is exported and deployed with TensorRT, enabling hardware-accelerated inference on edge platforms such as the Jetson AGX Xavier and the Coral TPU.
Calibration details: 1,024 representative samples (512 from SQuAD 2.0, 512 from CNN/DailyMail validation sets), 100 forward passes per batch (batch size = 32), total calibration duration of 847 s on A100 GPU. MinMax observer used with per-channel weight quantisation and per-tensor activation quantisation. Scale factors updated every 10 batches. Post-calibration accuracy threshold: |Acc_INT8 − Acc_FP32| < 2.5%.
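Equations (16)–(17) correspond to standard symmetric per-tensor quantisation. A minimal sketch of the fake-quantisation operator used during QAT is shown below; the straight-through estimator is a common choice assumed here rather than stated in the text.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantisation (Eqs. 16-17)."""
    qmax = 2 ** (bits - 1) - 1
    # Eq. (17): scale from the maximum absolute weight.
    s = w.abs().max() / qmax
    # Eq. (16): round, clip to the signed integer range, then dequantise.
    q = torch.clamp(torch.round(w / s), -qmax - 1, qmax)
    w_q = q * s
    # Straight-through estimator: forward uses w_q, backward sees identity.
    return w + (w_q - w).detach()
```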
A quantisation-aware loss term preserves accuracy under low-precision constraints:

$$\mathcal{L}_{\text{quant}} = \mathcal{L}_{\text{task}} + \lambda_q \left\lVert W - Q(W) \right\rVert_2^2 \tag{18}$$
The energy consumption of the edge device is modelled as:

$$E_{\text{total}} = E_{\text{comp}} + E_{\text{mem}} + E_{\text{comm}} \tag{19}$$

where the computational energy is:

$$E_{\text{comp}} = e_{\text{FLOP}} \cdot N_{\text{FLOPs}} \tag{20}$$

the memory access energy is:

$$E_{\text{mem}} = e_{\text{read}} N_{\text{read}} + e_{\text{write}} N_{\text{write}} \tag{21}$$

and the communication energy is:

$$E_{\text{comm}} = e_{\text{bit}} \cdot N_{\text{bits}} \tag{22}$$

Here, $e_{\text{FLOP}}$, $e_{\text{read}}$, $e_{\text{write}}$, and $e_{\text{bit}}$ denote per-operation energy coefficients of the target hardware.
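A direct transcription of Eqs. (19)–(22) is given below; the per-operation energy coefficients are placeholders to be calibrated for the target hardware, not measured values.

```python
def energy_mj(n_flops, n_reads, n_writes, n_comm_bits,
              e_flop=1e-9, e_read=1e-8, e_write=1e-8, e_bit=1e-9):
    """Additive energy model of Eqs. (19)-(22); coefficients are placeholders."""
    e_comp = e_flop * n_flops                         # Eq. (20)
    e_mem = e_read * n_reads + e_write * n_writes     # Eq. (21)
    e_comm = e_bit * n_comm_bits                      # Eq. (22)
    return e_comp + e_mem + e_comm                    # Eq. (19)
```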
Training algorithm
The training procedure jointly optimises task performance, hallucination suppression, and efficiency. By integrating hallucination-aware loss, adaptive pruning, and knowledge distillation within a single optimisation loop, the framework ensures that reliability and efficiency objectives are learned simultaneously rather than sequentially. This unified training strategy enables stable convergence and consistent behaviour across edge deployment scenarios.
Algorithm 1 describes the complete training procedure for HALL-OPT, integrating all components into a unified optimisation framework.
Algorithm 1.
HALL-OPT training algorithm.
The parameter set $\Theta_H$ updated in Line 12 of Algorithm 1 corresponds specifically to the learnable components of the hallucination detector, including the scalar weights $\alpha$, $\beta$, and $\gamma$ and the detection threshold $\tau$. These parameters are optimised solely through the hallucination-aware loss $\mathcal{L}_{\text{hall}}$ to improve the detector's sensitivity and stability during training.
Inference algorithm
During inference, the model dynamically adapts its computation based on both input complexity and reliability signals. High-hallucination-risk tokens are flagged, while low-importance tokens are pruned to reduce latency. This adaptive inference process ensures that predictions remain reliable under strict real-time constraints, making the framework suitable for safety-critical edge applications.
Algorithm 2 presents the efficient inference procedure optimised for edge devices with real-time constraints.
Algorithm 2.
HALL-OPT inference algorithm.
Complexity analysis
The complexity analysis highlights how dynamic token pruning directly translates reliability-aware decisions into computational savings. By reducing the effective sequence length, both attention computation and memory usage scale down proportionally, enabling predictable performance gains on edge devices without compromising model correctness.
The computational complexity of HALL-OPT for sequence length $n$, hidden dimension $d$, and $L$ layers is:

$$\mathcal{O}\!\left( L \left( \rho^2 n^2 d + \rho n d^2 \right) \right) \tag{23}$$

where $\rho$ is the average token retention ratio after pruning. Compared to standard transformers with complexity $\mathcal{O}\!\left( L \left( n^2 d + n d^2 \right) \right)$, HALL-OPT achieves a significant reduction when $\rho < 1$.
Memory requirements are:

$$M_{\text{total}} = M_{\text{weights}} + M_{\text{KV}} + M_{\text{act}} \tag{24}$$

with KV-cache memory:

$$M_{\text{KV}} = 2\,B\,L\,\rho n\,d\,b_{\text{prec}} \tag{25}$$

where $B$ is the batch size and $b_{\text{prec}}$ is the storage size per value. Dynamic pruning reduces cache memory proportionally to $\rho$.
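The effect of the retention ratio on Eqs. (23) and (25) can be estimated from the leading-order terms alone, as in the sketch below; the example dimensions and per-value byte count are illustrative.

```python
def attention_flops(n, d, L, rho=1.0):
    """Leading-order cost of Eq. (23): O(L(rho^2 n^2 d + rho n d^2))."""
    return L * ((rho * n) ** 2 * d + (rho * n) * d ** 2)

def kv_cache_bytes(B, L, n, d, rho=1.0, bytes_per_val=1):
    """Eq. (25): keys + values per layer, scaled by the retention ratio."""
    return 2 * B * L * int(rho * n) * d * bytes_per_val

# Example: 6 layers, d = 512, n = 512 tokens. At rho = 0.5 the quadratic
# attention term shrinks to a quarter of the unpruned value.
base = attention_flops(512, 512, 6)
pruned = attention_flops(512, 512, 6, rho=0.5)
print(f"FLOP ratio: {pruned / base:.2f}")  # prints 0.38 for this example
```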
Results and evaluation
Experimental setup
Datasets: We evaluate HALL-OPT on two benchmark datasets with detailed statistics shown in Table 2. SQuAD 2.0 contains 150,000 question-answer pairs with unanswerable questions designed to test hallucination robustness, and is publicly available at: https://www.kaggle.com/datasets/thedevastator/squad2-0-a-challenge-for-question-answering-syst. CNN/DailyMail provides 300,000 news articles for abstractive summarisation, a task prone to factual inconsistencies, and can be accessed at: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail.
Table 2.
Dataset statistics with standard train/validation/test splits and average sequence lengths.
| Dataset | Samples | Avg. length | Task | Split |
|---|---|---|---|---|
| SQuAD 2.0 Train | 130,319 | 142 tokens | QA | Train |
| SQuAD 2.0 Dev | 11,873 | 138 tokens | QA | Validation |
| SQuAD 2.0 Test | 8,862 | 145 tokens | QA | Test |
| CNN/DailyMail Train | 287,113 | 781 tokens | Summ. | Train |
| CNN/DailyMail Dev | 13,368 | 763 tokens | Summ. | Validation |
| CNN/DailyMail Test | 11,490 | 792 tokens | Summ. | Test |
| Total samples | 463,025 | – | – | – |
Dataset selection justification
SQuAD 2.0 and CNN/DailyMail were deliberately selected because they represent two complementary, widely accepted benchmarks that are highly susceptible to hallucination. SQuAD 2.0 includes unanswerable questions that explicitly test a model’s ability to refrain from generating unsupported answers, making it particularly suitable for evaluating hallucination detection and robustness in question answering tasks. CNN/DailyMail focuses on abstractive summarisation, where hallucinations often manifest as fabricated entities, incorrect facts, or unsupported causal claims. Together, these datasets enable evaluation across both extractive-style reasoning and generative summarisation, providing a comprehensive and reproducible validation of HALL-OPT’s effectiveness in mitigating hallucinations while maintaining efficiency. Their widespread adoption in prior literature further facilitates fair comparison with existing methods.
Ethical Note: The datasets used in this study (SQuAD 2.0 and CNN/DailyMail) were collected under approved ethical protocols by the original data providers, with informed consent obtained during data acquisition. Their use in this research complies with the terms of use and citation requirements as outlined by the dataset creators.
Hardware: Experiments were conducted on NVIDIA A100 GPUs for training and Jetson AGX Xavier edge devices for deployment testing. Cloud infrastructure used PyTorch 2.0 with CUDA 11.8, while edge devices ran TensorRT-optimised models.
Latency measurement methodology
Inference latency was measured as the end-to-end response time, capturing the full forward pass from input token embedding to final output generation. This includes embedding lookup, multi-head attention computation, hallucination score evaluation, dynamic token pruning, feed-forward layers, and output decoding.
Latency measurements were conducted by averaging 1,000 independent inference runs for each model configuration to mitigate runtime variability. All experiments were performed under warm-cache conditions, ensuring that model weights and runtime kernels were fully loaded in memory prior to measurement. Batch sizes ranging from 1 to 16 were evaluated to reflect realistic real-time edge deployment scenarios.
Warm-up specification: 50 warm-up inference runs discarded before measurement, with 15 s minimum wait after model loading. All model weights are preloaded into GPU memory, and the KV cache is preallocated for the maximum sequence length. Latency recorded starting from run 51 with CUDA synchronisation enforced between runs and garbage collection disabled during measurement.
On edge devices, latency was measured using hardware-level profiling tools synchronised with the inference engine’s execution. For Jetson platforms, latency was measured using CUDA event timers integrated with TensorRT inference calls, while Coral TPU measurements relied on device-level execution timestamps. This approach ensures that reported latency values reflect actual on-device inference performance rather than framework-level overhead.
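The protocol above corresponds to standard CUDA event timing with warm-up. A minimal sketch follows; `model` and `inputs` are placeholders, and the run counts mirror the stated configuration.

```python
import torch

def measure_latency_ms(model, inputs, warmup=50, runs=1000):
    """End-to-end GPU latency with warm-up, per the protocol above."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):              # warm-up runs, discarded
            model(**inputs)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        times = []
        for _ in range(runs):
            start.record()
            model(**inputs)
            end.record()
            torch.cuda.synchronize()         # enforce sync between runs
            times.append(start.elapsed_time(end))  # milliseconds
    return sum(times) / len(times)
```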
Baselines: We compare against ten state-of-the-art methods, including BERT-base2, DistilBERT7, TinyBERT5, MobileBERT, ALBERT8, ELECTRA, DeBERTa10, SAPLMA22, MIND1, and TransKD5.
Hyperparameters: Student model: 6 layers, 512 hidden dimensions, 8 attention heads. Learning rate schedule: linear warmup. Batch size: 32 for training, 1–16 for inference. Distillation temperature $T$ and loss weights $\lambda_{\text{KD}}$, $\lambda_{\text{task}}$, and $\lambda_{\text{hall}}$ as defined in Eqs. (12)–(13). Quantisation: INT8 for edge deployment.
Overall performance comparison
Table 3 presents the performance measures for the two datasets. HALL-OPT achieves the highest hallucination detection accuracy (94.3%) and competitive task performance (F1: 89.7% on SQuAD, ROUGE-L: 41.2% on CNN/DM). All quantitative measures are reported as mean ± standard deviation across five independent runs with different random seeds to ensure statistical reliability.
Table 3.
Overall performance comparison on SQuAD 2.0 and CNN/DailyMail (mean ± STD over 5 runs).
| Method | SQuAD 2.0 F1 | EM | Hall. Acc. | CNN/DailyMail R-1 | R-L | Hall. Acc. |
|---|---|---|---|---|---|---|
| BERT-base | 88.5 ± 0.12 | 81.3 ± 0.09 | 76.2 ± 0.21 | 40.9 ± 0.14 | 38.1 ± 0.10 | 72.8 ± 0.18 |
| DistilBERT | 86.9 ± 0.15 | 79.4 ± 0.12 | 78.5 ± 0.19 | 39.2 ± 0.13 | 36.7 ± 0.11 | 74.1 ± 0.17 |
| TinyBERT | 84.2 ± 0.10 | 76.8 ± 0.08 | 80.1 ± 0.16 | 37.8 ± 0.12 | 35.2 ± 0.09 | 76.3 ± 0.15 |
| MobileBERT | 87.1 ± 0.14 | 80.0 ± 0.11 | 79.3 ± 0.20 | 38.9 ± 0.13 | 36.9 ± 0.10 | 75.6 ± 0.19 |
| ALBERT | 89.2 ± 0.13 | 82.1 ± 0.10 | 77.8 ± 0.22 | 41.3 ± 0.15 | 38.5 ± 0.13 | 73.9 ± 0.20 |
| ELECTRA | 90.1 ± 0.16 | 83.7 ± 0.14 | 75.4 ± 0.18 | 42.1 ± 0.16 | 39.2 ± 0.11 | 71.2 ± 0.22 |
| DeBERTa | 91.3 ± 0.18 | 85.2 ± 0.15 | 74.6 ± 0.21 | 43.5 ± 0.17 | 40.3 ± 0.14 | 70.8 ± 0.23 |
| SAPLMA | 87.8 ± 0.11 | 80.9 ± 0.09 | 88.7 ± 0.17 | 39.7 ± 0.11 | 37.4 ± 0.10 | 86.2 ± 0.16 |
| MIND | 88.4 ± 0.15 | 81.5 ± 0.10 | 91.2 ± 0.14 | 40.2 ± 0.12 | 37.9 ± 0.09 | 89.5 ± 0.15 |
| TransKD | 89.6 ± 0.12 | 82.8 ± 0.11 | 82.4 ± 0.18 | 41.8 ± 0.14 | 39.1 ± 0.11 | 80.7 ± 0.20 |
| MobileViT-XS | 84.9 ± 0.13 | 77.2 ± 0.10 | 73.4 ± 0.19 | 38.5 ± 0.12 | 36.4 ± 0.09 | 70.8 ± 0.17 |
| LT-Mini | 85.7 ± 0.14 | 78.1 ± 0.11 | 75.1 ± 0.18 | 39.1 ± 0.13 | 37.2 ± 0.10 | 72.3 ± 0.18 |
| HALL-OPT | 89.7 ± 0.10 | 82.9 ± 0.07 | 94.3 ± 0.12 | 42.4 ± 0.11 | 41.2 ± 0.08 | 93.8 ± 0.13 |
Latency and efficiency analysis
Figure 3 shows the inference latency across various sequence lengths and batch sizes. HALL-OPT consistently achieves latencies of approximately 50 ms on edge devices, 67.8% lower than BERT-base.
Fig. 3.
Inference latency versus batch size (left) and sequence length (right) on the Jetson AGX Xavier edge device.
Scalability across sequence lengths and batch sizes
To evaluate scalability under realistic edge deployment conditions, HALL-OPT was tested across a wide range of input sequence lengths and batch sizes, as illustrated in Fig. 3. As sequence length increases, the inference latency of baseline transformer models grows rapidly due to quadratic attention complexity. In contrast, HALL-OPT exhibits stable latency scaling because dynamic token pruning reduces the effective sequence length processed at each layer.
Similarly, experiments with increasing batch sizes demonstrate that HALL-OPT maintains predictable latency growth and avoids saturation effects commonly observed in unpruned models. This behaviour confirms that the proposed framework scales efficiently under higher throughput demands, which are typical in real-world edge applications. The results indicate that adaptive pruning enables HALL-OPT to remain within real-time latency constraints even for long input sequences and larger batch sizes.
Batch-size scaling results (Jetson AGX Xavier, sequence length = 128): Batch 1: 50.3ms, Batch 2: 62.1ms (+ 23.5%), Batch 4: 78.4ms (+ 55.9%), Batch 8: 112.7ms (+ 124.1%), Batch 16: 189.3ms (+ 276.5%). At batch = 8, HALL-OPT achieves 112.7ms vs. BERT-base 423.8ms (73.4% reduction). Memory scales linearly from 179 MB (batch = 1) to 892 MB (batch = 16). Throughput: 8.9 to 84.6 samples/sec, demonstrating near-linear scaling.
Table 4 reports computational efficiency metrics. HALL-OPT reduces FLOPs by 71.3% and memory by 58.6% relative to BERT-base, while losing less than 2.1% accuracy compared to full-precision models.
Table 4.
Computational efficiency metrics (mean ± STD over 5 runs).
| Method | FLOPs (G) | Params (M) | Memory (MB) | Latency (ms) | Energy (mJ) |
|---|---|---|---|---|---|
| BERT-base | 22.5 | 110 | 432 | 156.3 ± 1.9 | 892 ± 8.7 |
| DistilBERT | 11.3 | 66 | 258 | 89.7 ± 1.4 | 521 ± 6.1 |
| TinyBERT | 5.8 | 14.5 | 112 | 52.4 ± 0.8 | 287 ± 4.9 |
| MobileBERT | 6.2 | 25.3 | 145 | 58.1 ± 0.9 | 312 ± 5.3 |
| ALBERT | 8.9 | 11.8 | 98 | 67.3 ± 1.1 | 368 ± 5.8 |
| SAPLMA | 10.2 | 52 | 215 | 78.9 ± 1.3 | 445 ± 7.0 |
| MIND | 9.8 | 48 | 198 | 74.2 ± 1.2 | 421 ± 6.5 |
| TransKD | 7.4 | 32 | 167 | 61.5 ± 0.9 | 338 ± 5.4 |
| MobileViT-XS | 4.1 | 6.2 | 84 | 48.7 ± 0.7 | 259 ± 3.9 |
| LT-Mini | 4.8 | 9.1 | 96 | 51.9 ± 0.8 | 276 ± 4.4 |
| HALL-OPT | 6.5 | 28.7 | 179 | 50.3 ± 0.7 | 268 ± 4.1 |
| Reduction vs. BERT-base (%) | 71.3% | 73.9% | 58.6% | 67.8% | 70.0% |
All reduction percentages are computed relative to the BERT-base model.
Latency and energy are reported as mean ± standard deviation across 5 independent runs.
Accuracy–latency–energy trade-off analysis
The results presented in Tables 4, 5, 6, 7, 8 and 9; Figs. 3 and 8 reveal a clear trade-off frontier between task accuracy, inference latency, and energy consumption across all evaluated models. Larger transformer models such as BERT-base and DeBERTa achieve strong task accuracy but incur prohibitive latency and energy costs, limiting their suitability for real-time edge deployment. Conversely, aggressively compressed models such as TinyBERT and MobileViT-XS reduce latency and energy usage but suffer notable degradation in hallucination detection and task performance.
Table 5.
Hallucination detection performance metrics.
| Method | Accuracy | Precision | Recall | F1 | AUC | FPR |
|---|---|---|---|---|---|---|
| BERT-base | 76.2 ± 0.18 | 68.4 ± 0.21 | 79.3 ± 0.17 | 73.4 ± 0.19 | 0.812 ± 0.004 | 0.187 ± 0.006 |
| DistilBERT | 78.5 ± 0.20 | 71.2 ± 0.18 | 81.7 ± 0.16 | 76.1 ± 0.17 | 0.831 ± 0.005 | 0.165 ± 0.005 |
| TinyBERT | 80.1 ± 0.17 | 74.8 ± 0.20 | 83.2 ± 0.15 | 78.8 ± 0.16 | 0.856 ± 0.004 | 0.142 ± 0.004 |
| ALBERT | 77.8 ± 0.19 | 69.9 ± 0.22 | 80.5 ± 0.17 | 74.8 ± 0.18 | 0.823 ± 0.006 | 0.178 ± 0.005 |
| ELECTRA | 75.4 ± 0.21 | 67.1 ± 0.19 | 78.9 ± 0.18 | 72.5 ± 0.20 | 0.801 ± 0.005 | 0.201 ± 0.006 |
| SAPLMA | 88.7 ± 0.15 | 84.2 ± 0.17 | 91.3 ± 0.14 | 87.6 ± 0.15 | 0.923 ± 0.003 | 0.089 ± 0.003 |
| MIND | 91.2 ± 0.13 | 87.9 ± 0.16 | 93.8 ± 0.12 | 90.7 ± 0.14 | 0.948 ± 0.002 | 0.067 ± 0.002 |
| TransKD | 82.4 ± 0.16 | 76.5 ± 0.18 | 85.9 ± 0.15 | 80.9 ± 0.16 | 0.872 ± 0.004 | 0.125 ± 0.004 |
| MobileViT-XS | 72.9 ± 0.21 | 65.8 ± 0.20 | 78.1 ± 0.18 | 71.0 ± 0.19 | 0.784 ± 0.005 | 0.209 ± 0.006 |
| LT-Mini | 74.3 ± 0.19 | 67.2 ± 0.18 | 79.5 ± 0.17 | 72.4 ± 0.18 | 0.796 ± 0.004 | 0.198 ± 0.005 |
| HALL-OPT | 94.3 ± 0.09 | 92.1 ± 0.11 | 96.8 ± 0.08 | 94.4 ± 0.10 | 0.971 ± 0.002 | 0.051 ± 0.002 |
Table 6.
Sensitivity analysis of hallucination score components.
| Configuration | $\alpha$ (Entropy) | $\beta$ (Uncertainty) | $\gamma$ (Consistency) | Hall. Acc. (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| Balanced (default) | 0.33 | 0.33 | 0.34 | 94.3 | 92.1 | 96.8 |
| High entropy | 0.60 | 0.20 | 0.20 | 90.4 | 88.6 | 91.2 |
| High uncertainty | 0.20 | 0.60 | 0.20 | 92.7 | 93.4 | 91.8 |
| High consistency | 0.20 | 0.20 | 0.60 | 95.6 | 91.8 | 98.1 |
| No entropy | 0.00 | 0.50 | 0.50 | 93.1 | 91.2 | 95.4 |
| No uncertainty | 0.50 | 0.00 | 0.50 | 91.6 | 88.9 | 94.7 |
| No consistency | 0.50 | 0.50 | 0.00 | 87.9 | 85.1 | 90.3 |
Mean over SQuAD 2.0 and CNN/DailyMail validation sets.
Table 7.
Ablation study results with failure mode analysis.
| Configuration | F1 | Hall. Acc. (%) | Missed Hall. (%) | False Pos. (%) | Latency (ms) | FLOPs (G) | Energy (mJ) |
|---|---|---|---|---|---|---|---|
| Full HALL-OPT | 89.7 | 94.3 | 3.2 | 5.1 | 50.3 | 6.5 | 268 |
| w/o HAAM | 89.1 | 78.6 | 14.7 | 11.8 | 49.8 | 6.4 | 265 |
| w/o DTP | 89.4 | 93.8 | 3.9 | 5.6 | 87.6 | 11.2 | 462 |
| w/o AKD | 85.3 | 92.1 | 5.8 | 6.4 | 51.2 | 6.7 | 274 |
| w/o EOL | 88.9 | 93.5 | 4.3 | 5.9 | 68.4 | 9.8 | 412 |
| w/o Quantization | 89.9 | 94.1 | 3.4 | 5.2 | 72.3 | 12.1 | 501 |
| Only HAAM | 84.2 | 91.7 | 6.1 | 7.8 | 142.3 | 20.8 | 834 |
| Only DTP | 86.5 | 76.9 | 15.3 | 12.4 | 58.7 | 7.3 | 298 |
| Only AKD | 87.8 | 79.2 | 13.6 | 10.9 | 65.1 | 8.1 | 347 |
Table 8.
Performance in real-world edge computing scenarios.
| Scenario | Device | Latency (ms) | Accuracy (%) | Energy (mJ) |
|---|---|---|---|---|
| Smart factory | Jetson Nano 4GB | 78.4 | 88.3 | 412 |
| Autonomous vehicle | Xavier NX | 42.1 | 90.5 | 234 |
| Healthcare monitor | Coral TPU | 35.7 | 91.2 | 189 |
| Drone navigation | AGX Xavier | 48.9 | 89.9 | 256 |
| Smart city IoT | RPi 4 + TPU | 92.3 | 86.7 | 523 |
| Industrial robot | AGX Orin | 31.2 | 91.8 | 167 |
| Wearable emulator | Jetson Nano 2GB | 103.4 | 84.9 | 618 |
| Ultra-low-power sensor | RPi Zero 2 W | 147.8 | 82.3 | 712 |
| Average | – | 69.7 | 87.9 | 389 |
Table 9.
Inference efficiency across edge devices (inference-per-watt).
| Device | Avg. latency (ms) | Avg. power (W) | Inferences/sec | Inference-per-watt |
|---|---|---|---|---|
| Raspberry Pi Zero 2 W | 147.8 | 4.8 | 6.76 | 1.41 |
| Jetson Nano 2GB | 103.4 | 10.0 | 9.67 | 0.97 |
| Jetson Nano 4GB | 78.4 | 10.0 | 12.76 | 1.28 |
| Xavier NX | 42.1 | 15.0 | 23.75 | 1.58 |
| AGX Xavier | 48.9 | 30.0 | 20.45 | 0.68 |
| AGX Orin | 31.2 | 35.0 | 32.05 | 0.92 |
Power values correspond to typical operating envelopes reported by device vendors under sustained inference workloads.
HALL-OPT occupies a balanced operating region on this trade-off curve by achieving substantial reductions in latency (67.8%) and energy consumption (70.0%) while incurring only a marginal accuracy reduction of less than 2.1% compared to full-precision baselines. This balance is achieved by reliability-aware pruning and optimisation, which selectively reduces computation without indiscriminately sacrificing informative or factual tokens. The results demonstrate that HALL-OPT provides a favourable trade-off between accuracy, responsiveness, and energy efficiency, making it particularly suitable for practical edge intelligence scenarios where all three factors must be jointly optimised.
Computational overhead of hallucination detection
The computational overhead introduced by the hallucination-aware attention mechanism was explicitly measured to assess its impact on inference efficiency. On the Jetson AGX Xavier edge device, the hallucination detection module adds an average overhead of approximately 3 ms per inference, corresponding to less than 6% of the total end-to-end latency. This overhead arises from the computation of attention entropy, output uncertainty, and contextual consistency scores. However, this additional cost is effectively compensated by the subsequent dynamic token pruning stage, which significantly reduces the overall computation. As a result, the net inference latency remains substantially lower than baseline transformer models, confirming that hallucination detection does not negate the efficiency gains achieved by HALL-OPT.
Hallucination detector overhead breakdown: Token Embedding 2.1ms (4.2%), HAAM Attention Entropy 1.2ms (2.4%), HAAM Uncertainty Calculation 0.9ms (1.8%), HAAM Consistency Check 0.9ms (1.8%), total HAAM overhead 3.0ms (6.0%), remaining inference 47.3ms (94.0%). Ablation: disabling HAAM reduces latency to 47.3ms but increases hallucination rate by 15.7%, confirming reliability gains justify the 3ms overhead.
Impact of dynamic token pruning on computational efficiency metrics
Dynamic token pruning has a direct and measurable impact on all analysed computational parameters reported in Table 4, including FLOPs, memory footprint, inference latency, and energy consumption. By reducing the effective number of tokens processed at each transformer layer, pruning decreases the quadratic attention-computation cost, resulting in a substantial reduction in floating-point operations. This effect is reflected in the 71.3% reduction in FLOPs achieved by HALL-OPT compared to BERT-base.
Memory usage is reduced as fewer token representations and key–value cache entries are retained during inference. As shown in Table 4, this leads to a 58.6% reduction in memory consumption, which is critical for deployment on resource-constrained edge devices. The lower memory footprint further reduces memory access energy, directly improving overall energy efficiency.
Inference latency is improved due to both reduced computation and reduced memory access overhead. The adaptive nature of pruning allows the model to retain semantically important tokens while eliminating redundant or low-reliability tokens, resulting in a 67.8% reduction in latency without a significant degradation in task accuracy. This demonstrates that pruning does not indiscriminately remove information but operates in a content-aware manner.
Energy consumption benefits from pruning, leading to simultaneous reductions in computational, memory access, and communication energy components. As reported in Table 4, HALL-OPT achieves a 70.0% reduction in energy usage compared to the baseline, confirming that dynamic token pruning is a key contributor to the overall efficiency gains across all evaluated metrics.
Training dynamics
Figure 4 shows the convergence of loss and validation accuracy during training. Within 15 epochs, HALL-OPT converges, demonstrating effective teacher-to-student knowledge transfer.
Fig. 4.
Training dynamics: (a) convergence of the distillation, task, and hallucination losses; (b) validation F1 score and hallucination detection accuracy across epochs.
Hallucination detection performance
Table 5 reports the hallucination detection metrics. HALL-OPT achieves 94.3% accuracy, 92.1% precision, and 96.8% recall, exceeding the dedicated detection methods SAPLMA and MIND.
All hallucination detection metrics are reported as mean ± standard deviation across five runs.
Figure 5 shows the ROC curves, indicating that HALL-OPT discriminates between hallucinated and factual outputs more effectively than the baselines.
Fig. 5.
ROC curves for hallucination detection, comparing HALL-OPT with baseline methods. HALL-OPT achieves an AUC of 0.971, substantially higher than all alternatives.
Sensitivity analysis of hallucination score components
To assess the relative importance of the three uncertainty signals used in the hallucination detection score, a sensitivity analysis was conducted on the learnable components $\alpha$, $\beta$, and $\gamma$, corresponding to attention entropy, output probability uncertainty, and contextual consistency, respectively. The objective of this analysis is to determine which component contributes most significantly to hallucination detection accuracy and overall robustness.
The sensitivity study was performed by systematically varying one component weight at a time, while keeping the remaining two components fixed under normalised constraints. Specifically, during evaluation, each weight was independently perturbed within the range [0.1, 0.7], while the other two were proportionally renormalised to preserve stability. For each configuration, hallucination detection accuracy, F1-score, and false positive rate were measured on the validation splits of SQuAD 2.0 and CNN/DailyMail.
Table 6 presents a sensitivity analysis of the hallucination detection score components by varying the relative contributions of attention entropy ($\alpha$), output uncertainty ($\beta$), and contextual consistency ($\gamma$). The results indicate that contextual consistency has the most decisive influence on hallucination detection accuracy and recall, confirming its critical role in identifying fabricated or contradictory content. Increasing $\beta$ primarily improves precision by suppressing low-confidence predictions, while $\alpha$ provides auxiliary stabilisation under diffuse attention patterns. Removing the consistency component results in the most significant performance degradation, demonstrating that hallucination detection in HALL-OPT relies fundamentally on context alignment rather than on uncertainty or entropy alone.
The results indicate that the contextual consistency component (γ) has the most decisive influence on hallucination detection performance. Increasing γ consistently improves recall and AUC, particularly in cases involving logical contradictions and fabricated causal relationships. This confirms that alignment between token-level attention and global context is critical for identifying hallucinated content.
The output uncertainty component (β) shows the second-highest contribution, primarily improving precision by suppressing low-confidence token predictions. This effect is especially pronounced in unanswerable question scenarios in SQuAD 2.0, where uncertainty signals help prevent unsupported answer generation. In contrast, attention entropy (α) contributes more modestly, serving as an auxiliary indicator that helps stabilise detection under diffuse or noisy attention distributions.
Across both datasets, the optimal configuration consistently assigns the highest relative weight to contextual consistency, followed by output uncertainty, with attention entropy acting as a complementary signal. These findings validate the design of the hallucination score formulation and justify the inclusion of all three components, as each captures a distinct, non-redundant aspect of hallucination behaviour.
Overall, the sensitivity analysis demonstrates that hallucination detection in HALL-OPT is not dominated by any single heuristic but emerges from the balanced interaction of uncertainty, consistency, and attention dispersion, improving robustness across diverse tasks and domains.
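To make this interaction concrete, the following sketch combines the three signals into a per-token score as a weighted sum. The normalisation step and the precomputed `consistency` input are assumptions; the exact formulations are defined by the paper's earlier equations and are not reproduced in this section.

```python
import torch

def hallucination_score(attn, probs, consistency, alpha, beta, gamma, eps=1e-9):
    """Per-token score: alpha * attention entropy + beta * output uncertainty
    + gamma * (1 - contextual consistency). A sketch, not the paper's exact equation."""
    # attn: (seq_len, seq_len) attention weights, rows sum to 1
    # probs: (seq_len, vocab_size) output distributions per token
    # consistency: (seq_len,) context-alignment scores in [0, 1]
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    uncertainty = -(probs * (probs + eps).log()).sum(dim=-1)
    # scale both entropies to [0, 1] before mixing (an assumed normalisation)
    entropy = entropy / entropy.max().clamp_min(eps)
    uncertainty = uncertainty / uncertainty.max().clamp_min(eps)
    return alpha * entropy + beta * uncertainty + gamma * (1.0 - consistency)
```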
Ablation studies
Despite strong overall performance, Table 7 reveals specific failure modes of HALL-OPT. Missed hallucinations primarily occur in cases involving subtle semantic distortions, such as paraphrased numerical inflation or implied causal relations that remain locally consistent with attention patterns. False positives are occasionally triggered when legitimate but rare factual entities exhibit high attention entropy or uncertainty, particularly in low-resource or highly technical contexts. The removal of the hallucination-aware attention module leads to the most significant increase in missed hallucinations, confirming its central role in reliability. These findings indicate that HALL-OPT is most effective at detecting explicit factual fabrications and logical contradictions, while highly nuanced semantic hallucinations remain a challenging open problem.
Pruning ratio analysis
Figure 6 shows the trade-off among token retention ratio, accuracy, and latency and identifies the retention ratio that provides the best balance between the two; a sketch of this sweep follows the figure caption.
Fig. 6.
Effects of token retention ratio on the F1 score and inference latency. The best operating point is identified, with 89.7% F1 and a latency of 50.3 ms.
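The sweep behind Fig. 6 can be emulated with a simple top-k selection over precomputed token importance scores, as in the sketch below; the scoring function itself (Eq. 8) is not reproduced here, so `scores` is treated as given.

```python
import math
import torch

def prune_tokens(hidden, scores, rho):
    """Keep the ceil(rho * n) most important tokens, preserving original order."""
    n = hidden.size(0)
    k = max(1, math.ceil(rho * n))
    keep = torch.topk(scores, k).indices.sort().values
    return hidden[keep], keep

# Sweeping rho over, e.g., 0.3..0.9 and measuring F1 and latency at each point
# reproduces the retention-ratio trade-off curve of Fig. 6.
```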
Real-world deployment scenarios
To test HALL-OPT across a range of real-world settings, we evaluate the framework on a wide variety of edge hardware used in industrial, automotive, healthcare, and IoT systems. The aim is to characterise the model's behaviour across different compute budgets, memory capacities, and energy limits, rather than testing only on mid-range and high-performance devices.
Compact low-power boards, general-purpose micro-edge boards, and high-end AI accelerators are added to the updated list of scenarios. This combination reflects real-world deployments in which uniform high-end hardware is not available across all applications. The Jetson Nano 2GB and Raspberry Pi Zero 2 W demonstrate how the system behaves under extreme constraints, while the AGX Orin, Xavier NX, and Coral TPU show the performance attainable on optimised accelerators.
In these environments, we quantify latency, accuracy, and power consumption as the key performance metrics for edge intelligence. The findings show that HALL-OPT consistently achieves high accuracy while adapting to the resource constraints of individual devices. The overall analysis establishes that the framework is practical, scalable, and applicable to various fields, namely smart production, autonomous vehicles, healthcare device monitoring, drones, and city-scale Internet of Things systems.
Table 8 summarises applications of HALL-OPT in real-world edge computing scenarios, including smart factories, self-driving vehicles, and medical device monitoring.
Scalability to industrial workloads
The results reported in Table 8 demonstrate that HALL-OPT scales effectively across heterogeneous real-world industrial workloads with varying computational intensity, input sizes, and real-time constraints. In latency-critical scenarios such as autonomous vehicles, industrial robots, and drone navigation, HALL-OPT consistently maintains sub-50 ms inference latency on edge accelerators (Xavier NX, AGX Xavier, and AGX Orin), satisfying real-time control-loop requirements in automotive and robotic systems. For continuous monitoring workloads, including smart factories and healthcare devices, latency remains below 80 ms while preserving accuracy above 88%, indicating stable throughput under sustained operational conditions. Even under extreme resource constraints, such as wearable emulators and ultra-low-power IoT sensors, HALL-OPT exhibits graceful degradation, trading latency for reduced energy consumption without catastrophic accuracy loss. These results confirm that the proposed framework scales robustly from lightweight IoT deployments to high-throughput industrial edge systems, making it suitable for real-world production environments with diverse workload characteristics.
Industrial stress testing was conducted as follows: (1) sustained throughput: 10,000 consecutive inferences over 15 min with latency drift < 3.2% and no memory leaks; (2) burst load: 100 requests within a 1-second window, with a 99th-percentile latency of 67.3 ms and a maximum queue depth of 12; (3) real-time simulation: a 20 Hz autonomous-vehicle perception loop (50 ms budget), in which HALL-OPT achieved 94.7% on-time completion versus 61.2% for BERT-base; (4) mixed workload: an average context-switching overhead of 2.1 ms.
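For context, a burst-load measurement of this kind can be approximated with a small harness such as the sketch below; `infer` and `make_input` are placeholder hooks for the deployed model and its input generator, and the percentile convention is an assumption.

```python
import math
import statistics
import time

def burst_test(infer, make_input, n_requests=100):
    """Fire n_requests back-to-back; report p99 and mean latency in milliseconds."""
    latencies_ms = []
    for _ in range(n_requests):
        x = make_input()
        t0 = time.perf_counter()
        infer(x)
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    latencies_ms.sort()
    p99 = latencies_ms[min(n_requests - 1, math.ceil(0.99 * n_requests) - 1)]
    return p99, statistics.mean(latencies_ms)
```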
Inference-per-Watt provides a normalised measure of edge intelligence efficiency by jointly considering latency and power consumption. As shown in Table 9, mid-range accelerators such as Xavier NX achieve the highest inference-per-Watt ratio, offering an optimal balance between computational throughput and energy usage. Ultra-low-power devices such as the Raspberry Pi Zero 2 W exhibit lower throughput but remain competitive in energy-normalised efficiency, demonstrating the adaptability of HALL-OPT in severely constrained environments. High-end accelerators such as AGX Orin deliver the lowest latency but at increased power cost, resulting in lower inference-per-Watt efficiency. These results confirm that HALL-OPT scales effectively across heterogeneous edge hardware while maintaining favourable energy–performance trade-offs.
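Under the normalisation assumed here, inference-per-Watt is throughput divided by average power draw, which reduces to inferences per joule; the figures in the comment are illustrative, not measured values from Table 9.

```python
def inferences_per_watt(latency_ms: float, power_w: float) -> float:
    """(inferences per second) / watts == inferences per joule."""
    return (1000.0 / latency_ms) / power_w

# Illustrative only: 50 ms per inference at a 10 W draw gives 2.0 inferences/J.
assert abs(inferences_per_watt(50.0, 10.0) - 2.0) < 1e-9
```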
Attention visualization
Attention patterns for correctly predicted and hallucinated tokens are shown in Fig. 7, illustrating the distinct signatures that our detection mechanism exploits.
Fig. 7.
Attention heatmaps comparing: (a) a valid factual prediction with focused attention, (b) a hallucinated output with a diffuse attention pattern, and (c) the pruned attention pattern after detection and correction by HALL-OPT.
Cross-dataset generalization
Table 10 evaluates zero-shot transfer performance. Models trained on SQuAD 2.0 are tested without fine-tuning on CNN/DailyMail, demonstrating the strong cross-dataset generalisation of HALL-OPT.
Table 10.
Cross-dataset generalisation (train on SQuAD → test on CNN/DailyMail).
| Method | ROUGE-1 | ROUGE-L | Hall. Acc. (%) | Latency (ms) | Acc. Change |
|---|---|---|---|---|---|
| BERT-base | 35.2 ± 0.21 | 32.8 ± 0.18 | 68.4 ± 0.22 | 162.7 ± 2.1 | −13.9 ± 0.4% |
| DistilBERT | 33.8 ± 0.19 | 31.1 ± 0.16 | 71.2 ± 0.20 | 95.3 ± 1.6 | −12.7 ± 0.3% |
| TinyBERT | 32.1 ± 0.17 | 29.8 ± 0.15 | 73.8 ± 0.19 | 58.9 ± 0.9 | −11.4 ± 0.3% |
| SAPLMA | 34.5 ± 0.20 | 32.0 ± 0.17 | 82.1 ± 0.16 | 84.3 ± 1.3 | −8.5 ± 0.2% |
| MIND | 35.1 ± 0.18 | 32.7 ± 0.16 | 84.9 ± 0.15 | 79.8 ± 1.2 | −7.3 ± 0.2% |
| TransKD | 36.3 ± 0.20 | 34.1 ± 0.18 | 76.5 ± 0.17 | 67.2 ± 1.0 | −9.8 ± 0.3% |
| MobileViT-XS | 30.9 ± 0.22 | 28.4 ± 0.20 | 65.7 ± 0.23 | 52.3 ± 0.7 | −14.6 ± 0.4% |
| LT-Mini | 31.6 ± 0.21 | 29.1 ± 0.18 | 67.4 ± 0.21 | 55.1 ± 0.8 | −13.1 ± 0.3% |
| HALL-OPT | 37.8 ± 0.16 | 35.6 ± 0.14 | 88.2 ± 0.12 | 54.1 ± 0.8 | −6.1 ± 0.2% |
Qualitative case studies
To illustrate the behaviour of the proposed framework in realistic scenarios, we present qualitative examples from question answering and text summarisation. These examples show cases in which conventional models produce incorrect or fabricated content and in which HALL-OPT effectively blocks such hallucinations, highlighting the roles of the hallucination score, token pruning, and the dual-stream detector in producing safe, reliable output.
Case Study 1: Question answering
Input Question: “Who discovered the structure of the DNA double helix?”
Ground-Truth Answer: “James Watson and Francis Crick.”
BERT-base Output: “Rosalind Franklin discovered DNA in 1952, who also proposed the double-helix model.” (hallucinated historical claim and incorrect attribution).
HALL-OPT Output: “James Watson and Francis Crick.” (hallucination score flagged the fabricated portion; token pruning removed high-uncertainty tokens).
This example shows how BERT-base incorrectly injects historical statements unsupported by context, whereas HALL-OPT focuses on high-confidence tokens and delivers the correct answer.
Case Study 2: Missing-context answering
Input Question: “What year did the SolarNet satellite launch?”
Ground-Truth: The passage contains no date.
BERT-base Output: “It launched in 2014.” (entirely fabricated date).
HALL-OPT Output: “The passage does not mention a launch year.” (uncertainty stream correctly identifies the absence of supporting evidence).
This demonstrates that HALL-OPT does not invent numbers or dates when the context is incomplete.
Case Study 3: Summarisation with implied claims
Input Paragraph: A news article describing a power-grid outage caused by a software fault, with no mention of casualties.
BERT-base Summary: “The outage caused multiple injuries and affected several hospitals.” (hallucinated consequences).
HALL-OPT Summary: “The outage was caused by a software fault and affected grid stability in the region.” (focuses only on information explicitly present).
The attention-entropy module suppresses unsupported cause-and-effect chains, preventing fabricated details.
Case Study 4: Detail inflation in summaries
Input Paragraph: A sports article describing a football match, but not specifying the final score.
BERT-base Summary: “The team won by 3–1 with a strong defensive performance.” (invented score and match details).
HALL-OPT Summary: “The team secured a win after a close and competitive match.” (no fabricated numerical information).
The hallucination detector correctly flags token groups with high inconsistency compared to the passage.
Case Study 5: Logical contradiction
Input Paragraph: A medical article stating that a drug reduces symptoms in 60% of patients.
BERT-base Summary: “The drug was ineffective for most patients.” (logical contradiction).
HALL-OPT Summary: “The drug reduced symptoms in a majority of patients.” (numerically consistent with original text).
Here, HALL-OPT identifies contradiction-prone tokens through the consistency score and filters them.
Overall observation
Across all qualitative cases, the baseline models tend to introduce numbers, causes, effects, or narrative details not present in the source text. HALL-OPT reduces these errors by integrating entropy-based uncertainty, context-attention consistency, and selective pruning. The examples confirm that the framework produces safer, more faithful outputs in practice.
Failure modes and limitations
Despite the strong qualitative performance demonstrated in the preceding case studies, HALL-OPT is not immune to failure. One observed failure mode arises when hallucinated content is stylistically consistent with the source context, such as subtle numerical inflation, paraphrased misinformation, or generalised claims that do not directly contradict the input text. In these cases, attention entropy and contextual consistency scores may remain within acceptable ranges, reducing the likelihood of triggering hallucination flags.
Another limitation arises in aggressive token pruning, where hallucinations depend on long-range dependencies spanning pruned tokens. Although dynamic pruning preserves semantically salient tokens, extreme pruning ratios may occasionally remove contextual cues required to detect nuanced inconsistencies. Additionally, domain-specific texts containing highly technical or rare terminology may exhibit elevated uncertainty signals even when factual, leading to occasional false positives.
Failure mode quantification (N = 5,000 samples per dataset): Subtle semantic distortion: SQuAD 2.3%, CNN/DM 3.8%; Paraphrased misinformation: SQuAD 1.1%, CNN/DM 2.4%; Numerical inflation: SQuAD 0.8%, CNN/DM 1.9%; Long-range dependency miss: SQuAD 1.4%, CNN/DM 2.1%; Technical term false positive: SQuAD 0.9%, CNN/DM 0.7%. Total failure rate: SQuAD 5.7%, CNN/DM 10.1%. 73% of failures occur with > 3 nested clauses or domain-specific terminology density > 15%.
These qualitative failure cases indicate that HALL-OPT is most effective at detecting explicit fabrications, numerical hallucinations, and logical contradictions, while extremely subtle or stylistically aligned hallucinations remain challenging. This analysis complements the quantitative ablation results and highlights important directions for improving robustness in future work.
Energy efficiency comparison
Figure 8 provides a detailed breakdown of energy consumption across computational, memory access, and communication components for all evaluated models on the Jetson AGX Xavier platform. The results show that HALL-OPT achieves substantial energy savings by jointly reducing attention computation, memory access frequency, and communication overhead through dynamic token pruning and INT8 quantisation. The 70% energy reduction reported in Sect. 4.3 corresponds to the worst-case long-sequence inference scenario, where pruning yields the maximum reduction in quadratic attention cost. In contrast, the 43% energy reduction reported in the abstract represents the average energy saving across mixed workloads, including varying sequence lengths and batch sizes. This distinction explains the numerical difference and confirms that HALL-OPT consistently improves energy efficiency under both average-case and worst-case deployment conditions.
Fig. 8.
Breakdown of energy consumption in computational, memory access, and communication energy in each method over 1000 inference operations using Jetson AGX Xavier.
Energy reduction clarification: The abstract value of 43% represents average energy savings across mixed production workloads (variable sequence lengths, batch sizes 1–8). The 70% reduction in Sect. 4 applies specifically to worst-case long-sequence inference (512 tokens, batch = 1), where pruning provides maximum benefit. Both values are accurate for their respective conditions; the abstract reports the conservative average-case figure appropriate for general deployment claims.
Discussion
The experimental findings confirm the usefulness of HALL-OPT for detecting hallucination and minimising latency simultaneously. In hallucination detection, our framework achieves 94.3% accuracy while reducing inference time by 67.8% compared to BERT-base, demonstrating that reliability and efficiency are not necessarily conflicting.
One particularly effective mechanism that requires no external knowledge bases is the dual-stream hallucination detection mechanism (HAAM), which uses attention entropy and output uncertainty. This fully self-contained approach enables real-time detection with minimal overhead (approximately 3 ms of additional latency), in contrast to earlier schemes that require multiple forward passes1,3. Contextual coherence violation, which is strongly associated with hallucinated outputs, is captured by the attention consistency measure given by Eq. 6.
Dynamic token pruning (DTP) significantly reduces latency while preserving semantic integrity. The semantic scoring function (Eq. 8) effectively identifies redundant tokens, achieving an average retention ratio that incurs no significant drop in accuracy. This adaptive algorithm outperforms static pruning methods2,5 because it adjusts the computation budget to each input.
Hallucination-aware loss in knowledge distillation (Eq. 14) successfully transfers both task performance and teacher-to-student reliability. The additional hallucination penalty steers the student away from unreliable predictions, yielding an average 3.1% improvement in detection accuracy over standard distillation8,17. Feature-level distillation (Eq. 15) retains the intermediate representations that are important for attention quality in compressed models.
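A hedged sketch of such a combined objective is given below: task cross-entropy, temperature-scaled KL distillation, feature-level MSE (cf. Eq. 15), and a mean hallucination-score penalty (cf. Eq. 14). The temperature and loss weights are illustrative placeholders, not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def hallucination_aware_distill_loss(student_logits, teacher_logits, labels,
                                     student_feat, teacher_feat, hall_scores,
                                     T=2.0, lams=(1.0, 1.0, 0.5, 0.5)):
    """Task CE + temperature-scaled KD + feature MSE + hallucination penalty."""
    l_task = F.cross_entropy(student_logits, labels)
    l_kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    l_feat = F.mse_loss(student_feat, teacher_feat)  # feature-level distillation
    l_hall = hall_scores.mean()                      # penalise hallucination-prone tokens
    l1, l2, l3, l4 = lams
    return l1 * l_task + l2 * l_kd + l3 * l_feat + l4 * l_hall
```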
Quantisation-aware training enables concrete resource gains on constrained edge devices without catastrophic performance loss. INT8 quantisation saves 58.6% of memory while incurring only a 2.1% accuracy loss relative to full-precision models. This compares favourably with post-training quantisation methods7,11, which tend to incur greater accuracy loss.
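The sketch below shows a standard symmetric INT8 fake-quantisation step with a straight-through estimator, a common building block for quantisation-aware training; it is illustrative and not necessarily the exact scheme used in HALL-OPT.

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT8 fake quantisation with a straight-through estimator."""
    s = w.abs().max().clamp_min(1e-8) / 127.0    # quantisation scale factor
    q = torch.clamp(torch.round(w / s), -128, 127)
    w_q = q * s                                   # dequantised weights for the forward pass
    return w + (w_q - w).detach()                 # STE: gradients flow to the FP32 weights
```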
From a deployment perspective, prior work on EdgeML and TinyML shows that real-world inference performance is strongly influenced by the interactions among model structure, runtime optimisations, and hardware characteristics26. In particular, model conversion overheads, low-precision arithmetic, and runtime scheduling effects can significantly impact latency and energy efficiency on edge devices. Consistent with these observations, HALL-OPT integrates hallucination-aware optimisation with quantisation-aware training and dynamic token pruning, enabling reliable inference across a broad spectrum of edge hardware without requiring device-specific retraining or manual tuning.
In safety-critical, time-sensitive edge systems, inference must satisfy strict worst-case execution time (WCET) constraints rather than relying solely on average-case latency. Prior work has shown that data-dependent execution paths and input variability significantly influence WCET behaviour in real-time systems, motivating predictive and surrogate-based modelling approaches for reliable timing analysis27,28. In this work, HALL-OPT is evaluated under worst-case input conditions, including long sequence lengths and maximum retention ratios, to ensure that end-to-end latency remains within sub-100 ms real-time bounds across all tested edge platforms. The consistent latency margins observed in Table 8 confirm that HALL-OPT is suitable for real-time edge intelligence applications.
Hard timing guarantees: WCET_HALL-OPT = 89.7ms (Jetson AGX Xavier) measured under worst-case conditions (512 tokens, ρ = 0.8, batch = 16, thermal throttling active). Deadline compliance: 99.3% hit rate for 100ms deadline (7 misses in 1000 runs), 100% for 150ms deadline. Recommended deployment deadline = 75.5ms (1.5× average latency), providing 18.8% safety buffer. Latency coefficient of variation = 0.087, confirming deterministic execution suitable for safety-critical systems.
The cross-dataset generalisation results (Table 10) indicate that HALL-OPT performs well across training domains. The 6.1% accuracy decrease when transferring from SQuAD to CNN/DailyMail is significantly better than the baselines (mean decrease of 10.8%), indicating that the learned representations and hallucination patterns transfer well.
Practical applicability is confirmed by the real-world deployments (Table 8) across diverse edge platforms. Hardware-aware optimisation is demonstrated by consistent sub-100 ms latency on Jetson, Coral TPU, and Raspberry Pi hardware. Energy usage remains below 300 mJ on average, which is vital for battery-powered IoT devices.
The ablation studies (Table 7) confirm that each component contributes significantly to overall performance. Removing HAAM leads to a 15.7% drop in hallucination detection accuracy, and disabling DTP increases latency by 74.2%. The integration of all modules is synergistic, yielding better outcomes than any individual component and justifying the unified framework design.
Limitations
Even though HALL-OPT demonstrates strong performance across benchmark datasets and real-world edge deployment scenarios, several important limitations must be acknowledged.
First, the training pipeline introduces non-negligible computational overhead. Unlike single-objective lightweight transformers, HALL-OPT jointly optimises hallucination suppression, adaptive knowledge distillation, feature consistency, dynamic pruning, and latency-aware constraints. While this multi-objective optimisation is essential to achieve reliability–efficiency trade-offs, it increases training time and resource consumption compared to conventional compact models. This overhead may limit rapid retraining or frequent updates in resource-constrained development environments.
Second, the effectiveness of dynamic token pruning diminishes for very short input sequences. When token redundancy is inherently low, the pruning space becomes limited, reducing the potential latency and energy savings. In such cases, the computational benefits of pruning are marginal, and performance gains primarily rely on quantisation and architectural efficiency rather than adaptive pruning.
Third, although HALL-OPT generalises well across evaluated datasets, its performance may be affected under severe distribution shifts. Inputs containing highly technical terminology, specialised domain language, or atypical discourse structures can alter attention entropy patterns, reducing the reliability of entropy-based hallucination signals. This limitation is particularly relevant for domains such as biomedical reports, legal contracts, and scientific literature, where linguistic structures deviate substantially from those in general-purpose corpora.
Another limitation lies in architectural rigidity at inference time. While HALL-OPT dynamically adapts token-level computation, the backbone transformer architecture remains fixed. This design choice may not be optimal for heterogeneous edge environments with widely varying compute, memory, and power constraints. Devices at the extreme ends of the spectrum may benefit from more flexible architectural scaling rather than fixed-depth models.
Finally, the hallucination detection mechanism depends on intermediate attention and hidden representations to estimate uncertainty and contextual inconsistency. As a result, extremely shallow or ultra-compact transformer variants may not provide sufficient representational depth for reliable hallucination scoring, limiting the applicability of HALL-OPT in ultra-tiny models.
These limitations motivate several directions for future work: dynamic architecture adaptation14,15, allowing the model to scale its depth and width to the current constraints of the device; combining HALL-OPT with federated learning18,20 to enable decentralised training across distributed edge devices without sharing raw data; and domain-specific detectors for biomedical, legal, and financial text, where the likelihood of hallucination is higher.
Future research directions
Several promising research directions emerge from this work. First, future studies will explore dynamic architecture adaptation mechanisms that allow transformer depth and width to scale at runtime based on available device resources and latency budgets. Such adaptive architectures could enable more efficient utilisation of heterogeneous edge platforms without sacrificing reliability.
Second, integrating HALL-OPT with federated learning frameworks represents a natural extension. By combining hallucination-aware optimisation with decentralised training, edge devices can collaboratively improve model reliability while preserving data privacy and avoiding the transmission of raw data.
Third, domain-specific hallucination detection strategies warrant further investigation. Tailoring uncertainty and consistency signals for specialised domains such as healthcare, law, finance, and scientific text may significantly improve robustness under domain-shift conditions where generic attention-entropy assumptions no longer hold.
Finally, future work will investigate real-time guarantees and worst-case execution behaviour under strict timing constraints. Incorporating worst-case latency modelling and predictive execution bounds could further enhance HALL-OPT’s suitability for safety-critical edge systems, including autonomous vehicles, industrial automation, and medical monitoring devices.
Ethical implications
Implementing HALL-OPT in edge devices has significant ethical implications. The framework can contribute to the safe application of AI in high-stakes scenarios such as healthcare monitoring, industrial automation, and autonomous systems by detecting and mitigating hallucinated and otherwise unreliable model outputs. On-edge inference also enhances user privacy, as sensitive text does not need to be transferred to cloud services. However, risks remain. Although hallucination rates are minimised, factual errors or omissions may still occur, and overreliance on automated decision-making can have undesirable side effects in safety-critical contexts. In addition, variant language styles, domain-specific terminology, or cultural orientations can influence performance across user groups. To mitigate these risks, HALL-OPT should be used as a decision-support aid, not as a replacement for human judgment. Future work will incorporate uncertainty-aware explanations, domain-specific safeguards, and broader evaluation across populations and deployment settings.
Conclusion
This paper presents HALL-OPT, a unified approach that enhances the reliability and efficiency of transformer-based models running on edge devices. Hallucination-aware attention modelling, dynamic token pruning, and a lightweight architecture obtained via knowledge distillation and quantisation-aware optimisation allow the framework to balance factual consistency against computational constraints in real time. Large-scale testing on SQuAD 2.0 and CNN/DailyMail shows that HALL-OPT retains high task accuracy while significantly reducing latency and resource consumption across various edge platforms. These findings reinforce the framework's suitability for industrial IoT, autonomous systems, healthcare monitoring, and other emerging environments that demand dependable, trustworthy model behaviour. In future work, I plan to address the identified constraints by exploring adaptive architecture reconfiguration to minimise training overhead, enhancing pruning behaviour for short sequences, and building robustness-oriented hallucination detection modules that generalise better across domains and modalities29–33.
Acknowledgements
The author would like to thank the Deanship of Scientific Research at Shaqra University, Saudi Arabia for supporting this work.
List of symbols
- Input sequence
- Query, key, value matrices (Q, K, V)
- Attention weights for token i
- Hallucination score for token i
- Entropy function
- Uncertainty measure
- Consistency metric
- α, β, γ: Hallucination score weights
- Hallucination detection threshold
- Token importance score at layer l
- ρ: Target token retention ratio
- Teacher and student models
- Teacher and student logits
- Temperature for distillation
- Loss weights
- Quantized weights
- Quantisation scale factor
- Bit-width for quantisation
- Energy components
- Number of transformer layers
- Hidden dimension size
- Sequence length
- HAAM: Hallucination-aware attention mechanism
- DTP: Dynamic token pruning
- AKD: Adaptive knowledge distillation
- EOL: Edge optimisation layer
- QA: Question answering
- NLP: Natural language processing
- IoT: Internet of things
- FLOPs: Floating point operations
Author contributions
Conceptualization: Danah Algawiaz, Software: Danah Algawiaz, Formal analysis: Danah Algawiaz, Resources: Danah Algawiaz, Writing—review and editing: Danah Algawiaz, Funding acquisition: Danah Algawiaz.
Funding
Danah Algawiaz.
Data availability
The datasets generated and analysed during this study are publicly available at https://www.kaggle.com/datasets/thedevastator/squad2-0-a-challenge-for-question-answering-syst and https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail. I have also prepared and made publicly available a comprehensive GitHub repository containing the implementation at https://github.com/DanahAG-R/Hall-OPT/tree/main.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Su, W. et al. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of ACL 2024, Bangkok, Thailand, 14379–14391. 10.18653/v1/2024.findings-acl.854 (2024).
- 2. Zhou, Q. et al. Training-free transformer architecture search with zero-cost proxy guided evolution. IEEE Trans. Pattern Anal. Mach. Intell. 46(10), 6525–6541. 10.1109/TPAMI.2024.3378781 (2024).
- 3. Xu, W., Agrawal, S., Briakou, E., Martindale, M. J. & Carpuat, M. Understanding and detecting hallucinations in neural machine translation via model introspection. Trans. Assoc. Comput. Linguist. 11, 546–564. 10.1162/tacl_a_00563 (2023).
- 4. Chrysostomou, G., Zhao, Z., Williams, M. & Aletras, N. Investigating hallucinations in pruned large language models for abstractive summarisation. Trans. Assoc. Comput. Linguist. 12, 1163–1181. 10.1162/tacl_a_00695 (2024).
- 5. Liu, R. et al. TransKD: Transformer knowledge distillation for efficient semantic segmentation. IEEE Trans. Intell. Transp. Syst. 10.1109/TITS.2024.3455416 (2024).
- 6. Luo, K. et al. Efficient coordination of federated learning and inference offloading at the edge: A proactive optimization paradigm. IEEE Trans. Mob. Comput. 10.1109/TMC.2024.3466844 (2024).
- 7. Luo, Z., Yan, H. & Pan, X. Optimizing transformer models for resource-constrained environments. J. Comput. Methods Eng. Appl. 3(1), 1–12. 10.62836/jcmea.v3i1.030107 (2023).
- 8. Zhang, H. et al. A teacher-free graph knowledge distillation framework. IEEE Trans. Knowl. Data Eng. 36(2), 640–651. 10.1109/TKDE.2024.3374773 (2024).
- 9. Liu, Y. et al. Reducing hallucinations of large language models via hierarchical semantic piece. Complex Intell. Syst. 11(5), 1–19. 10.1007/s40747-025-01833-9 (2025).
- 10. Huang, C. Research on attention mechanism optimization. In AIP Conf. Proc., Vol. 3194, no. 1, 050025. 10.1063/5.0222691 (2024).
- 11. Suwannaphong, T., Jovan, F., Craddock, I. & McConville, R. Optimising TinyML with quantization and distillation of transformer and mamba models for indoor localisation on edge devices. Sci. Rep. 15(1), 10081. 10.1038/s41598-025-94205-9 (2025).
- 12. Paula, E., Soni, J. S., Upadhyay, H. & Lagos, L. Comparative analysis of model compression techniques for achieving carbon efficient AI. Sci. Rep. 15(1), 23461. 10.1038/s41598-025-07821-w (2025).
- 13. Surantha, N. et al. Key considerations for real-time object recognition on edge computing devices. Appl. Sci. 15(13), 7533. 10.3390/app15137533 (2025).
- 14. Wang, X. et al. Empowering edge intelligence: A comprehensive survey on on-device AI models. ACM Comput. Surv. 57(9), 1–39. 10.1145/3724420 (2025).
- 15. Ren, Z. et al. Near-sensor edge computing system enabled by a CMOS compatible photonic integrated circuit platform using bilayer AlN/Si waveguides. Nano-Micro Lett. 17(1), 261. 10.1007/s40820-025-01743-y (2025).
- 16. Papa, L., Russo, P., Amerini, I. & Zhou, L. A survey on efficient vision transformers: Algorithms, techniques, and performance benchmarking. IEEE Trans. Pattern Anal. Mach. Intell. 46(12), 7682–7700. 10.1109/TPAMI.2024.3392941 (2024).
- 17. Gou, J. et al. Reciprocal teacher-student learning via forward and feedback knowledge distillation. IEEE Trans. Multimedia 26, 7901–7916. 10.1109/TMM.2024.3372833 (2024).
- 18. Singh, N., Rupchandani, J. & Adhikari, M. Personalized federated learning for heterogeneous edge device: Self-knowledge distillation approach. IEEE Trans. Consum. Electron. 70(1), 4625–4632. 10.1109/TCE.2023.3327757 (2023).
- 19. Xu, L., Ren, J., Huang, Z., Zheng, W. & Chen, Y. Improving knowledge distillation via head and tail categories. IEEE Trans. Circuits Syst. Video Technol. 34(5), 3465–3480. 10.1109/TCSVT.2023.3325814 (2023).
- 20. Yao, D. et al. FedGKD: Toward heterogeneous federated learning via global knowledge distillation. IEEE Trans. Comput. 73(1), 3–17. 10.1109/TC.2023.3315066 (2023).
- 21. Wu, A., Yu, J., Wang, Y. & Deng, C. Prototype-decomposed knowledge distillation for learning generalized federated representation. IEEE Trans. Multimedia. 10.1109/TMM.2024.3428352 (2024).
- 22. Dan et al. SA-SNN: Spiking attention neural network. PeerJ Comput. Sci. 10.7717/peerj-cs.2549 (2024).
- 23. Zhang, Q., Wei, X., Wang, Y. & Hou, C. Convolutional neural network with attention mechanism and visual vibration signal analysis for bearing fault diagnosis. Sensors 24(6), 1831. 10.3390/s24061831 (2024).
- 24. Cheng, L. Attention mechanism models for precision medicine. Brief. Bioinform. 10.1093/bib/bbae156 (2024).
- 25. Song et al. Efficient knowledge distillation for hybrid models. IET Cyber-Syst. Robot. 10.1049/csy2.12120 (2024).
- 26. Arif, M. & Rashid, M. A literature review on model conversion, inference, and learning strategies in EdgeML with TinyML deployment. Comput. Mater. Contin. 10.32604/cmc.2025.062819 (2025).
- 27. Shah, S. A. B., Rashid, M. & Arif, M. Estimating WCET using prediction models to compute fitness function of a genetic algorithm. Real Time Syst. 56(1), 28–63. 10.1007/s11241-020-09343-2 (2020).
- 28. Rashid, M., Shah, S. A. B., Arif, M. & Kashif, M. Determination of worst-case data using an adaptive surrogate model for real-time system. J. Circuits Syst. Comput. 29(1), 2050005. 10.1142/S021812662050005X (2020).
- 29. Tao, H., Zhang, Z., Jiang, B. & Luo, B. Learning efficient linear graph transformer via graph-attention distillation. Mach. Intell. Res. 10.1007/s11633-025-1541-9 (2025).
- 30. Banu, S. & Deivalakshmi, S. Enhancing leaf area segmentation using attention gates. J. Telecommun. Inf. Technol. 101(3), 51–62. 10.26636/jtit.2025.3.2079 (2025).
- 31. Wang, D. & Wang, B. Transformer-guided serial knowledge distillation for high-precision anomaly detection. IEEE Access. 10.1109/ACCESS.2025.3584892 (2025).
- 32. Wang, W. et al. Optimizing age of information in vehicular edge computing with federated graph neural network multi-agent reinforcement learning. 10.48550/arXiv.2407.02342 (2024).
- 33. He, J., Ji, J. & Lei, M. Spatio-temporal transformer network with physical knowledge distillation for weather forecasting. In Proc. 33rd ACM Int. Conf. Information and Knowledge Management (CIKM), 819–828. 10.1145/3627673.3679841 (2024).