Abstract
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture the attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to more distant preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
1. Introduction
Multimodal large language models (MLLMs) [1, 3, 9, 11, 44, 45, 89, 97] have become essential tools in addressing numerous vision tasks and performing complex visual question-answering due to their superior capabilities in content comprehension [32] and generation [15]. Despite their remarkable versatility, MLLMs often suffer from hallucinations. Specifically, MLLMs frequently generate convincing text responses that contradict the visual content of an image, describing elements not present in the image. Hallucinations can be categorized into two types: initial hallucinations and snowball hallucinations, as illustrated in Fig. 1. Initial hallucinations (e.g., bridge) stem from insufficient information within the model, while snowball hallucinations (e.g., handrails) occur when the model maintains consistency with previous hallucinations.
Figure 1. Illustration of the phenomenon of snowball hallucinations as an extension of initial hallucinations. MLLMs produce hallucinations by asserting nonexistent objects (e.g., bridge) within the image, followed by further explanatory errors (e.g., handrails). This progression from initial to snowball hallucinations reveals the model’s tendency to build upon its own erroneous assumptions.
The key to mitigating hallucinations lies in extracting contextual information from the token interaction process. Recent studies focus on external knowledge retrieval [2, 62] and robust instruction fine-tuning [61, 73, 83], but these methods often incur substantial additional costs. Conversely, other approaches focus on training-free decoding strategies such as contrastive decoding [25, 31, 34, 70] and self-calibrating attention [24, 48, 53, 74]. These approaches aim to enhance the accuracy and consistency of generated responses by reducing excessive reliance on linguistic priors in the token interaction process. Though previous works have shown effectiveness, they lack analysis of the interaction process between multimodal tokens and the causes of hallucinations. For example, Fig. 2 illustrates a high proportion of snowball hallucinations, particularly in video captioning. Interestingly, these methods have not been effective in reducing the proportion of snowball hallucinations. In this study, we hypothesize that insufficient interaction between tokens may result in over-reliance on outlier tokens, thereby neglecting dense and informative contextual cues. In this work, we argue that intervening effectively in the token interaction process enhances in-context inference.
Figure 2. Percentage of initial hallucination (IH) and percentage of snowball hallucination (SH) (calculated over the entire datasets) for LLaVA-1.5-7B [44], Video-LLaVA-7B [40] and EDVT [52].
To delve deeper into this phenomenon, we analyze the attention maps during decoding and identify two issues contributing to hallucinations. (i) Attention Collapse in MLLMs: As illustrated in Fig. 3 (a), we observe that the model tends to allocate disproportionate attention to tokens with limited informational content. These low-information yet high-attention outlier tokens, such as visual backgrounds and textual symbols, disrupt the effective propagation of relevant information. This issue arises because the softmax attention mechanism requires all attention scores to be non-zero and sum to one, causing even low-information or non-priority tokens to receive disproportionate attention. Attention collapse, akin to the findings in Opera [24] on the “summary token”, causes a gradual attenuation of vision and text information transmission as the generated text extends. (ii) Positional Information Decay: As illustrated in Fig. 3 (b), we observe a progressive decline in attention to dense vision information throughout the generation process. This occurs due to the rotary position encoding (RoPE) [63], whose long-term decay fails to provide adequate positional information to ensure sufficient interaction between vision and text tokens. As the relative distance increases, the flow of vision token information gradually diminishes, leading to potential hallucinations. Therefore, our findings indicate that maintaining balanced information propagation and refining positional encoding can mitigate attention collapse and positional information decay, both of which contribute to hallucinations.
Figure 3. (a): Attention Collapse in MLLMs: Outlier tokens from different modalities are assigned disproportionately high attention scores, hindering interaction between relevant tokens. (b): Positional Information Decay: As text generation progresses, attention to visual information gradually diminishes. (c): Our FarSight, as a plug-in, mitigates these issues by effectively reducing attention interference from outlier tokens and improving response accuracy.
In this work, we propose FarSight, a versatile plug-and-play decoding strategy that reduces attention interference from outlier tokens by optimizing the causal mask. Specifically, we initialize a set of attention registers within the upper triangular matrix of the causal mask to capture attention diverted to outlier tokens. These attention registers retain the causal decoding properties, ensuring that information from future tokens is not accessed prematurely. Additionally, we design a dynamic register-attention distribution mechanism that explicitly optimizes attention allocation at each decoding step for robust in-context inference.
The core of our method is to optimize the effective propagation of tokens. We modulate attention distribution for tokens with multimodal informational content to improve token propagation. Furthermore, the relative positional limitations of RoPE encoding lead to insufficient transmission of vision-to-text token information during contextual interactions, which undermines positional awareness. Therefore, we introduce a progressively diminishing masking rate within the causal mask to encode absolute positional information, allowing the model to attend to more distant preceding tokens, especially for video sequence tasks.
With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness. Our contributions are as follows:
We analyze the self-attention token propagation patterns, revealing two main causes of hallucinations in MLLMs: attention collapse and positional information decay.
We propose FarSight, a plug-and-play decoding strategy that effectively mitigates hallucinations stemming from these issues by merely adjusting the causal mask.
Extensive evaluations on both image and video tasks demonstrate the superior performance of FarSight, offering an effective solution for mitigating hallucinations.
2. Related Work
Hallucinations in MLLMs.
Leveraging open-source large language models like LLaMA [66, 67] and Vicuna [6], MLLMs [4, 12, 23, 35, 38, 57, 65, 77, 81, 87, 90] can understand and generate a wide range of content more effectively by combining information from multiple modalities, such as text, images, and audio. Hallucination in MLLMs [16, 17, 21, 27, 42, 46, 47, 60, 64, 91] refers to the generation of text that is misaligned with the content of the provided images. Hallucination may originate from reliance on model priors [7, 19, 26, 33, 39, 43, 86, 93], limited knowledge comprehension [21, 42, 56, 80, 96], or an inability to effectively contextualize the given input [8, 22, 36, 48, 49, 69, 76]. According to their causes, hallucinations can be classified into two types: initial hallucinations [58, 71] occur due to the model lacking necessary information; snowball hallucinations [88, 94] arise when the model generates a series of hallucinations to maintain consistency with previous ones, even when the required knowledge is available. In this paper, we primarily conduct experiments and analyses on image and video benchmarks.
Hallucination Mitigation for MLLMs.
Researchers have proposed various strategies, from data optimization to model adjustments, to improve the accuracy and consistency of generated content. To mitigate hallucination, solutions include robust instruction tuning [28, 82, 83, 85], post-hoc processing using auxiliary analysis networks [14, 72, 79, 92], and various decoding strategies [13, 24, 31, 34, 70, 98]. Recent studies have focused on outlier tokens, which cause generated text to emphasize summary information aggregated at these tokens rather than dense and rich contextual cues. Additionally, some studies [53, 74, 78] have found that RoPE positional encoding is insufficient to support information propagation between multimodal tokens in contextual reasoning. This paper proposes an optimized causal masking approach to extract sufficient contextual information during token interactions, effectively mitigating hallucination without additional training, data, or inference time.
3. Preliminary and Motivation
3.1. Paradigm of MLLMs Generation
Vision and Language Inputs.
The inputs of MLLMs consist of both images and text. The raw images are fed to a visual encoder, and a cross-modal projection module then maps the vision information into the LLM's input space, denoted as vision tokens $X_v = \{x_v^1, \dots, x_v^{N_v}\}$, where $N_v$ is the length of the vision tokens. Similarly, the text is processed by the tokenizer and embedding modules and denoted as text tokens $X_t = \{x_t^1, \dots, x_t^{N_t}\}$, where $N_t$ is the length of the text tokens. The vision and text tokens are then concatenated as the final input, denoted as $X = \{X_v, X_t\}$, where the total length is $N = N_v + N_t$.
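To make the input construction concrete, the following is a minimal sketch; the shapes, the projector, and the embedding dimensions are illustrative assumptions rather than the configuration of any specific MLLM:

```python
import torch

# Hypothetical sizes: N_v projected vision tokens and N_t text tokens are
# concatenated into one input sequence for the LLM backbone.
N_v, N_t, d_vis, d_llm = 9, 5, 1024, 4096
projector = torch.nn.Linear(d_vis, d_llm)      # cross-modal projection module
X_v = projector(torch.randn(N_v, d_vis))       # vision tokens   (N_v, d_llm)
X_t = torch.randn(N_t, d_llm)                  # text embeddings (N_t, d_llm)
X = torch.cat([X_v, X_t], dim=0)               # final input, N = N_v + N_t
print(X.shape)                                 # torch.Size([14, 4096])
```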
MLLMs Forward.
The backbone networks of MLLMs are pre-trained LLMs (e.g., Vicuna [6] and LLaMA 2 [66]), parameterized by $\theta$, that auto-regressively generate responses. Given a multimodal input sequence $X$, the model maps the hidden states to a logit distribution over the vocabulary set $\mathcal{V}$ and predicts the next token $y_t$ at time step $t$:

$$p_\theta(y_t \mid X, y_{<t}) = \mathrm{softmax}\big(f_\theta(X, y_{<t})\big), \quad y_t \in \mathcal{V}, \tag{1}$$

where $y_{<t} = \{y_1, \dots, y_{t-1}\}$ denotes all previously generated tokens.
Next Token Decoding.
After obtaining the next-token probability $p_\theta(y_t \mid X, y_{<t})$, different decoding strategies [8, 18, 24] are proposed to predict the next token. The decoded token is appended to the end of the original input sequence for the next round of generation, until generation terminates.
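As a concrete illustration of this loop, the sketch below implements greedy next-token decoding against a toy stand-in for the model's forward pass; `step_logits_fn`, the vocabulary size, and the EOS id are hypothetical:

```python
import torch

def greedy_decode(step_logits_fn, input_ids, eos_id, max_new_tokens=32):
    """Pick the arg-max token at every step and append it to the sequence
    until EOS is produced or the token budget is exhausted."""
    seq = input_ids.clone()
    for _ in range(max_new_tokens):
        logits = step_logits_fn(seq)              # logits over the vocabulary
        next_id = int(torch.argmax(logits))       # greedy choice
        seq = torch.cat([seq, torch.tensor([next_id])])
        if next_id == eos_id:
            break
    return seq

# Toy stand-in for the forward pass of Eq. 1 (a real MLLM would condition on
# the concatenated vision + text tokens, not just the last token).
torch.manual_seed(0)
W = torch.randn(10, 10)                           # 10-token toy vocabulary
print(greedy_decode(lambda seq: W[seq[-1]], torch.tensor([3]), eos_id=0))
```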
3.2. What Causes Hallucinations
Attention Collapse in MLLMs.
We investigate the self-attention in the transformer block [68] of the auto-regressive decoder and leverage a column-wise product to calculate metric values. Denote the current generated sequence as $\{x_1, \dots, x_n\}$ and their causal self-attention weights as $A \in \mathbb{R}^{n \times n}$, applied to the next token prediction. The weights can be obtained from the softmax function as follows:

$$A = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right), \qquad O = A V, \tag{2}$$

where $Q, K, V \in \mathbb{R}^{n \times d}$ are the Query, Key, and Value matrices, $n$ and $d$ are the sequence length and the hidden dimension, $M$ is the causal mask, and $O$ is the output. The causal mask ensures that the model does not attend to future tokens, preserving causality in the sequence. The attention weights are structured as a lower-triangular, row-stochastic matrix:

$$A = \begin{pmatrix} a_{1,1} & 0 & \cdots & 0 \\ a_{2,1} & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,n} \end{pmatrix}, \quad \sum_{j \le i} a_{i,j} = 1. \tag{3}$$
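A minimal sketch of the causal attention in Eqs. 2-3 (the dimensions are arbitrary); the final print highlights the property underlying Proposition 3.1: every row of $A$ sums to one, so even low-information tokens are forced to receive attention mass:

```python
import torch

def causal_attention(Q, K, V):
    """Scaled dot-product attention with the standard causal mask (Eq. 2)."""
    n, d = Q.shape
    scores = Q @ K.T / d ** 0.5                                   # (n, n) raw scores
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))              # hide future tokens
    A = torch.softmax(scores, dim=-1)                             # lower-triangular weights (Eq. 3)
    return A @ V, A

torch.manual_seed(0)
Q = K = V = torch.randn(5, 8)
_, A = causal_attention(Q, K, V)
print(A.sum(dim=-1))  # all ones: the mass must be spread over the visible tokens
```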
Proposition 3.1 (Attention Collapse in MLLMs). Let inputs be sampled from a data distribution $\mathcal{D}$ and processed by a contextual, layer-wise decoder with multiple attention layers. Define the disproportionality in an attention layer as measured by the total probability of prefixes, where the attention collapse after applying softmax for a token $x_i$ in the $l$-th layer satisfies:
| (4) |
Here, $I(x; y)$ denotes the mutual information between two variables $x$ and $y$, indicating the amount of shared information between them. $I(x_i; x_{<i})$ represents that the token $x_i$ is informationally dependent on the preceding sequence $x_{<i}$, quantifying how much information about $x_i$ is contained within $x_{<i}$.
Remark:
Proposition 3.1 indicates that Attention Collapse refers to the phenomenon where the attention weights for certain tokens far exceed the informational contribution of those tokens. This often occurs with semantically irrelevant tokens, such as non-functional words (i.e., punctuation marks) and background vision tokens. As a result, the focus of the model diffuses across these irrelevant tokens, increasing perplexity during length extrapolation and hindering interaction among semantic tokens, as illustrated in Fig. 3 (a).
Positional Information Decay.
The vanilla attention model lacks positional awareness, as it does not encode the relative distance between tokens. In contrast, RoPE [63] addresses this by encoding the positional data of tokens using a rotation matrix, which inherently includes an explicit relative position dependency. Within each attention layer, RoPE is applied across all projected query and key inputs to compute the attention weights by leveraging the relative distance between tokens. Consequently, the attention with relative position embedding is expressed as:

$$a_{i,j} \propto \exp\!\left(\frac{(R_i q_i)^{\top}(R_j k_j)}{\sqrt{d}}\right) = \exp\!\left(\frac{q_i^{\top} R_{j-i}\, k_j}{\sqrt{d}}\right), \tag{5}$$

where $R_i$ and $R_j$ denote the rotary position embedding matrices applied to the query $q_i$ and key $k_j$, and $R_{j-i} = R_i^{\top} R_j$ stands for the relative position between $x_i$ and $x_j$. The long-term decay refers to the decrease of $a_{i,j}$ as the relative distance $|i - j|$ increases.
Remark:
RoPE integrates relative position data by multiplying rotation matrices rather than appending positional embeddings to the input. The relative proximity between two tokens effectively determines their influence, as closer tokens should impact each other more than distant ones. However, using the same attention mechanism for both vision and text tokens leads MLLMs to generate text that is insufficiently grounded in the visual content, as illustrated in Fig. 3 (b). Consequently, we argue that RoPE’s long-term decay limits information propagation between multimodal tokens, which contributes to hallucination. In contrast, maintaining absolute positional focus in generated text could allow the model to achieve precise positional awareness and improve response accuracy.
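The sketch below illustrates the two RoPE properties used in this argument under the usual pairwise-rotation formulation (the base 10000 and the dimensions follow common practice and are assumptions): the rotated dot product depends only on the relative offset, and for a fixed content vector it shrinks on average as the offset grows.

```python
import torch

def rope(x, pos, base=10000.0):
    """Rotate the dimension pairs (x[i], x[i + d/2]) of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    cos, sin = torch.cos(pos * freqs), torch.sin(pos * freqs)
    x1, x2 = x[:half], x[half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

torch.manual_seed(0)
q, k = torch.randn(64), torch.randn(64)

# Relative-position dependency (Eq. 5): only the offset matters.
print(torch.allclose(rope(q, 100) @ rope(k, 97), rope(q, 3) @ rope(k, 0), atol=1e-3))

# Long-term decay: for identical content, the score shrinks on average with distance,
# which is what starves far-away vision tokens of attention late in generation.
for m in [0, 8, 64, 512]:
    print(m, round(float(rope(q, m) @ rope(q, 0)) / 64, 3))
```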
4. Methodology
Fig. 4 provides an overview of the proposed strategy, built upon the LLM decoding paradigm in Section 3.1. Attention registers, detailed in Section 4.1, are introduced to absorb outlier tokens’ attention scores, dynamically guiding the model toward contextually rich semantic information. Meanwhile, a progressively diminishing masking rate is introduced to capture absolute positional focus with rigorous theoretical justification, which is described in Section 4.2. For ease of comprehension, Algorithm 1 presents the pseudo-code of FarSight within a decoder layer.
Figure 4. The scheme of the proposed FarSight strategy, which integrates with the softmax operation, replacing the traditional causal mask. Specifically, the attention score matrix is cleared of attention values in the upper triangular part, then register-attention scores are added using the matrix $R$, followed by the softmax computation. $R$ has linear decay in the upper triangular part and zeros in the lower triangular part. After the softmax operation, the remaining attention probabilities in the upper triangular part are cleared to ensure the causal decoding property is preserved.
4.1. Upper Triangular Matrix as Attention Registers
To alleviate attention collapse issues, we propose the dedicated attention register to allocate excess attention scores. For each attention score matrix $S = QK^{\top}/\sqrt{d}$, we construct an upper triangular score matrix $R \in \mathbb{R}^{n \times n}$ as the attention register, defined as follows:

$$R_{i,j} = \begin{cases} r_{i,j}, & j > i, \\ 0, & j \le i, \end{cases} \tag{6}$$

where $r_{i,j}$ allocates register-attention scores in each row $i$ to handle excess attention values while maintaining zero values for positions up to $i$. To integrate $R$ with $S$, we adjust $S$ by adding the register-attention scores from $R$ as follows:

$$\tilde{S} = S \odot L + R, \tag{7}$$

where $L$ denotes a lower-triangular matrix filled with ones to ensure causal masking by allowing attention only to preceding or current tokens, as illustrated in Fig. 4.
Since the model is training-free, the attention registers should not interfere with the original attention score distribution during inference and should align with the relative positional encoding in Eq. 5 to maintain coherence in generated text. For $j > i$, the values $r_{i,j}$ are defined as:

$$r_{i,j} = -\gamma\,(j - i), \quad j > i, \tag{8}$$

where $\gamma$ is a decay rate hyperparameter. This setup ensures that $R$ conforms to the gradual attenuation pattern in attention. Thus, the final attention score matrix with FarSight is defined as:

$$\tilde{S}_{i,j} = \begin{cases} S_{i,j}, & j \le i, \\ -\gamma\,(j - i), & j > i, \end{cases}$$
where $S_{i,j}$ denotes the original attention score at position $(i, j)$, with $S_{i,i}$ capturing the self-attention along the diagonal, and the decay factor $-\gamma(j - i)$ is applied for future tokens $j > i$, which enforces causal masking. The standard causal mask operation in Eq. 2 is then modified as:

$$O = \Big[\mathrm{softmax}\big(S \odot L + R\big) \odot L\Big] V. \tag{9}$$
Remark:
The term $S \odot L + R$ within the softmax function incorporates register-attention scores by masking the attention matrix, while the mask $L$ outside the softmax ensures that any masked scores are reset to zero. This design enables FarSight to retain causal decoding properties, ensuring that information from future tokens is not accessed prematurely. The register-attention matrix effectively captures and buffers excess attention by providing dedicated slots for surplus values, ensuring that the main attention mechanism remains focused on relevant tokens without being distracted by irrelevant or future positions.
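A minimal sketch of the register mechanism in Eqs. 6-9, assuming the linear-decay register values of Eq. 8 with a hypothetical decay rate `gamma`; it illustrates the mechanism rather than reproducing the exact implementation:

```python
import torch

def farsight_attention(Q, K, V, gamma=0.05):
    """Causal attention with attention registers in the upper triangle (Eq. 9)."""
    n, d = Q.shape
    S = Q @ K.T / d ** 0.5                       # raw attention scores
    L = torch.tril(torch.ones(n, n))             # ones on and below the diagonal
    idx = torch.arange(n)
    dist = (idx[None, :] - idx[:, None]).float()                   # j - i
    R = torch.where(dist > 0, -gamma * dist, torch.zeros(n, n))    # Eqs. 6 and 8
    A = torch.softmax(S * L + R, dim=-1)         # Eq. 7: registers soak up surplus mass
    A = A * L                                    # Eq. 9: clear register slots after softmax
    return A @ V, A

torch.manual_seed(0)
Q = K = V = torch.randn(6, 16)
_, A = farsight_attention(Q, K, V)
print(A.sum(dim=-1))  # below 1 wherever registers exist: the surplus was absorbed
```

Note that the registers only reshape where the probability mass inside the softmax goes; the post-softmax mask guarantees that no value from a future position ever reaches the output.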
4.2. Positional Awareness Encoding
The core idea of absolute position encoding is to modify the attention matrix so that the sum of the actual attention scores (located in the lower triangular part of the attention matrix) is no longer constrained to equal 1, as illustrated in Fig. 4. Specifically, we introduce a progressively diminishing masking rate in the causal mask, allowing attention distributions to vary across positions and thereby effectively incorporating absolute positional information. Let $\{S_{i,j}\}_{j \le i}$ denote the actual attention scores of the $i$-th row and $\{r_{i,j}\}_{j > i}$ the register-attention scores in the same row. The overall attention distribution for the $i$-th row is given by:

$$\hat{a}_{i,j} = \frac{\exp(S_{i,j})}{\sum_{k \le i}\exp(S_{i,k}) + \sum_{k > i}\exp(r_{i,k})}, \;\; j \le i, \qquad \hat{r}_{i,j} = \frac{\exp(r_{i,j})}{\sum_{k \le i}\exp(S_{i,k}) + \sum_{k > i}\exp(r_{i,k})}, \;\; j > i,$$

where $\hat{a}_{i,j}$ denotes the normalized attention score on position $j$ in the $i$-th row, while $\hat{r}_{i,j}$ denotes the normalized register-attention score for position $j > i$ in the same row.
The model encodes positional information even for a sequence of identical input tokens, $x_1 = x_2 = \cdots = x_n$, by leveraging both attention score accumulation and decay. Specifically, the actual attention scores are uniform within each row, so the cumulative sum of their exponentiated values progressively increases with the row index $i$. This cumulative increase emphasizes information before the current position, contributing to the encoding of absolute positional information. Simultaneously, the cumulative sum of the exponentiated register-attention scores decreases as $i$ increases due to the decay applied in $R$. This decay constrains attention on content after the current position, ensuring that attention primarily emphasizes preceding information. Consequently, we obtain:

$$\sum_{j \le i} \hat{a}_{i,j} \;<\; \sum_{j \le i+1} \hat{a}_{i+1,j},$$

indicating that, after applying Eq. 9, the accumulated attention over valid tokens exhibits a monotonically increasing trend with respect to the row index $i$, progressively encoding the absolute positional context. This progressive allocation allows the model to maintain an ordered information flow across positions, where tokens at later positions aggregate increasingly more historical context from preceding tokens. As $i$ grows, the model sharpens its focus on earlier tokens, reinforcing long-range dependencies and enhancing positional awareness in the generated sequence.
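A small numerical check of this argument, with identical tokens, a uniform raw score, and a hypothetical decay rate (values chosen only for illustration): the FarSight row mass over valid tokens grows monotonically with the row index, whereas a plain causal softmax always yields exactly 1 and therefore carries no absolute positional signal.

```python
import math

n, s, gamma = 6, 2.0, 0.05            # identical tokens, uniform score s, assumed decay rate
for i in range(n):
    valid = (i + 1) * math.exp(s)                                   # sum_{j <= i} e^{s}
    registers = sum(math.exp(-gamma * (j - i)) for j in range(i + 1, n))
    print(i, round(valid / (valid + registers), 4), 1.0)            # FarSight vs. vanilla mass
```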
5. Experiments
5.1. Experimental Setup
Baseline.
We select six representative MLLMs to evaluate performance across image and video tasks, including InstructBLIP [10], LLaVA-1.5 [44], VILA [41], Video-LLaMA2 [5], Chat-UniVi [29], and Video-LLaVA [40]. InstructBLIP and LLaVA-1.5 primarily focus on image tasks, while VILA and Video-LLaMA2 specialize in video tasks. Chat-UniVi and Video-LLaVA are capable of processing both image and video data, allowing for a comprehensive evaluation across both modalities. More detailed descriptions are provided in Appendix A.
Evaluation Benchmarks.
We conduct evaluations on both image and video benchmarks. For image benchmarks, we assess three categories: (1) Comprehensive benchmarks (MMBench [50], LLaVAW [45], MM-Vet [84]); (2) General VQA benchmarks (VizWiz [20], SQA [51]); (3) Hallucination benchmarks (POPE [37], CHAIR [59]). For video, we evaluate three zero-shot video understanding datasets: MSRVTT-QA [30], MSVD-QA [75], and ActivityNet-QA [95], along with the Video-Based Text Generation Benchmark for quantitative analysis [54].
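For reference, a hedged sketch of the CHAIR metrics [59] used below; the object extraction from captions is assumed to have been done already, and the helper name and toy data are ours:

```python
def chair_scores(caption_objects, gt_objects):
    """CHAIR_I: fraction of mentioned object instances absent from the image.
    CHAIR_S: fraction of captions containing at least one hallucinated object."""
    mentions = hallucinated = bad_captions = 0
    for img, objs in caption_objects.items():
        missing = [o for o in objs if o not in gt_objects[img]]
        mentions += len(objs)
        hallucinated += len(missing)
        bad_captions += bool(missing)
    return bad_captions / len(caption_objects), hallucinated / max(mentions, 1)

# Toy example: one of two captions mentions a "bridge" that is not in the image.
print(chair_scores({"a": ["river", "bridge"], "b": ["dog"]},
                   {"a": {"river", "tree"}, "b": {"dog"}}))   # (0.5, 0.333...)
```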
Implementation Details.
FarSight supports Greedy, Sampling, and Beam Search decoding strategies, with Greedy decoding used for illustration. Details of the other methods are in Appendix E. For the decay factor, we set the sequence length to 256 and define the decay rate $\gamma$ in Eq. 8 with respect to the typical maximum token limit of 1024. Extensive experiments confirm that this setting ensures stable and consistent generation.
5.2. Ablation Study
Effect of Attention Registers.
We experiment with various register-attention values to assess their impact on attention register performance. As shown in Fig. 5, our FarSight method improves CHAIR_S by +6.4% and +5.4% for LLaVA-1.5 and Video-LLaVA, respectively, significantly outperforming other attention values. In contrast, the standard causal masking with $-\infty$ restricts attention allocation in the upper triangular matrix, leading to instability in long-distance dependencies and reduced accuracy. Zero-padding fails to absorb excess attention effectively, increasing the risk of hallucinations during text generation. Although a fixed value of $10^{-3}$ introduces moderate attention absorption, which prevents excessive focus on irrelevant tokens, it still underperforms compared to our method.
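For clarity, the upper-triangular fill variants compared in Fig. 5 can be sketched as follows; the decay rate and the matrix size are illustrative assumptions:

```python
import torch

def upper_fill(n, mode, gamma=0.05):
    """Upper-triangular values compared in the ablation: the standard causal
    mask (-inf), zero-padding, a fixed small constant, or the linearly
    decaying attention register of Eq. 8."""
    idx = torch.arange(n)
    dist = (idx[None, :] - idx[:, None]).float()      # j - i
    if mode == "decay":
        fill = -gamma * dist
    else:
        fill = torch.full((n, n), {"-inf": float("-inf"), "zero": 0.0, "const": 1e-3}[mode])
    return torch.where(dist > 0, fill, torch.zeros(n, n))

print(upper_fill(4, "decay"))       # registers decay with the distance j - i
print(upper_fill(4, "-inf"))        # standard causal masking
```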
Figure 5. Comparison of different Upper Triangular Attention Values in Attention Registers. (a) and (b) show model performance with varying upper triangular attention values on the CHAIR and POPE-P datasets.
Effect of Positional Awareness Encoding.
We adopt various positional embedding strategies in the attention layer to assess their impact on the hallucination performance. As shown in Table 1, the baseline RoPE [63], FixVPE [52] and EDVT [52] strategies in LLaVA-1.5 and Video-LLaVA result in high hallucination rates. Specifically, RoPE introduces relative positional encoding between visual and text tokens, reducing attention to visual tokens during text generation. Although FixVPE’s fixed positional embeddings enhance the consistency of visual information, they are less effective than EDVT’s equidistant attention strategy. In contrast, our FarSight significantly improves CHAIR performance by using a progressively diminishing causal mask, retaining attention on earlier tokens (e.g., visual tokens).
Table 1.
Comparison of our Positional Awareness Encoding with other methods on the CHAIR [59] and POPE [37] datasets. RoPE: rotary positional embedding for both visual and text tokens, as used in the original MLLMs. FixVPE: fixed rotary embedding for visual tokens only. EDVT: rotary embedding for text tokens only.
| Method | CHAIR_S ↓ | CHAIR_I ↓ | POPE-R ↑ | POPE-P ↑ |
|---|---|---|---|---|
| LLaVA-1.5 (RoPE) | 48.0 | 13.9 | 87.0 | 82.8 |
| + FixVPE | 47.3 | 13.4 | 87.5 | 84.7 |
| + EDVT | 46.8 | 14.5 | 87.8 | 85.4 |
| + FarSight (Ours) | 41.6 (+6.4) | 13.2 (+0.7) | 90.5 (+3.5) | 86.1 (+3.3) |
| Video-LLaVA (RoPE) | 50.2 | 15.6 | 81.6 | 85.3 |
| + FixVPE | 48.5 | 14.9 | 81.9 | 85.2 |
| + EDVT | 46.8 | 13.7 | 82.5 | 84.7 |
| + FarSight (Ours) | 44.8 (+5.4) | 12.9 (+2.7) | 83.2 (+1.6) | 85.8 (+0.5) |
Effect of Decay Factor in Attention Registers.
We investigate the effect of query sequence lengths on attention decay, as shown in Fig. 6. In both the POPE-R and MSRVTT-QA datasets, MLLMs achieve peak accuracy at a sequence length of 256, with performance starting to decline as the sequence length continues to increase. This can be attributed to the decay factor, which is closely linked to the sequence length. Specifically, as defined in Section 5.1 (Implementation Details), the decay factor is influenced by the sequence length and directly affects the rate of attention decay. For shorter sequences, the decay factor rises rapidly, limiting the model’s ability to capture distant context. Conversely, for longer sequences, the decay factor may initially have a less pronounced effect, but as sequence length increases, attention distribution becomes diluted, increasing decay and information redundancy. A moderate sequence length (e.g., 256) effectively balances the decay factor, maintaining optimal focus on key information and preventing dispersion.
Figure 6. The impact of sequence length on attention decay and the performance of MLLMs on the POPE-R and MSRVTT-QA datasets integrated with our FarSight.
Quantitative Analysis.
We visualize the responses and performance of LLaVA-1.5 across different methods and scenarios. Fig. 7 (a) shows that FarSight achieves higher accuracy in identifying query-relevant key regions than Baseline and EDVT. This improvement results from its dynamic attention register, which reallocates attention to task-related visual information and reduces attention to irrelevant tokens. The long-term decay curves in Fig. 7 (b) show that FarSight maintains strong attention on image tokens in later generation stages, enabled by progressive positional encoding that balances attention between visual and textual tokens throughout the sequence. Fig. 7 (c) shows attention distribution under varying decay rates. As the decay rate increases, the model’s attention becomes progressively more concentrated, reaching optimal focus at a decay rate of 0.8. However, when the decay rate further increases to 0.9, attention starts to disperse. This indicates the importance of a moderate decay rate for balanced attention.
Figure 7. Qualitative Visualization of FarSight in Image Understanding Task on LLaVA-1.5. (a) Comparison of the average attention allocation to images during text generation among Base (Vanilla MLLMs), EDVT and our FarSight; (b) Visual attention decay across different methods within the generation of 60 text tokens; (c) FarSight’s attention distribution on images under varying decay rates. More detailed visualizations of images and videos are provided in Appendix F.
5.3. Comparison to State-of-the-Arts
GPT-4o Assisted Evaluation.
To comprehensively evaluate the overall quality of generated text, we employ the PPL (Perplexity) metric and utilize GPT-4o to assess the grammar, fluency, and naturalness of the text. We randomly select 600 images from the MSCOCO dataset and perform validation using LLaVA-1.5 and Video-LLaVA. As demonstrated in Fig. 8, FarSight consistently preserves the quality of the generated text across multiple dimensions.
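As a reminder of the metric, perplexity here is the exponential of the mean token-level negative log-likelihood under the scoring model; a generic sketch, with random logits and targets as stand-ins:

```python
import torch

def perplexity(logits, targets):
    """exp(mean negative log-likelihood) of `targets` under `logits`."""
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs[torch.arange(len(targets)), targets]
    return torch.exp(nll.mean())

torch.manual_seed(0)
print(perplexity(torch.randn(5, 100), torch.randint(0, 100, (5,))))
```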
Figure 8. The average performance is evaluated on a randomly selected set of 600 images from the MSCOCO dataset. PPL1 and PPL2 are calculated using GPT-3.5 Turbo, while the ratings for Grammar, Fluency and Naturalness are provided by GPT-4o.
Image Benchmarks Evaluation.
To evaluate image understanding, we compare models with the FarSight extension against several decoding methods, including ICD [70], VCD [34] and OPERA [24], as shown in Table 2. Integrating FarSight as a plugin into LLaVA-1.5 results in an average improvement of +2% on the Comprehensive and General VQA tasks. It also achieves significant gains in hallucination metrics, with CHAIR_S and POPE-P improving by +6.4% and +3.3%, respectively. These results indicate that FarSight is effective at reducing hallucinations in both structured and unstructured environments. Furthermore, the benefits of FarSight extend beyond the LLaVA-1.5 model, as other models also experience considerable enhancements, especially on the hallucination evaluation tasks, where their hallucination metrics improve consistently.
Table 2.
Comparison of different MLLMs and FarSight across all image benchmarks. Notably, in the Hallucination Benchmark, lower scores on CHAIR_S and CHAIR_I indicate better performance, while higher scores are preferable for other metrics.
| Method | Comprehensive Benchmark | | | General VQA | | Hallucination Benchmark | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | MMBench ↑ | LLaVAW ↑ | MM-Vet ↑ | VizWiz ↑ | SQA ↑ | CHAIR_S ↓ | CHAIR_I ↓ | POPE-R ↑ | POPE-P ↑ | POPE-A ↑ |
| LLaVA-1.5 | 64.3 | 72.5 | 30.5 | 48.5 | 64.5 | 48.0 | 13.9 | 87.0 | 82.8 | 76.6 |
| +ICD | 63.1 | 69.7 | 30.4 | 46.9 | 62.8 | 47.7 | 13.6 | 87.9 | 84.0 | 80.2 |
| +VCD | 63.9 | 70.9 | 29.5 | 43.4 | 63.3 | 46.8 | 13.2 | 87.0 | 83.5 | 78.1 |
| +OPERA | 64.4 | 72.0 | 31.4 | 50.0 | 64.9 | 45.2 | 12.7 | 88.8 | 82.8 | 79.2 |
| + FarSight (Ours) | 66.0 (+1.7) | 74.7 (+2.2) | 32.5 (+2.0) | 50.8 (+2.3) | 67.4 (+2.9) | 41.6 (+6.4) | 13.2 (+0.7) | 90.5 (+3.5) | 86.1 (+3.3) | 80.4 (+3.8) |
| InstructBLIP | 43.4 | 58.2 | 25.6 | 33.4 | 62.1 | 55.6 | 24.2 | 88.7 | 81.3 | 74.4 |
| + FarSight (Ours) | 46.5 (+3.1) | 61.0 (+2.8) | 27.8 (+2.2) | 36.0 (+2.6) | 63.4 (+1.3) | 51.8 (+3.8) | 23.0 (+1.2) | 89.5 (+0.8) | 85.8 (+4.5) | 76.7 (+2.3) |
| Video-LLaVA | 60.9 | 73.1 | 32.0 | 48.1 | 64.6 | 50.2 | 15.6 | 81.6 | 85.3 | 86.2 |
| + FarSight (Ours) | 62.8 (+1.9) | 74.5 (+1.4) | 32.8 (+0.8) | 50.3 (+2.2) | 66.2 (+1.6) | 44.8 (+5.4) | 12.9 (+2.7) | 83.2 (+1.6) | 85.8 (+0.5) | 87.1 (+0.9) |
| Chat-UniVi | 56.3 | 70.4 | 28.3 | 46.9 | 59.9 | 52.3 | 16.7 | 85.1 | 69.5 | 64.4 |
| + FarSight (Ours) | 59.8 (+3.5) | 72.6 (+2.2) | 30.7 (+2.4) | 48.2 (+1.3) | 62.4 (+2.5) | 48.9 (+3.4) | 15.2 (+1.5) | 87.4 (+2.3) | 69.7 (+0.2) | 65.3 (+0.9) |
Video Benchmarks Evaluation.
In Zero-Shot Video Question Answering Tasks, FarSight achieves significant improvements over video MLLMs across three key benchmark datasets. As shown in Table 3, on the MSRVTT-QA dataset, our method delivers an average accuracy gain of +3% across multiple models, reaching a peak accuracy of 68.9%. On MSVD-QA and ActivityNet-QA datasets, FarSight improves accuracy by +2% and +0.7%, respectively, demonstrating consistent enhancements across different video contexts and question types. Moreover, in Video-Based Text Generation, the integrated model outperforms the baseline MLLMs across five critical dimensions.
Table 3.
Comparison of different Video MLLMs and FarSight across all video benchmarks. In the Video-Based Text Generation Benchmark, five scores are assessed: Cr. (Correctness of Information), Cs. (Consistency), De. (Detail Orientation), Ct. (Contextual Understanding) and Te. (Temporal Understanding). Following Maaz et al. [55], we use the GPT-3.5 Turbo model to assign a relative score to the model outputs, with scores ranging from 0 to 5. See Appendix E for further details.
| Method | MSVD-QA | | ActivityNet-QA | | Video-Based Text Generation | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Accuracy ↑ | Score ↑ | Accuracy ↑ | Score ↑ | Cr. ↑ | Cs. ↑ | De. ↑ | Ct. ↑ | Te. ↑ |
| Chat-UniVi | 64.6 | 3.6 | 43.1 | 3.2 | 2.84 | 2.93 | 2.55 | 3.16 | 2.43 |
| + FarSight (Ours) | 66.4 (+1.8) | 3.5 | 43.7 (+0.6) | 3.2 | 2.86 | 2.94 | 2.56 | 3.19 | 2.48 |
| Video-LLaVA | 64.8 | 3.7 | 41.5 | 3.3 | 2.32 | 2.34 | 2.65 | 2.75 | 2.09 |
| + FarSight (Ours) | 66.2 (+1.4) | 3.6 | 42.0 (+0.5) | 3.5 | 2.43 | 2.38 | 2.93 | 2.84 | 2.14 |
| VILA | 72.6 | 4.0 | 50.2 | 3.3 | 3.14 | 3.40 | 2.71 | 3.43 | 2.58 |
| + FarSight (Ours) | 74.5 (+1.9) | 4.2 | 51.4 (+1.2) | 3.6 | 3.18 | 3.52 | 2.73 | 3.45 | 2.60 |
| Video-LLaMA2 | 70.9 | 3.8 | 49.9 | 3.3 | 3.13 | 3.23 | 2.70 | 3.42 | 2.45 |
| + FarSight (Ours) | 73.8 (+2.9) | 3.9 | 50.4 (+0.5) | 3.6 | 3.26 | 3.32 | 3.21 | 3.50 | 2.47 |
6. Conclusion
In this work, we analyze the self-attention token propagation patterns, revealing two main causes of hallucinations in MLLMs: attention collapse and positional information decay. To mitigate them, we present FarSight, a plug-and-play decoding strategy that reduces interference from outlier tokens and enhances in-context inference. The core of our method is effective token propagation, which is achieved by optimizing the causal mask with attention registers and a diminishing masking rate. Extensive experiments on both image and video tasks have shown that the proposed method outperforms existing state-of-the-art methods, and the ablation study has revealed the effectiveness of our FarSight.
References
- 1.Bai Jinze, Bai Shuai, Yang Shusheng, Wang Shijie, Tan Sinan, Wang Peng, Lin Junyang, Zhou Chang, and Zhou Jingren. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. [Google Scholar]
- 2.Caffagni Davide, Cocchi Federico, Moratelli Nicholas, Sarto Sara, Cornia Marcella, Baraldi Lorenzo, and Cucchiara Rita. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1818–1826, 2024. [Google Scholar]
- 3.Chen Keqin, Zhang Zhao, Zeng Weili, Zhang Richong, Zhu Feng, and Zhao Rui. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. [Google Scholar]
- 4.Chen Lin, Li Jisong, Dong Xiaoyi, Zhang Pan, He Conghui, Wang Jiaqi, Zhao Feng, and Lin Dahua. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. [Google Scholar]
- 5.Cheng Zesen, Leng Sicong, Zhang Hang, Xin Yifei, Li Xin, Chen Guanzheng, Zhu Yongxin, Zhang Wenqi, Luo Ziyang, Zhao Deli, and Bing Lidong. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. [Google Scholar]
- 6.Chiang Wei-Lin, Li Zhuohan, Lin Zi, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023. [Google Scholar]
- 7.Cho Jaemin, Yoon Seunghyun, Kale Ajinkya, Dernoncourt Franck, Bui Trung, and Bansal Mohit. Fine-grained image captioning with clip reward. arXiv preprint arXiv:2205.13115, 2022. [Google Scholar]
- 8.Chuang Yung-Sung, Xie Yujia, Luo Hongyin, Kim Yoon, Glass James, and He Pengcheng. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023. [Google Scholar]
- 9.Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony Meng Huat, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, and Hoi Steven. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. [Google Scholar]
- 10.Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony Meng Huat, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, and Hoi Steven. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. [Google Scholar]
- 11.Dong Xiaoyi, Zhang Pan, Zang Yuhang, Cao Yuhang, Wang Bin, Ouyang Linke, Wei Xilin, Zhang Songyang, Duan Haodong, Cao Maosong, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. [Google Scholar]
- 12.Elhoushi Mostafa, Shrivastava Akshat, Liskovich Diana, Hosmer Basil, Wasti Bram, Lai Liangzhen, Mahmoud Anas, Acun Bilge, Agarwal Saurabh, Roman Ahmed, et al. Layer skip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710, 2024. [Google Scholar]
- 13.Favero Alessandro, Zancato Luca, Trager Matthew, Choudhary Siddharth, Perera Pramuditha, Achille Alessandro, Swaminathan Ashwin, and Soatto Stefano. Multi-modal hallucination control by visual information grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14303–14312, 2024. [Google Scholar]
- 14.Feng Shangbin, Shi Weijia, and et al. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. arXiv preprint arXiv:2402.00367, 2024. [Google Scholar]
- 15.Geng Zigang, Yang Binxin, Hang Tiankai, Li Chen, Gu Shuyang, Zhang Ting, Bao Jianmin, Zhang Zheng, Li Houqiang, Hu Han, et al. Instructdiffusion: A generalist modeling interface for vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12709–12720, 2024. [Google Scholar]
- 16.Ghosh Akash, Acharya Arkadeep, Jha Prince, Saha Sriparna, Gaudgaul Aniket, Majumdar Rajdeep, Chadha Aman, Jain Raghav, Sinha Setu, and Agarwal Shivani. Medsumm: A multimodal approach to summarizing code-mixed hindi-english clinical queries. In European Conference on Information Retrieval, pages 106–120. Springer, 2024. [Google Scholar]
- 17.Ghosh Akash, Acharya Arkadeep, Saha Sriparna, Jain Vinija, and Chadha Aman. Exploring the frontier of vision-language models: A survey of current methodologies and future directions. arXiv preprint arXiv:2404.07214, 2024. [Google Scholar]
- 18.Graves Alex. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012. [Google Scholar]
- 19.Gunjal Anisha, Yin Jihan, and Bas Erhan. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18135–18143, 2024. [Google Scholar]
- 20.Gurari Danna, Li Qing, Stangl Abigale J, Guo Anhong, Lin Chi, Grauman Kristen, Luo Jiebo, and Bigham Jeffrey P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018. [Google Scholar]
- 21.Hu Hongyu, Zhang Jiyuan, Zhao Minyi, and Sun Zhenbang. Ciem: Contrastive instruction evaluation method for better instruction tuning. arXiv preprint arXiv:2309.02301, 2023. [Google Scholar]
- 22.Huang Lei, Yu Weijiang, Ma Weitao, Zhong Weihong, Feng Zhangyin, Wang Haotian, Chen Qianglong, Peng Weihua, Feng Xiaocheng, Qin Bing, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023. [Google Scholar]
- 23.Huang Qidong, Dong Xiaoyi, Chen Dongdong, Zhang Weiming, Wang Feifei, Hua Gang, and Yu Nenghai. Diversity-aware meta visual prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10878–10887, 2023. [Google Scholar]
- 24.Huang Qidong, Dong Xiaoyi, Zhang Pan, Wang Bin, He Conghui, Wang Jiaqi, Lin Dahua, Zhang Weiming, and Yu Nenghai. Opera: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024. 2 [Google Scholar]
- 25.Huo Fushuo, Xu Wenchao, Zhang Zhong, Wang Haozhao, Chen Zhicheng, and Zhao Peilin. Self-introspective decoding: Alleviating hallucinations for large vision-language models. arXiv preprint arXiv:2408.02032, 2024. [Google Scholar]
- 26.Jain Jitesh, Yang Jianwei, and Shi Humphrey. Vcoder: Versatile vision encoders for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27992–28002, 2024. [Google Scholar]
- 27.Ji Ziwei, Lee Nayeon, Frieske Rita, Yu Tiezheng, Su Dan, Xu Yan, Ishii Etsuko, Bang Ye Jin, Madotto Andrea, and Fung Pascale. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. [Google Scholar]
- 28.Jiang Chaoya, Xu Haiyang, Dong Mengfan, Chen Jiaxing, Ye Wei, Yan Ming, Ye Qinghao, Zhang Ji, Huang Fei, and Zhang Shikun. Hallucination augmented contrastive learning for multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27036–27046, 2024. [Google Scholar]
- 29.Jin Peng, Takanobu Ryuichi, Zhang Wancai, Cao Xiaochun, and Yuan Li. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024. [Google Scholar]
- 30.Kim Junghwan, Kim Michelle, and Mozafari Barzan. Provable memorization capacity of msrvtt-qa. In International Conference on Learning Representations, 2023. [Google Scholar]
- 31.Kim Taehyeon, Kim Joonkee, Lee Gihun, and Yun Se-Young. Instructive decoding: Instruction-tuned large language models are self-refiner from noisy instructions. In The Twelfth International Conference on Learning Representations, 2024. [Google Scholar]
- 32.Lai Xin, Tian Zhuotao, Chen Yukang, Li Yanwei, Yuan Yuhui, Liu Shu, and Jia Jiaya. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024. [Google Scholar]
- 33.Lee Nayeon, Ping Wei, and et al. Factuality enhanced language models for open-ended text generation. NeurIPS, 35: 34586–34599, 2022. [Google Scholar]
- 34.Leng Sicong, Zhang Hang, Chen Guanzheng, Li Xin, Lu Shijian, Miao Chunyan, and Bing Lidong. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024. 2 [Google Scholar]
- 35.Li Bo, Zhang Yuanhan, Guo Dong, Zhang Renrui, Li Feng, Zhang Hao, Zhang Kaichen, Li Yanwei, Liu Ziwei, and Li Chunyuan. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. [Google Scholar]
- 36.Li Xiang Lisa, Holtzman Ari, Fried Daniel, Liang Percy, Eisner Jason, Hashimoto Tatsunori, Zettlemoyer Luke, and Lewis Mike. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022. [Google Scholar]
- 37.Li Yifan, Du Yifan, Zhou Kun, Wang Jinpeng, Zhao Xin, and Wen Ji-Rong. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, 2023. Association for Computational Linguistics. [Google Scholar]
- 38.Li Yanwei, Zhang Yuechen, Wang Chengyao, Zhong Zhisheng, Chen Yixin, Chu Ruihang, Liu Shaoteng, and Jia Jiaya. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024. [Google Scholar]
- 39.Li Zhang, Yang Biao, Liu Qiang, Ma Zhiyin, Zhang Shuo, Yang Jingxu, Sun Yabo, Liu Yuliang, and Bai Xiang. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26763–26773, 2024. [Google Scholar]
- 40.Lin Bin, Ye Yang, Zhu Bin, Cui Jiaxi, Ning Munan, Jin Peng, and Yuan Li. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. [Google Scholar]
- 41.Lin Ji, Yin Hongxu, Ping Wei, Lu Yao, Molchanov Pavlo, Tao Andrew, Mao Huizi, Kautz Jan, Shoeybi Mohammad, and Han Song. Vila: On pre-training for visual language models, 2023. [Google Scholar]
- 42.Liu Fuxiao, Lin Kevin, Li Linjie, Wang Jianfeng, Yacoob Yaser, and Wang Lijuan. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. [Google Scholar]
- 43.Liu Fuxiao, Lin Kevin, Li Linjie, Wang Jianfeng, Yacoob Yaser, and Wang Lijuan. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, 2023. [Google Scholar]
- 44.Liu Haotian, Li Chunyuan, Li Yuheng, and Lee Yong Jae. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. [Google Scholar]
- 45.Liu Haotian, Li Chunyuan, Wu Qingyang, and Lee Yong Jae. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. [Google Scholar]
- 46.Liu Hanchao, Xue Wenyuan, Chen Yifei, Chen Dapeng, Zhao Xiutian, Wang Ke, Hou Liping, Li Rongjun, and Peng Wei. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024. [Google Scholar]
- 47.Liu Jiazhen, Fu Yuhan, Xie Ruobing, Xie Runquan, Sun Xingwu, Lian Fengzong, Kang Zhanhui, and Li Xirong. Phd: A prompted visual hallucination evaluation dataset. arXiv preprint arXiv:2403.11116, 2024. [Google Scholar]
- 48.Liu Shi, Zheng Kecheng, and Chen Wei. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. arXiv preprint arXiv:2407.21771, 2024. [Google Scholar]
- 49.Liu Yexin, Liang Zhengyang, Wang Yueze, He Muyang, Li Jian, and Zhao Bo. Seeing clearly, answering incorrectly: A multimodal robustness benchmark for evaluating mllms on leading questions. arXiv preprint arXiv:2406.10638, 2024. [Google Scholar]
- 50.Liu Yuan, Duan Haodong, Zhang Yuanhan, Li Bo, Zhang Songyang, Zhao Wangbo, Yuan Yike, Wang Jiaqi, He Conghui, Liu Ziwei, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025. [Google Scholar]
- 51.Lu Pan, Mishra Swaroop, Xia Tanglin, Qiu Liang, Chang Kai-Wei, Zhu Song-Chun, Tafjord Oyvind, Clark Peter, and Kalyan Ashwin. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022. [Google Scholar]
- 52.Ma Fan, Jin Xiaojie, Wang Heng, Xian Yuchen, Feng Jiashi, and Yang Yi. Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13151–13160, 2024. [Google Scholar]
- 53.Ma Fan, Jin Xiaojie, Wang Heng, Xian Yuchen, Feng Jiashi, and Yang Yi. Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13151–13160, 2024. [Google Scholar]
- 54.Maaz Muhammad, Rasheed Hanoona, Khan Salman, and Shahbaz Khan Fahad. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. [Google Scholar]
- 55.Maaz Muhammad, Rasheed Hanoona, Khan Salman, and Shahbaz Khan Fahad. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024. [Google Scholar]
- 56.Manakul Potsawee, Liusie Adian, and Gales Mark JF. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023. [Google Scholar]
- 57.McKinzie Brandon, Gan Zhe, Fauconnier Jean-Philippe, Dodge Sam, Zhang Bowen, Dufter Philipp, Shah Dhruti, Du Xianzhi, Peng Futang, Weers Floris, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024. [Google Scholar]
- 58.Ouyang Long, Wu Jeffrey, Jiang Xu, Almeida Diogo, Wainwright Carroll, Mishkin Pamela, Zhang Chong, Agarwal Sandhini, Slama Katarina, Ray Alex, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. [Google Scholar]
- 59.Rohrbach Anna, Hendricks Lisa Anne, Burns Kaylee, Darrell Trevor, and Saenko Kate. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, 2018. [Google Scholar]
- 60.Sahoo Pranab, Singh Ayush Kumar, Saha Sriparna, Chadha Aman, and Mondal Samrat. Enhancing adverse drug event detection with multimodal dataset: Corpus creation and model development. arXiv preprint arXiv:2405.15766, 2024. [Google Scholar]
- 61.Sarkar Pritam, Ebrahimi Sayna, Etemad Ali, Beirami Ahmad, Arık Sercan Ö, and Pfister Tomas. Mitigating object hallucination via data augmented contrastive tuning. arXiv preprint arXiv:2405.18654, 2024. [Google Scholar]
- 62.Shuster Kurt, Poff Spencer, Chen Moya, Kiela Douwe, and Weston Jason. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021. [Google Scholar]
- 63.Su Jianlin, Lu Yu, Pan Shengfeng, Murtadha Ahmed, Wen Bo, and Liu Yunfeng. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021. [Google Scholar]
- 64.Tang Feilong, Huang Zile, Liu Chengzhi, Sun Qiang, Yang Harry, and Lim Ser-Nam. Intervening anchor token: Decoding strategy in alleviating hallucinations for MLLMs. In The Thirteenth International Conference on Learning Representations, 2025. [Google Scholar]
- 65.Tong Shengbang, Brown Ellis, Wu Penghao, Woo Sanghyun, Middepogu Manoj, Akula Sai Charitha, Yang Jihan, Yang Shusheng, Iyer Adithya, Pan Xichen, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024. [Google Scholar]
- 66.Touvron Hugo, Martin Louis, Stone Kevin, Albert Peter, Almahairi Amjad, Babaei Yasmine, Bashlykov Nikolay, Batra Soumya, Bhargava Prajjwal, Bhosale Shruti, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. [Google Scholar]
- 67.Touvron Hugo, Martin Louis, Stone Kevin, Albert Peter, Almahairi Amjad, Babaei Yasmine, Bashlykov Nikolay, Batra Soumya, Bhargava Prajjwal, Bhosale Shruti, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. [Google Scholar]
- 68.Vaswani Ashish, Shazeer Noam, and et al. Attention is all you need. NeurIPS, 30, 2017. [Google Scholar]
- 69.Wang Junyang, Zhou Yiyang, Xu Guohai, Shi Pengcheng, Zhao Chenlin, Xu Haiyang, Ye Qinghao, Yan Ming, Zhang Ji, Zhu Jihua, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023. [Google Scholar]
- 70.Wang Xintong, Pan Jingheng, Ding Liang, and Biemann Chris. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715, 2024. [Google Scholar]
- 71.Wang Yizhong, Kordi Yeganeh, Mishra Swaroop, Liu Alisa, Smith Noah A, Khashabi Daniel, and Hajishirzi Hannaneh. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022. [Google Scholar]
- 72.Wu Junfei, Liu Qiang, and et al. Logical closed loop: Uncovering object hallucinations in large vision-language models. arXiv preprint arXiv:2402.11622, 2024. [Google Scholar]
- 73.Xiao Wenyi, Huang Ziwei, Gan Leilei, He Wanggui, Li Haoyuan, Yu Zhelun, Jiang Hao, Wu Fei, and Zhu Linchao. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv preprint arXiv:2404.14233, 2024. [Google Scholar]
- 74.Xing Yun, Li Yiheng, Laptev Ivan, and Lu Shijian. Mitigating object hallucination via concentric causal attention. arXiv preprint arXiv:2410.15926, 2024. [Google Scholar]
- 75.Xu D, Zhao Zhou, Xiao Jun, Wu Fei, Zhang Hanwang, He Xiangnan, and Zhuang Yueting. Video question answering via gradually refined attention over appearance and motion. Proceedings of the 25th ACM international conference on Multimedia, 2017. [Google Scholar]
- 76.Xue Haochen, Tang Feilong, Hu Ming, Liu Yexin, Huang Qidong, Li Yulong, Liu Chengzhi, Xu Zhongxing, Zhang Chong, Feng Chun-Mei, et al. Mmrc: A large-scale benchmark for understanding multimodal large language model in real-world conversation. arXiv preprint arXiv:2502.11903, 2025. [Google Scholar]
- 77.Ye Qinghao, Xu Haiyang, Ye Jiabo, Yan Ming, Hu Anwen, Liu Haowei, Qian Qi, Zhang Ji, and Huang Fei. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13040–13051, 2024. [Google Scholar]
- 78.Yin Qingyu, He Xuzheng, Zhuang Xiang, Zhao Yu, Yao Jianhua, Shen Xiaoyu, and Zhang Qiang. Stablemask: Refining causal masking in decoder-only transformer. arXiv preprint arXiv:2402.04779, 2024. [Google Scholar]
- 79.Yin Shukang, Fu Chaoyou, Zhao Sirui, Xu Tong, Wang Hao, Sui Dianbo, Shen Yunhang, Li Ke, Sun Xing, and Chen Enhong. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045, 2023. [Google Scholar]
- 80.You Haoxuan, Zhang Haotian, Gan Zhe, Du Xianzhi, Zhang Bowen, Wang Zirui, Cao Liangliang, Chang Shih-Fu, and Yang Yinfei. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023. [Google Scholar]
- 81.Young Alex, Chen Bei, Li Chao, Huang Chengen, Zhang Ge, Zhang Guanwei, Li Heng, Zhu Jiangcheng, Chen Jianqun, Chang Jing, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024. [Google Scholar]
- 82.Yu Qifan, Li Juncheng, Wei Longhui, Pang Liang, Ye Wentao, Qin Bosheng, Tang Siliang, Tian Qi, and Zhuang Yueting. Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12944–12953, 2024. [Google Scholar]
- 83.Yu Tianyu, Yao Yuan, Zhang Haoye, He Taiwen, Han Yifeng, Cui Ganqu, Hu Jinyi, Liu Zhiyuan, Zheng Hai-Tao, Sun Maosong, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024. [Google Scholar]
- 84.Yu Weihao, Yang Zhengyuan, Li Linjie, Wang Jianfeng, Lin Kevin, Liu Zicheng, Wang Xinchao, and Wang Lijuan. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. [Google Scholar]
- 85.Yue Zihao, Zhang Liang, and Jin Qin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. arXiv preprint arXiv:2402.14545, 2024. [Google Scholar]
- 86.Zhai Bohan, Yang Shijia, Zhao Xiangchen, Xu Chenfeng, Shen Sheng, Zhao Dongdi, Keutzer Kurt, Li Manling, Yan Tan, and Fan Xiangjun. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv preprint arXiv:2310.01779, 2023. [Google Scholar]
- 87.Zhang Haotian, Gao Mingfei, Gan Zhe, Dufter Philipp, Wenzel Nina, Huang Forrest, Shah Dhruti, Du Xianzhi, Zhang Bowen, Li Yanghao, et al. Mm1. 5: Methods, analysis & insights from multimodal llm fine-tuning. arXiv preprint arXiv:2409.20566, 2024. [Google Scholar]
- 88.Zhang Muru, Press Ofir, Merrill William, Liu Alisa, and Smith Noah A. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023. [Google Scholar]
- 89.Zhang Pan, Dong Xiaoyi, Wang Bin, Cao Yuhang, Xu Chao, Ouyang Linke, Zhao Zhiyuan, Ding Shuangrui, Zhang Songyang, Duan Haodong, Yan Hang, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023. [Google Scholar]
- 90.Zhang Pan, Dong Xiaoyi, Zang Yuhang, Cao Yuhang, Qian Rui, Chen Lin, Guo Qipeng, Duan Haodong, Wang Bin, Ouyang Linke, Zhang Songyang, Zhang Wenwei, Li Yining, Gao Yang, Sun Peng, Zhang Xinyue, Li Wei, Li Jingwen, Wang Wenhai, Yan Hang, He Conghui, Zhang Xingcheng, Chen Kai, Dai Jifeng, Qiao Yu, Lin Dahua, and Wang Jiaqi. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024. [Google Scholar]
- 91.Zhang Yue, Li Yafu, Cui Leyang, Cai Deng, Liu Lemao, Fu Tingchen, Huang Xinting, Zhao Enbo, Zhang Yu, Chen Yulong, et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023. [Google Scholar]
- 92.Zhang Youcai, Huang Xinyu, Ma Jinyu, Li Zhaoyang, Luo Zhaochuan, Xie Yanchun, Qin Yuzhuo, Luo Tong, Li Yaqian, Liu Shilong, et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1724–1732, 2024. [Google Scholar]
- 93.Zhao Zhiyuan, Wang Bin, Ouyang Linke, Dong Xiaoyi, Wang Jiaqi, and He Conghui. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023. [Google Scholar]
- 94.Zhong Weihong, Feng Xiaocheng, Zhao Liang, Li Qiming, Huang Lei, Gu Yuxuan, Ma Weitao, Xu Yuan, and Qin Bing. Investigating and mitigating the multimodal hallucination snowballing in large vision-language models. arXiv preprint arXiv:2407.00569, 2024. [Google Scholar]
- 95.Zhou Luowei, Xu Chenliang, and Corso Jason J.. Towards automatic learning of procedures from web instructional videos. In AAAI, 2017. [Google Scholar]
- 96.Zhou Yiyang, Cui Chenhang, Yoon Jaehong, Zhang Linjun, Deng Zhun, Finn Chelsea, Bansal Mohit, and Yao Huaxiu. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023. [Google Scholar]
- 97.Zhu Deyao, Chen Jun, Shen Xiaoqian, Li Xiang, and Elhoseiny Mohamed. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. [Google Scholar]
- 98.Zhu Lanyun, Ji Deyi, Chen Tianrun, Xu Peng, Ye Jieping, and Liu Jun. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. arXiv preprint arXiv:2402.18476, 2024. [Google Scholar]
