Abstract
Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VIDHALLUC, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VIDHALLUC assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VIDHALLUC, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VIDHALLUC benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
1. Introduction
The rapid advancement of multimodal large language models (MLLMs) [30, 33, 50] has led to significant improvements in video understanding, particularly in semantic reasoning and instruction-following [48]. Despite these improvements, MLLMs tend to generate hallucinations, producing plausible but factually incorrect information [24], raising concerns about their reliability in practical applications. To better understand and quantify these limitations, several benchmarks [17, 28, 54] and mitigation strategies [61, 67] have been developed to systematically evaluate hallucinations in MLLMs. These benchmarks curate fine-grained question-answer (QA) pairs designed to trigger hallucinations, often created using either GPT-4V [6, 59] or human annotators [38]. However, these benchmarks come with several limitations. First, the datasets are typically small in scale, containing fewer than 1,000 videos and 2,000 QA pairs. This limited scale stems from the time and cost involved in generating high-quality QA pairs, making the curation process inefficient. Second, the existing benchmarks focus primarily on hallucinations about static elements of videos [8, 35], such as objects [28, 45], their attributes [47, 53], and spatial relationships [18]. This limits their efficacy in evaluating dynamic and temporal content in videos. Third, the benchmarks are constructed with limited question types (e.g., binary QA only), constraining the assessment of a model's understanding of dynamic aspects inherent to video content, such as temporal scene transitions.
To address these gaps, we introduce VIDHALLUC, the largest benchmark for evaluating hallucinations in video understanding, comprising 5,002 videos and 9,295 QA pairs. To curate the benchmark, we build on prior work [51] to develop an automated data collection pipeline that generates potential hallucination video pairs based on semantic similarity and visual differences. Moreover, while previous benchmarks often focus on static tasks, VIDHALLUC specifically targets dynamic elements of videos that cause hallucinations, namely, action, temporal sequence, and scene transition hallucinations.
In parallel, efforts have been made to mitigate MLLM hallucinations in video understanding. These methods include constructing high-quality instruction-tuning datasets [32], enhancing input resolution [34], and leveraging reinforcement learning from human feedback (RLHF) [47, 62, 67, 69]. However, these data-centric optimization approaches still require fine-tuning, which incurs significant computational costs and necessitates creating additional datasets. As shown in Figure 1, we observe that current MLLMs are vulnerable to hallucinations when they encounter semantically similar videos. We attribute these hallucinations to the inductive bias inherent in the visual encoder (e.g., the CLIP series), which emphasizes the discrimination of contextual scenes due to its image-text contrastive learning. This creates a discrepancy between the semantic information the encoder provides about the video and the language model's need for both semantic and vision-only unimodal representations [51].
Figure 1.
An example of a video pair showing action hallucination in VIDHALLUC. Adversarial questions, which refer to actions in the other video of the pair, show that MLLMs are prone to hallucinations when input videos have high semantic but low visual similarity. None of the models accurately identify the prominent actions in this pair.
To address this issue, we propose DINO-HEAL, a novel, training-free algorithm designed to mitigate MLLM hallucinations in video understanding by enhancing the visual encoder’s focus on salient spatial regions during inference. DINO-HEAL applies to any CLIP-series vision encoder. To reduce reliance on the CLIP series and better encode spatially important features, DINO-HEAL uses saliency maps from DINOv2 [41] to reweight frame features, highlighting key regions. This approach enables models to achieve substantial accuracy gains, such as +5.01% for Video-LLaVA [30] and +4.77% for Video-ChatGPT [37] in action hallucination, as well as notable improvements in temporal sequence hallucination, with Video-ChatGPT and VideoLLaMA2 [9] achieving gains of +11.67% and +18.83%, respectively.
In summary, our main contributions are as follows:
We introduce VIDHALLUC, the largest benchmark for assessing hallucinations in MLLMs for video understanding, designed to evaluate action, temporal sequence, and scene transition hallucinations.
We develop a training-free algorithm, DINO-HEAL, to enhance the visual encoder’s focus on critical regions and improve the model’s robustness against hallucinations.
We conduct extensive experiments on VIDHALLUC with ten state-of-the-art models and provide a comprehensive analysis of the benchmark results. Additionally, we demonstrate the effectiveness of DINO-HEAL, achieving an average gain of 3.02% across all hallucination categories and five models.
2. Related Works
2.1. Multimodal Large Language Models
Building on the success of large language models (LLMs) [3, 11, 13, 14, 42, 44, 52, 55], researchers have enhanced these models with vision capabilities, leading to the development of MLLMs [12, 25, 71]. This is achieved by processing images as tokens concatenated with text tokens through cross-attention [2] or by directly aligning visual features with nonlinear projections [22, 33]. Recent advancements have led to numerous MLLMs designed specifically for video understanding [21, 31], which build upon image approaches to address the sequential and dynamic nature of video content. For example, Video-ChatGPT [37] employs temporal aggregation techniques to integrate information across frames, Video-LLaVA [30] uses time-sensitive attention mechanisms to manage frame dependencies, and LLaVA-NeXT-Video [66] introduces a linear scaling approach to process longer video sequences, effectively capturing extended temporal dependencies.
Most MLLMs use CLIP-based visual encoders [10, 34, 39, 43, 46, 63] for robust visual-text alignment. However, these encoders, shaped by image-text contrastive learning, struggle with dynamic scenes and temporal relationships in videos [57]. To assess and highlight these biases, VIDHALLUC leverages CLIP/SigLIP together with DINOv2 [41], which is trained solely on image data with a self-supervised objective, to curate video pairs that are likely to cause confusion and hallucinations, offering a rigorous evaluation framework for MLLMs.
2.2. Hallucination in MLLMs
Hallucination in MLLMs occurs when the model generates inaccurate or entirely fabricated responses, failing to accurately reflect the input data. To mitigate this issue in image-based tasks, Woodpecker [60] uses post-hoc corrections through additional visual cues, while LRV-Instruction [32] and HalluciDoctor [61] offer balanced instruction-tuning datasets to reduce hallucinations. As MLLMs extend into video understanding tasks, new challenges arise in handling temporal inconsistencies. To tackle this, Volcano [23] uses a critique-revise-decide framework for self-correction, and Vript [59] enhances alignment by incorporating transcribed voice-overs, enriching multimodal representations. These methods showcase tailored strategies for minimizing hallucinations across different modalities in MLLMs.
While most methods require retraining or fine-tuning the MLLM to reduce hallucinations, our DINO-HEAL approach can be applied directly during inference without additional training, offering an efficient solution for mitigating hallucinations in resource-constrained scenarios.
2.3. MLLM Benchmarks for Video Understanding
As MLLMs continue to advance, conducting quantitative evaluations remains essential. Numerous efforts [5, 27, 29, 36, 40, 59] have been undertaken to assess various aspects of MLLMs in video understanding, encompassing tasks such as video reasoning [29], text-to-video retrieval [27], video captioning [5], and hallucination [59]. These benchmarks assess models’ capabilities in understanding and generating language from video content. For example, MVBench [26] provides an extensive evaluation of temporal understanding in video-language models, encompassing 20 challenging video tasks that require dynamic analysis beyond individual frames. Video-MME [15] offers a comprehensive assessment of MLLMs’ video analysis performance, covering diverse types of videos, varying durations, and multiple data modalities. In long-form video understanding, benchmarks such as HourVideo [4] and LongVideoBench [56] focus on event ordering. In the context of hallucination, HallusionBench [17] reverses the frames in four-frame video clips and prompts the model to determine whether the events in the original captions align with those in the reversed video. Similarly, VideoHallucer [54] uses extended videos with multiple events to query the model on the correct sequence of these events. However, these hallucination benchmarks have limitations in scale, providing no more than 1,000 videos, which restricts their ability to comprehensively evaluate hallucinations.
3. VIDHALLUC Benchmark
3.1. Data Collection
Unlike other benchmarks for evaluating hallucinations in video understanding that rely on manually collected videos, our data collection pipeline is semi-automated (Figure 2). To create the VIDHALLUC benchmark, we apply this pipeline to existing video description datasets, including ActivityNet [19], YouCook2 [68], and VALOR32K [7]. In contrast to existing approaches, which typically use individual videos, our method treats video pairs as the primary data units. Below, we outline the key stages of our data collection process, including filtering for semantic and visual similarity, performing quality checks, conducting human validation, and generating hallucination-specific questions. For implementation details, please refer to the Supplementary Material.
Figure 2.
Overview of the VIDHALLUC benchmark construction process. Candidate video pairs are selected based on high semantic similarity and low visual similarity. GPT-4 is then used to generate action and scene annotations from the captions of each video clip. Human reviewers manually filter out pairs where GPT-4 annotations are incorrect or where actions/scenes do not align between clips. Finally, video pairs that pass this filtering process are used to automatically generate three types of hallucination questions: action hallucination, temporal sequence hallucination, and scene transition hallucination.
Semantic and Visual Similarity Filtering.
To identify potential hallucination pairs, we measure semantic similarity using CLIP/SigLIP and visual similarity with DINOv2. For semantic similarity, we apply average pooling to the features of CLIP/SigLIP, while DINOv2 features are pooled in the same way to calculate visual similarity. Video pairs scoring above the semantic threshold with CLIP/SigLIP but below the visual threshold with DINOv2 are flagged as hallucination candidates. These pairs highlight cases where models may over-rely on semantic cues, leading to potential misinterpretations of visually unique content. We use CLIP/SigLIP and DINOv2 for their complementary strengths: CLIP/SigLIP, trained on the cosine similarity between visual and textual projections, captures deep contextual relationships and thematic elements, while DINOv2, trained with a self-supervised objective, focuses on visual resemblance. This method also tests the model's sensitivity to subtle visual differences that might be overlooked in single-video evaluations [51]. The semantic similarity threshold and the visual similarity threshold are set to 0.9 and 0.6, respectively.
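For concreteness, the pair-selection rule can be sketched as follows. This is a minimal illustration rather than the released pipeline code; the function names and the assumption that per-frame features have already been extracted are ours.

```python
import numpy as np

def video_embedding(frame_features: np.ndarray) -> np.ndarray:
    """Average-pool per-frame features (T x D) into one L2-normalized video embedding."""
    pooled = frame_features.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def is_hallucination_candidate(clip_a, clip_b, dino_a, dino_b,
                               sem_thresh: float = 0.9, vis_thresh: float = 0.6) -> bool:
    """Flag a pair with high semantic similarity (CLIP/SigLIP) but low visual similarity (DINOv2)."""
    sem_sim = float(video_embedding(clip_a) @ video_embedding(clip_b))
    vis_sim = float(video_embedding(dino_a) @ video_embedding(dino_b))
    return sem_sim > sem_thresh and vis_sim < vis_thresh
```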
Quality Filtering.
After collecting the video pairs, we perform additional filtering to ensure dataset quality. First, we exclude pairs where either video is shorter than one second. Next, we leverage GPT-4 [1] to align each video with the actions and scenes described in the original dataset’s annotated captions. Finally, we filter out video pairs that contain identical extracted actions.
Human Validation of Video Pairs.
We recruit four participants to conduct a manual review to eliminate pairs with the following issues: (1) the lack of a clear action in either video, (2) the presence of multiple actions in either video, or (3) identical actions in both videos. These rigorous steps result in a high-quality dataset, which serves as a reliable benchmark.
Automatic Question Generation.
We categorize the benchmark into three distinct hallucination types: (1) action hallucination, (2) temporal sequence hallucination, and (3) scene transition hallucination. For each type, we design specific question formats to evaluate model performance: binary QA and multiple-choice questions (MCQs) for action hallucination, sorting questions for temporal sequence hallucination, and open-ended questions for scene transition hallucination. After the filtering steps, the high-quality video pairs, along with their corresponding annotations, are further processed to automatically generate questions tailored to each hallucination type and question format. An example question for each hallucination type is shown in Figure 3.
Figure 3.
Examples of three hallucination types in the VIDHALLUC benchmark: (1) action hallucination, where the model detects actions in a video that significantly differ from the actual actions; (2) temporal sequence hallucination, where the model fails to represent the correct temporal order of events in a video; and (3) scene transition hallucination, where the model inaccurately describes transitions between distinct scenes within a video.
Human Verification of QAs.
We recruit nine participants to verify and refine the entire dataset. For action hallucination (ACH) questions, they check whether the action described in the question matches the video and correct any inaccuracies. For temporal sequence hallucination (TSH) questions, they ensure that the action order in the answer aligns with the video, making revisions where necessary. For scene transition hallucination (STH) questions, they verify that the scene described in the answer corresponds to the video and adjust any mismatches. Following verification, 939 ACH, 68 TSH, and 38 STH questions are modified, yielding an accuracy of 88.76% for the automatically generated QAs.
3.2. Action Hallucination
Action hallucination (ACH) occurs when a model identifies actions in a video that are either non-existent or significantly different from the actual actions. This typically arises when the model misinterprets visual cues or overgeneralizes, leading to inaccurate conclusions about the video's content. In VIDHALLUC, we evaluate MLLMs for action hallucination using two question formats: binary QA and MCQs. The MCQs test the model's basic understanding of actions, assessing its ability to identify them without extra context. In contrast, binary QA evaluates the model's susceptibility to hallucinations, especially when confronted with leading or suggestive questions.
For the binary QA, we use a question format such as "Is the prominent action in the video {action}?". Each video pair involves two distinct actions: we first ask the question about Video 1 and then about Video 2. The model should answer "Yes" for the video containing the action and "No" for the video without it. A response is counted as correct only if the model answers both questions correctly. We evaluate the model's performance using accuracy, as defined in Equation 1:

$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \quad (1)$$

where $N_{\text{correct}}$ is the number of correctly answered questions and $N_{\text{total}}$ is the total number of questions.
For MCQs, we use a format such as “What is the prominent action in the video?”. Each question has four answer choices, with only one correct answer representing the action occurring in the video. One incorrect option corresponds to the action in the other video in the pair, while the remaining two options are generated using GPT-4o [20]. By providing a frame from the video, we ask GPT-4o to generate plausible actions that could occur in the same scene, but distinct from the correct action. The model must select the correct action from the four options. Accuracy remains the evaluation metric, calculated in the same manner as in Equation 1.
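To make the scoring rules concrete, the sketch below shows how paired binary-QA items (both questions of a pair must be answered correctly) and MCQs could be scored; the data layout and function names are illustrative, not the released evaluation code.

```python
def binary_pair_accuracy(pairs) -> float:
    """Each pair is two (prediction, ground_truth) tuples; a pair counts only if both match."""
    correct = sum(all(pred == gt for pred, gt in pair) for pair in pairs)
    return correct / len(pairs)

def mcq_accuracy(items) -> float:
    """Each item is a (predicted_choice, correct_choice) tuple, e.g. ('B', 'C')."""
    return sum(pred == gt for pred, gt in items) / len(items)

# One binary-QA pair fully correct, one only half correct; two MCQs, one correct.
print(binary_pair_accuracy([[("Yes", "Yes"), ("No", "No")],
                            [("Yes", "Yes"), ("Yes", "No")]]))  # 0.5
print(mcq_accuracy([("B", "B"), ("A", "C")]))                   # 0.5
```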
3.3. Temporal Sequence Hallucination
Temporal sequence hallucination (TSH) occurs when a model misinterprets the order or timing of events in a video, confusing the sequence in which actions or scenes unfold. In VIDHALLUC, model performance is evaluated on TSH by testing its ability to identify the correct sequence of events. To do this, we first select two video clips from a pair that appear consecutively in the original video. These clips are then concatenated in their original chronological order to form a longer video segment. The model’s task is to determine the correct sequence of actions within this concatenated segment. We use accuracy, calculated as in Equation 1, as the evaluation metric to assess how well the model interprets the temporal relationships between the consecutive events.
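A TSH sample can therefore be assembled and scored in a few lines; the sketch below is schematic, and the field names are ours.

```python
def build_tsh_sample(clip1_frames, clip2_frames, action1: str, action2: str):
    """Concatenate two consecutive clips in chronological order and keep the true action order."""
    return {"frames": list(clip1_frames) + list(clip2_frames),
            "answer": [action1, action2]}

def tsh_correct(predicted_order, answer) -> bool:
    """The prediction is correct only if the full predicted sequence matches the ground truth."""
    return list(predicted_order) == list(answer)

sample = build_tsh_sample(["f1.jpg"], ["f2.jpg"], "lifting a cup", "placing a cup down")
print(tsh_correct(["lifting a cup", "placing a cup down"], sample["answer"]))  # True
```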
3.4. Scene Transition Hallucination
Scene transition hallucination (STH) occurs when a model fails to detect or describe transitions between different scenes in a video. This can result in blending elements from different scenes or failing to recognize the start of a new scene, leading to inaccurate descriptions. We design evaluation tasks to assess MLLMs’ ability to detect scene transitions accurately. To create these tasks, we first filter video pairs to identify those with distinct scenes, using captions from the original datasets to detect scene differences. After filtering, each video pair is concatenated in both possible orders to produce two unique long video clips. Additionally, an equal number of video pairs with no scene changes are collected to ensure dataset balance. For video clips containing a scene transition, the ground truth labels indicate “Yes” for scene change and describe the transition as “from [Scene1] to [Scene2]”. For clips without a scene change, the ground truth labels indicate “No”.
The model is tasked with determining if a scene change occurs in each concatenated video segment and, if so, identifying the specific transition. We evaluate the model’s performance based on two criteria: (1) scene change classification, which assesses whether the model correctly identifies a scene change, and (2) scene transition description, which measures the model’s ability to describe the transition from one scene to another accurately. For the scene change classification task, we calculate the Matthews correlation coefficient (MCC) between the model’s predictions and the ground truth labels, as defined in Equation 2:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \quad (2)$$
where $TP$, $TN$, $FP$, and $FN$ are the non-negative counts of true positives, true negatives, false positives, and false negatives, determined by the actual condition and the predicted condition. To adjust MCC to range between 0 and 1 and to further penalize models that consistently answer only "Yes" or only "No", we apply the transformation in Equation 3 to obtain the classification score $S_{\text{cls}}$:

$$S_{\text{cls}} = \max(0, \mathrm{MCC}) \quad (3)$$
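In code, the classification score can be computed from the confusion counts as sketched below. The max(0, ·) clipping mirrors the description above (a score in [0, 1], with constant-answer models receiving 0), but the exact transformation in the released evaluation script may differ.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def classification_score(tp: int, tn: int, fp: int, fn: int) -> float:
    # A model that always answers "Yes" (or always "No") has MCC = 0 and thus scores 0.
    return max(0.0, mcc(tp, tn, fp, fn))

print(classification_score(tp=40, tn=35, fp=10, fn=15))  # ~0.50
```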
The description task evaluates the model’s ability to accurately recognize and describe scene information. To assess this, we extract scene descriptions from both the model’s output and the ground truth, formatting them for direct comparison. We then calculate the cosine similarity between the SimCSE [16] embeddings of the corresponding scenes. Based on this similarity measure, each scene description is assigned a score using Equation 4:
$$s(c) = \begin{cases} \sigma(c), & c \geq \tau \\ 0, & c < \tau \end{cases} \quad (4)$$

where $c$ is the cosine similarity between the SimCSE [16] embeddings of the corresponding scenes, $\tau$ is the lowest threshold we set to start giving a score, and $\sigma$ represents the Sigmoid function. The total description score $S_{\text{desc}}$ is the average of the scores for the "from" and "to" scenes.
Finally, we calculate the overall evaluation score as a weighted sum of the classification score and the normalized description score as below:
$$S_{\text{STH}} = w_{\text{cls}} \cdot S_{\text{cls}} + w_{\text{desc}} \cdot S_{\text{desc}} \quad (5)$$

where $w_{\text{cls}}$ and $w_{\text{desc}}$ are the weights of the classification and description components (set to 0.6 and 0.4, respectively; see Table 3).
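A sketch of the description and overall STH scores follows. The threshold value and the exact argument passed to the sigmoid are assumptions on our part, while the 0.6/0.4 weights match those reported with Table 3.

```python
import math

def description_score(cos_sim: float, tau: float = 0.5) -> float:
    """Assumed form of Eq. (4): zero below the threshold, sigmoid-shaped above it."""
    return 0.0 if cos_sim < tau else 1.0 / (1.0 + math.exp(-cos_sim))

def sth_score(s_cls: float, s_from: float, s_to: float,
              w_cls: float = 0.6, w_desc: float = 0.4) -> float:
    """Eq. (5): weighted sum of the classification score and the averaged description score."""
    s_desc = 0.5 * (s_from + s_to)
    return w_cls * s_cls + w_desc * s_desc

print(sth_score(s_cls=1.0, s_from=description_score(0.9), s_to=description_score(0.8)))
```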
3.5. Dataset Statistics
Table 2 summarizes the statistics for VIDHALLUC. The dataset consists of 5,002 unique videos in total, with 3,957 videos for ACH, 600 for TSH, and 445 for STH. VIDHALLUC also includes 9,295 QA pairs, with 8,250 questions for ACH, 600 for TSH, and 445 for STH. The average video length varies by hallucination type: ACH videos have an average length of approximately 21.79 seconds, TSH videos average around 41.19 seconds, and STH videos average 28.72 seconds. The overall average length across the benchmark is 24.70 seconds. Figure 4 illustrates the distribution of video durations, showing a peak in the 10–20 second range. These statistics highlight the diversity and scale of VIDHALLUC, establishing it as a valuable resource for evaluating video-based hallucinations.
Table 2.
VIDHALLUC statistics.
| Statistics | ACH | TSH | STH | Total |
|---|---|---|---|---|
| # of Videos | 3,957 | 600 | 445 | 5,002 |
| # of Questions | 8,250 | 600 | 445 | 9,295 |
| Average Duration (s) | 21.79 | 41.19 | 28.72 | 24.70 |
Figure 4.
The distribution of video durations in the VIDHALLUC benchmark.
4. DINO-HEAL
We introduce DINO-HEAL, a training-free method that mitigates hallucinations by using saliency maps from DINOv2 to reweight features from the frozen visual encoder, emphasizing key spatial regions. DINO-HEAL requires no architectural modifications or additional training.
4.1. DINOv2 Saliency Map Extraction
For each input frame, we process it through the DINOv2 model and extract the attention weights of the last layer, where $A^{(h)} \in \mathbb{R}^{N \times N}$ denotes the attention map of the $h$-th head and $N$ is the sequence length (number of tokens), including the [CLS] token. We then compute the average attention across all heads:

$$\bar{A} = \frac{1}{H} \sum_{h=1}^{H} A^{(h)} \quad (6)$$
where $H$ is the total number of attention heads. Then, we extract the attention values of the [CLS] token over all spatial tokens, excluding itself:

$$a = \bar{A}_{\text{[CLS]},\,1:N-1} \in \mathbb{R}^{N-1} \quad (7)$$

where $a$ represents the saliency scores derived from DINOv2 for the spatial tokens. The saliency score of the $j$-th spatial token is its corresponding [CLS] attention weight:

$$S_j = a_j, \quad j = 1, \dots, N-1 \quad (8)$$

This results in a saliency map $S$ that represents the importance of each spatial token in the frame.
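A minimal sketch of this saliency extraction is shown below, assuming the Hugging Face dinov2-base checkpoint and that the [CLS] token sits at index 0 of the token sequence; the checkpoint name and preprocessing are illustrative choices, not necessarily those used in our experiments.

```python
import torch
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base",
                                  attn_implementation="eager").eval()  # eager so attentions are returned

@torch.no_grad()
def dino_saliency(frame):  # frame: a PIL.Image for one video frame
    inputs = processor(images=frame, return_tensors="pt")
    out = model(**inputs, output_attentions=True)
    attn = out.attentions[-1]        # last layer: (1, heads, N, N)
    attn = attn.mean(dim=1)          # Eq. (6): average over heads -> (1, N, N)
    return attn[0, 0, 1:]            # Eqs. (7)-(8): [CLS] attention over the spatial tokens
```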
4.2. Alignment and Upsampling of Saliency Maps
To align the DINOv2 saliency map with the original visual features, we reshape $S$ into a 2D grid corresponding to the spatial dimensions of the DINOv2 tokens:

$$S_{\text{2D}} = \operatorname{reshape}\big(S, (h_d, w_d)\big) \quad (9)$$

where $h_d$ and $w_d$ are the height and width of the DINOv2 token grid. We then upsample the saliency map to match the spatial dimensions of the visual features from the original visual encoder:

$$S_{\text{up}} = \operatorname{Upsample}\big(S_{\text{2D}}, (h_v, w_v)\big) \quad (10)$$

where $h_v$ and $w_v$ denote the height and width of the visual feature grid, corresponding to the spatial dimensions of the features extracted by the original visual encoder. We then flatten the upsampled saliency map back into a vector and normalize it with the sigmoid function:

$$\hat{S} = \sigma\big(\operatorname{flatten}(S_{\text{up}})\big) \quad (11)$$
4.3. Reweighting Visual Features
Finally, we enhance the visual features $F$ from the original visual encoder by reweighting them with the saliency map:

$$F' = \hat{S} \odot F \quad (12)$$

where $\odot$ denotes element-wise multiplication, with $\hat{S}$ broadcast across the feature dimension. This adaptive reweighting strategy enables DINOv2 to enhance key visual features by directly focusing on areas highlighted by the saliency map, thereby mitigating hallucinations while preserving the original feature representation.
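Putting Sections 4.2 and 4.3 together, the reweighting can be sketched as follows; the bilinear interpolation mode and the purely multiplicative form of Eq. (12) are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def reweight_visual_features(features: torch.Tensor,   # (h_v * w_v, d) frame features
                             saliency: torch.Tensor,   # (h_d * w_d,) DINOv2 saliency scores
                             dino_grid: tuple, feat_grid: tuple) -> torch.Tensor:
    h_d, w_d = dino_grid
    h_v, w_v = feat_grid
    sal_2d = saliency.reshape(1, 1, h_d, w_d)                     # Eq. (9)
    sal_up = F.interpolate(sal_2d, size=(h_v, w_v),
                           mode="bilinear", align_corners=False)  # Eq. (10)
    weights = torch.sigmoid(sal_up.flatten())                     # Eq. (11)
    return features * weights.unsqueeze(-1)                       # Eq. (12)

# Example with dummy tensors: a 16x16 DINOv2 grid reweighting a 24x24 feature grid.
out = reweight_visual_features(torch.randn(24 * 24, 1024), torch.rand(16 * 16),
                               dino_grid=(16, 16), feat_grid=(24, 24))
print(out.shape)  # torch.Size([576, 1024])
```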
5. Experiments
We evaluate ten state-of-the-art MLLMs on VIDHALLUC, including eight open-source and two proprietary models (Section 5.1). Our results reveal that most MLLMs exhibit notable vulnerabilities on VIDHALLUC. To further evaluate the quality of VIDHALLUC, we also conduct a human evaluation of our benchmark. We recruit four participants, each answering a randomly assigned half of the ACH, TSH, and STH queries, ensuring that each query is covered by two individuals. The questions, answer formats, and evaluation metrics are identical to those used for the models to ensure fairness. We then assess the performance of our DINO-HEAL on VIDHALLUC with five different MLLMs, demonstrating its effectiveness in enhancing the robustness of these models against various types of hallucinations (Section 5.2). During inference, we preserve each model’s original configuration, including conversation mode, hyperparameters, and frame count. Following standard practices [9, 64], we set the temperature to 0, top_k to 1, and disable sampling for all models to avoid randomness in response generation. For proprietary models, we sample 16 frames. Implementation details and full results are in the Supplementary Material.
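For reproducibility, the deterministic decoding settings described above can be captured in a single configuration dictionary; the names follow common Hugging Face `generate` conventions, and the token cap is illustrative.

```python
# Greedy, deterministic decoding so repeated runs yield identical answers.
GENERATION_KWARGS = dict(
    do_sample=False,     # disable sampling
    temperature=0.0,     # no randomness (ignored under greedy decoding)
    top_k=1,
    max_new_tokens=128,  # illustrative cap; each model otherwise keeps its own defaults
)
```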
5.1. Evaluation on VIDHALLUC
Table 3 presents the performance of all tested MLLMs on VIDHALLUC, showing both accuracy and score metrics as percentages, reflecting the occurrence of hallucinations across various tasks. For the ACH task, we observe that most models score at least 20% higher on MCQs compared to binary QA, even though both question types are based on the same video and the same ground truth action. This significant difference is due to the MCQ design, which includes the ground truth action, an adversarial (semantically similar) action, and two distractors (irrelevant actions). In contrast, binary QA presents either the ground truth or an adversarial action. Due to high semantic similarity, the model struggles to answer “No” to adversarial actions, as it lacks the comparative context of MCQs and must rely solely on action semantics. Higher MCQ accuracy suggests that when multiple options are available, the model can leverage additional contextual cues to differentiate semantically similar actions, reducing hallucination risks. Without this context in binary QA, the model is more prone to confusion and errors.
Table 3.
Performance comparison of existing MLLMs on VIDHALLUC, covering action hallucination (ACH), temporal sequence hallucination (TSH), and scene transition hallucination (STH) tasks. For the STH task, we assign a weight of 0.6 to the classification task and 0.4 to the description task. The numbers in the table represent accuracy percentages (%). Bold numbers denote the best performance, and underlined numbers indicate the second-best performance.
| Models | LLM Params | Encoder | Frames | ACH Binary QA↑ | ACH MCQ↑ | TSH Accuracy↑ | STH Score↑ |
|---|---|---|---|---|---|---|---|
| Video-ChatGPT [37] | 7B | CLIP ViT-L/14-224 | 100 | 9.50 | 24.58 | 30.17 | 7.70 |
| Video-LLaVA [30] | 7B | LanguageBind-Video | 8 | 26.84 | 64.45 | 27.17 | 29.60 |
| ShareGPT4Video [6] | 8B | CLIP ViT-L/14-336 | 16 | 29.96 | 44.78 | 49.50 | 17.83 |
| Chat-UniVi [21] | 13B | CLIP ViT-L/14-224 | fps=1 | 23.77 | 54.79 | 35.50 | 29.87 |
| LLaVA-NeXT-Video [66] | 34B | CLIP ViT-L/14-336 | 32 | 26.60 | 77.57 | 21.33 | 44.40 |
| PLLaVA [58] | 13B | CLIP ViT-L/14-336 | 16 | 35.30 | 76.96 | 16.50 | 32.44 |
| VideoLLaMA2 [9] | 7B | CLIP ViT-L/14-336 | 16 | 50.04 | 83.84 | 37.17 | 65.12 |
| VILA1.5 [31] | 13B | SigLIP ViT-SO-14-384 | 8 | 58.46 | 81.88 | 63.33 | 35.03 |
| Gemini-1.5-Pro [49] | - | - | 16 | 75.27 | 79.25 | 83.83 | 63.96 |
| GPT-4o [20] | - | - | 16 | 81.15 | 90.95 | 82.00 | 71.58 |
| Human | - | - | - | 95.14 | 93.29 | 90.17 | 87.43 |
For TSH, most models score below 50%, revealing significant challenges in distinguishing between semantically similar but distinct actions over time, particularly when these actions appear similar to the visual encoder. For example, models often struggle to differentiate between successive actions, such as "lifting a cup" followed by "placing a cup down," when they occur in rapid sequence within the same video. Notably, in nearly 50% of the incorrect cases, models perceive only a single action throughout the entire video, failing to detect multiple actions or transitions. Similarly, for STH, scores are generally low, with most models failing to surpass 40%, highlighting substantial difficulties in accurately detecting and segmenting scene changes.
We also observe that model size and the number of input frames do not correlate directly with performance. For instance, LLaVA-NeXT-Video-34B, which processes 32 frames, performs significantly worse on each task compared to VILA1.5–13B, which operates with only 8 frames. Additionally, models with higher-resolution visual encoders generally outperform those with lower resolutions. Specifically, models using the CLIP ViT-L/14–336 encoder achieve higher scores across nearly all metrics compared to those using the lower-resolution CLIP ViT-L/14–224 encoder.
Proprietary models generally outperform open-source models across most tasks, with GPT-4o standing out in particular. However, GPT-4o still falls short of human performance, showing a 13.99% gap in ACH binary QA and a 15.85% gap in STH. In contrast, Gemini-1.5-Pro does not consistently surpass the best open-source models. For instance, its STH score (63.96%) is slightly lower than that of VideoLLaMA2 (65.12%). These results highlight the need for improvement even in top-performing proprietary models.
Because GPT-4o plays a key role in the MCQ component by generating plausible distractors, we investigate whether it gains an unfair advantage by recognizing its own generated options. To assess this, we test GPT-4o on 300 videos with corresponding MCQs. Among these, GPT-4o correctly identifies its own generated distractors only twice, misidentifies them 33 times, and responds with “I don’t know” in 265 instances. These results suggest that GPT-4o is unlikely to gain an unfair advantage during evaluation.
5.2. Evaluation of DINO-HEAL on VIDHALLUC
To evaluate the effectiveness of DINO-HEAL in mitigating hallucinations, we apply it to five backbones: Video-ChatGPT [37], Video-LLaVA [30], ShareGPT4Video [6], VILA1.5 [31], and VideoLLaMA2 [9]. Table 4 shows that DINO-HEAL enables substantial improvements across various hallucination types, achieving an average gain of 3.02%.
Table 4.
Performance comparison of models on action hallucination (ACH), temporal sequence hallucination (TSH), and scene transition hallucination (STH) tasks, with and without DINO-HEAL. Improvements from DINO-HEAL are shown in parentheses. Bold numbers denote the best performance after applying DINO-HEAL.

| Models | ACH Binary QA | ACH MCQ | TSH | STH |
|---|---|---|---|---|
| Video-ChatGPT | 9.50 | 24.58 | 30.17 | 7.70 |
| +DINO-HEAL | 13.96 (+4.46) | 28.81 (+4.23) | 41.83 (+11.66) | 8.20 (+0.50) |
| Video-LLaVA | 26.84 | 64.45 | 27.17 | 29.60 |
| +DINO-HEAL | 33.80 (+6.96) | 66.25 (+1.80) | 28.50 (+1.33) | 31.42 (+1.82) |
| ShareGPT4Video | 29.96 | 44.78 | 49.50 | 17.83 |
| +DINO-HEAL | 30.41 (+0.45) | 44.43 (−0.35) | 55.33 (+5.83) | 18.39 (+0.56) |
| VILA1.5 | 58.46 | 81.88 | 63.33 | 35.03 |
| +DINO-HEAL | 60.63 (+2.17) | 81.85 (−0.03) | 64.17 (+0.84) | 36.15 (+1.12) |
| VideoLLaMA2 | 50.04 | 83.84 | 37.17 | 65.12 |
| +DINO-HEAL | 50.01 (−0.03) | 83.84 (+0.00) | 39.50 (+2.33) | 66.17 (+1.05) |
In the ACH task, DINO-HEAL leads to notable accuracy gains, with Video-LLaVA and Video-ChatGPT showing increases of +6.96% and +4.46% in Binary QA, respectively.
For the TSH task, DINO-HEAL significantly enhances temporal coherence. In contrast, improvements in the STH task are more limited, with an average increase of 1.01%. We attribute this to DINOv2's inherent focus on foreground objects rather than background elements or scene transitions, since it is primarily trained to extract visual features through discriminative self-supervised learning. This foreground bias strengthens DINO-HEAL's impact on action recognition, driving significant gains in ACH and TSH, but limits its effectiveness for background scene understanding, as models may overlook background changes crucial for accurate scene boundary detection.
DINO-HEAL is also designed to be compatible with a variety of visual encoders beyond CLIP, such as LanguageBind-Video [70] in Video-LLaVA and SigLIP [63] in VILA1.5. After applying DINO-HEAL, both Video-LLaVA and VILA1.5 demonstrate consistent improvements across nearly all tasks, highlighting DINO-HEAL’s versatility and effectiveness across diverse model architectures.
6. Conclusion and Future Work
We introduce VIDHALLUC, the largest benchmark for evaluating action, temporal sequence, and scene transition hallucinations in MLLMs for video understanding. We also present DINO-HEAL, a novel training-free method to mitigate MLLM hallucinations by enhancing the visual encoder’s focus on salient spatial regions during inference, improving model robustness against hallucinations without additional training. Extensive experiments show the effectiveness of DINO-HEAL in mitigating hallucinations across models. Future work includes expanding hallucination categories to assess models in diverse settings and enhancing DINO-HEAL with a dual-stream design integrating both spatial and temporal saliency for improved video understanding.
Supplementary Material
Figure 5.
DINO-HEAL pipeline. Since DINOv2 effectively captures salient regions in the input video, we leverage it to guide the reweighting of the attention given to different spatial regions within the feature from the visual encoder.
Figure 6.
Comparative results on VIDHALLUC across various hallucination types. ACH, TSH, and STH indicate action, temporal sequence, and scene transition hallucination, respectively.
Table 1.
Comparison of VIDHALLUC with recent hallucination benchmarks in video understanding. VIDHALLUC is the largest benchmark for evaluating hallucinations, supporting diverse question formats, including binary QA, MCQ, open-ended, and sorting questions, with control pairs and adversarial evaluation to enhance robustness.
| Benchmark | # of Ques. / # of Videos | Binary QA | MCQ | Open-Ended Ques. | Sorting Ques. | Control Pairs | Adversarial |
|---|---|---|---|---|---|---|---|
| HallusionBench [17] | 1,129 / 346 | ✓ | ✓ | × | × | ✓ | ✓ |
| VideoHallucer [54] | 1,800 / 948 | ✓ | × | × | × | × | ✓ |
| Vript-HAL [59] | 122 / 122 | × | × | ✓ | × | × | × |
| EventHallusion [65] | - / 400 | ✓ | × | ✓ | × | × | × |
| VIDHALLUC (Ours) | 9,295 / 5,002 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Acknowledgments
This research was supported by the National Eye Institute (NEI) of the National Institutes of Health (NIH) under award number R01EY034562. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
References
- [1] Achiam Josh, Adler Steven, Agarwal Sandhini, Ahmad Lama, Akkaya Ilge, Aleman Florencia Leoni, Almeida Diogo, Altenschmidt Janko, Altman Sam, Anadkat Shyamal, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Alayrac Jean-Baptiste, Donahue Jeff, Luc Pauline, Miech Antoine, Barr Iain, Hasson Yana, Lenc Karel, Mensch Arthur, Millican Katie, Reynolds Malcolm, Ring Roman, Rutherford Eliza, Cabi Serkan, Han Tengda, Gong Zhitao, Samangooei Sina, Monteiro Marianne, Menick Jacob, Borgeaud Sebastian, Brock Andrew, Nematzadeh Aida, Sharifzadeh Sahand, Binkowski Mikolaj, Barreira Ricardo, Vinyals Oriol, Zisserman Andrew, and Simonyan Karen. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- [3] Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared D, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, Agarwal Sandhini, Herbert-Voss Ariel, Krueger Gretchen, Henighan Tom, Child Rewon, Ramesh Aditya, Ziegler Daniel, Wu Jeffrey, Winter Clemens, Hesse Chris, Chen Mark, Sigler Eric, Litwin Mateusz, Gray Scott, Chess Benjamin, Clark Jack, Berner Christopher, McCandlish Sam, Radford Alec, Sutskever Ilya, and Amodei Dario. Language models are few-shot learners. In NeurIPS, 2020.
- [4] Chandrasegaran Keshigeyan, Gupta Agrim, Hadzic Lea M., Kota Taran, He Jimming, Eyzaguirre Cristobal, Durante Zane, Li Manling, Wu Jiajun, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. In NeurIPS, 2024.
- [5] Chen Houlun, Wang Xin, Chen Hong, Zhang Zeyang, Feng Wei, Huang Bin, Jia Jia, and Zhu Wenwu. Verified: A video corpus moment retrieval benchmark for fine-grained video understanding. In NeurIPS, 2024.
- [6] Chen Lin, Wei Xilin, Li Jinsong, Dong Xiaoyi, Zhang Pan, Zang Yuhang, Chen Zehui, Duan Haodong, Lin Bin, Tang Zhenyu, Yuan Li, Qiao Yu, Lin Dahua, Zhao Feng, and Wang Jiaqi. Sharegpt4video: Improving video understanding and generation with better captions. In NeurIPS, 2024.
- [7] Chen Sihan, He Xingjian, Guo Longteng, Zhu Xinxin, Wang Weining, Tang Jinhui, and Liu Jing. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345, 2023.
- [8] Chen Xiang, Wang Chenxi, Xue Yida, Zhang Ningyu, Yang Xiaoyan, Li Qiang, Shen Yue, Liang Lei, Gu Jinjie, and Chen Huajun. Unified hallucination detection for multimodal large language models. In ACL, 2024.
- [9] Cheng Zesen, Leng Sicong, Zhang Hang, Xin Yifei, Li Xin, Chen Guanzheng, Zhu Yongxin, Zhang Wenqi, Luo Ziyang, Zhao Deli, and Bing Lidong. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024.
- [10] Cherti Mehdi, Beaumont Romain, Wightman Ross, Wortsman Mitchell, Ilharco Gabriel, Gordon Cade, Schuhmann Christoph, Schmidt Ludwig, and Jitsev Jenia. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023.
- [11] Chung Hyung Won, Hou Le, Longpre Shayne, Zoph Barret, Tay Yi, Fedus William, Li Yunxuan, Wang Xuezhi, Dehghani Mostafa, Brahma Siddhartha, et al. Scaling instruction-finetuned language models. JMLR, 2024.
- [12] Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, and Hoi Steven. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
- [13] Devlin Jacob. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [14] Driess Danny, Xia Fei, Sajjadi Mehdi S. M., Lynch Corey, Chowdhery Aakanksha, Ichter Brian, Wahid Ayzaan, Tompson Jonathan, Vuong Quan, Yu Tianhe, Huang Wenlong, Chebotar Yevgen, Sermanet Pierre, Duckworth Daniel, Levine Sergey, Vanhoucke Vincent, Hausman Karol, Toussaint Marc, Greff Klaus, Zeng Andy, Mordatch Igor, and Florence Pete. Palm-e: an embodied multimodal language model. In ICML, 2023.
- [15] Fu Chaoyou, Dai Yuhan, Luo Yongdong, Li Lei, Ren Shuhuai, Zhang Renrui, Wang Zihan, Zhou Chenyu, Shen Yunhang, Zhang Mengdan, Chen Peixian, Li Yanwei, Lin Shaohui, Zhao Sirui, Li Ke, Xu Tong, Zheng Xiawu, Chen Enhong, Ji Rongrong, and Sun Xing. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.
- [16] Gao Tianyu, Yao Xingcheng, and Chen Danqi. SimCSE: Simple contrastive learning of sentence embeddings. In EMNLP, 2021.
- [17] Guan Tianrui, Liu Fuxiao, Wu Xiyang, Xian Ruiqi, Li Zongxia, Liu Xiaoyu, Wang Xijun, Chen Lichang, Huang Furong, Yacoob Yaser, Manocha Dinesh, and Zhou Tianyi. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.
- [18] Han Tianyang, Lian Qing, Pan Rui, Pi Renjie, Zhang Jipeng, Diao Shizhe, Lin Yong, and Zhang Tong. The instinctive bias: Spurious images lead to hallucination in mllms. arXiv preprint arXiv:2402.03757, 2024.
- [19] Heilbron Fabian Caba, Escorcia Victor, Ghanem Bernard, and Niebles Juan Carlos. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
- [20] Hurst Aaron, Lerer Adam, Goucher Adam P., Perelman Adam, Ramesh Aditya, Clark Aidan, Ostrow AJ, Welihinda Akila, Hayes Alan, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [21] Jin Peng, Takanobu Ryuichi, Zhang Wancai, Cao Xiaochun, and Yuan Li. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024.
- [22] Koh Jing Yu, Salakhutdinov Ruslan, and Fried Daniel. Grounding language models to images for multimodal inputs and outputs. In ICML, 2023.
- [23] Lee Seongyun, Park Sue Hyun, Jo Yongrae, and Seo Minjoon. Volcano: Mitigating multimodal hallucination through self-feedback guided revision. In ACL, 2024.
- [24] Leng Sicong, Zhang Hang, Chen Guanzheng, Li Xin, Lu Shijian, Miao Chunyan, and Bing Lidong. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024.
- [25] Li Junnan, Li Dongxu, Savarese Silvio, and Hoi Steven. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- [26] Li Kunchang, Wang Yali, He Yinan, Li Yizhuo, Wang Yi, Liu Yi, Wang Zun, Xu Jilan, Chen Guo, Luo Ping, Wang Limin, and Qiao Yu. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024.
- [27] Li Linjie, Lei Jie, Gan Zhe, Yu Licheng, Chen Yen-Chun, Pillai Rohit, Cheng Yu, Zhou Luowei, Wang Xin Eric, Wang William Yang, Berg Tamara Lee, Bansal Mohit, Liu Jingjing, Wang Lijuan, and Liu Zicheng. Value: A multitask benchmark for video-and-language understanding evaluation. In NeurIPS, 2024.
- [28] Li Yifan, Du Yifan, Zhou Kun, Wang Jinpeng, Zhao Xin, and Wen Ji-Rong. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.
- [29] Li Yunxin, Chen Xinyu, Hu Baotian, Wang Longyue, Shi Haoyuan, and Zhang Min. Videovista: A versatile benchmark for video understanding and reasoning. arXiv preprint arXiv:2406.11303, 2024.
- [30] Lin Bin, Ye Yang, Zhu Bin, Cui Jiaxi, Ning Munan, Jin Peng, and Yuan Li. Video-llava: Learning united visual representation by alignment before projection. In EMNLP, 2024.
- [31] Lin Ji, Yin Hongxu, Ping Wei, Lu Yao, Molchanov Pavlo, Tao Andrew, Mao Huizi, Kautz Jan, Shoeybi Mohammad, and Han Song. Vila: On pre-training for visual language models. In CVPR, 2024.
- [32] Liu Fuxiao, Lin Kevin, Li Linjie, Wang Jianfeng, Yacoob Yaser, and Wang Lijuan. Mitigating hallucination in large multi-modal models via robust instruction tuning. In ICLR, 2024.
- [33] Liu Haotian, Li Chunyuan, Wu Qingyang, and Lee Yong Jae. Visual instruction tuning. In NeurIPS, 2023.
- [34] Liu Haotian, Li Chunyuan, Li Yuheng, and Lee Yong Jae. Improved baselines with visual instruction tuning. In CVPR, 2024.
- [35] Liu Jiazhen, Fu Yuhan, Xie Ruobing, Xie Runquan, Sun Xingwu, Lian Fengzong, Kang Zhanhui, and Li Xirong. Phd: A prompted visual hallucination evaluation dataset. arXiv preprint arXiv:2403.11116, 2024.
- [36] Liu Yuan, Duan Haodong, Zhang Yuanhan, Li Bo, Zhang Songyang, Zhao Wangbo, Yuan Yike, Wang Jiaqi, He Conghui, Liu Ziwei, Chen Kai, and Lin Dahua. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024.
- [37] Maaz Muhammad, Rasheed Hanoona, Khan Salman, and Khan Fahad. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In ACL, 2024.
- [38] Mangalam Karttikeya, Akshulakov Raiymbek, and Malik Jitendra. Egoschema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2023.
- [39] Naeem Muhammad Ferjad, Xian Yongqin, Zhai Xiaohua, Hoyer Lukas, Van Gool Luc, and Tombari Federico. Silc: Improving vision language pretraining with self-distillation. In ECCV, 2024.
- [40] Nguyen Nguyen, Bi Jing, Vosoughi Ali, Tian Yapeng, Fazli Pooyan, and Xu Chenliang. Oscar: Object state captioning and state change representation. In NAACL, 2024.
- [41] Oquab Maxime, Darcet Timothée, Moutakanni Théo, Vo Huy V., Szafraniec Marc, Khalidov Vasil, Fernandez Pierre, Haziza Daniel, Massa Francisco, El-Nouby Alaaeldin, Assran Mido, Ballas Nicolas, Galuba Wojciech, Howes Russell, Huang Po-Yao, Li Shang-Wen, Misra Ishan, Rabbat Michael, Sharma Vasu, Synnaeve Gabriel, Xu Hu, Jegou Herve, Mairal Julien, Labatut Patrick, Joulin Armand, and Bojanowski Piotr. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
- [42] Peng Baolin, Li Chunyuan, He Pengcheng, Galley Michel, and Gao Jianfeng. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
- [43] Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, Krueger Gretchen, and Sutskever Ilya. Learning transferable visual models from natural language supervision. PMLR, 2021.
- [44] Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
- [45] Rohrbach Anna, Hendricks Lisa Anne, Burns Kaylee, Darrell Trevor, and Saenko Kate. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.
- [46] Sun Quan, Fang Yuxin, Wu Ledell, Wang Xinlong, and Cao Yue. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
- [47] Sun Zhiqing, Shen Sheng, Cao Shengcao, Liu Haotian, Li Chunyuan, Shen Yikang, Gan Chuang, Gui Liang-Yan, Wang Yu-Xiong, Yang Yiming, Keutzer Kurt, and Darrell Trevor. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
- [48] Tang Yunlong, Bi Jing, Xu Siting, Song Luchuan, Liang Susan, Wang Teng, Zhang Daoan, An Jie, Lin Jingyang, Zhu Rongyi, Vosoughi Ali, Huang Chao, Zhang Zeliang, Liu Pinxin, Feng Mingqian, Zheng Feng, Zhang Jianguo, Luo Ping, Luo Jiebo, and Xu Chenliang. Video understanding with large language models: A survey. arXiv preprint arXiv:2312.17432, 2023.
- [49] Team Gemini, Georgiev Petko, Lei Ving Ian, Burnell Ryan, Bai Libin, Gulati Anmol, Tanzer Garrett, Vincent Damien, Pan Zhufeng, Wang Shibo, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [50] Tong Shengbang, Brown II Ellis L, Wu Penghao, Woo Sanghyun, Iyer Adithya Jairam, Akula Sai Charitha, Yang Shusheng, Yang Jihan, Middepogu Manoj, Wang Ziteng, Pan Xichen, Fergus Rob, LeCun Yann, and Xie Saining. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In NeurIPS, 2024.
- [51] Tong Shengbang, Liu Zhuang, Zhai Yuexiang, Ma Yi, LeCun Yann, and Xie Saining. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024.
- [52] Touvron Hugo, Lavril Thibaut, Izacard Gautier, Martinet Xavier, Lachaux Marie-Anne, Lacroix Timothée, Rozière Baptiste, Goyal Naman, Hambro Eric, Azhar Faisal, Rodriguez Aurelien, Joulin Armand, Grave Edouard, and Lample Guillaume. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [53] Wang Junyang, Wang Yuhang, Xu Guohai, Zhang Jing, Gu Yukai, Jia Haitao, Wang Jiaqi, Xu Haiyang, Yan Ming, Zhang Ji, and Sang Jitao. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
- [54] Wang Yuxuan, Wang Yueqian, Zhao Dongyan, Xie Cihang, and Zheng Zilong. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models. arXiv preprint arXiv:2406.16338, 2024.
- [55] Wei Jason, Bosma Maarten, Zhao Vincent Y, Guu Kelvin, Yu Adams Wei, Lester Brian, Du Nan, Dai Andrew M, and Le Quoc V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- [56] Wu Haoning, Li Dongxu, Chen Bei, and Li Junnan. Longvideobench: A benchmark for long-context interleaved video-language understanding. In NeurIPS, 2024.
- [57] Xu Hu, Ghosh Gargi, Huang Po-Yao, Okhonko Dmytro, Aghajanyan Armen, Metze Florian, Zettlemoyer Luke, and Feichtenhofer Christoph. Videoclip: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, 2021.
- [58] Xu Lin, Zhao Yilin, Zhou Daquan, Lin Zhijie, Ng See Kiong, and Feng Jiashi. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.
- [59] Yang Dongjie, Huang Suyuan, Lu Chengqiang, Han Xiaodong, Zhang Haoxin, Gao Yan, Hu Yao, and Zhao Hai. Vript: A video is worth thousands of words. In NeurIPS, 2024.
- [60] Yin Shukang, Fu Chaoyou, Zhao Sirui, Xu Tong, Wang Hao, Sui Dianbo, Shen Yunhang, Li Ke, Sun Xing, and Chen Enhong. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045, 2023.
- [61] Yu Qifan, Li Juncheng, Wei Longhui, Pang Liang, Ye Wentao, Qin Bosheng, Tang Siliang, Tian Qi, and Zhuang Yueting. Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. In CVPR, 2024.
- [62] Yu Tianyu, Yao Yuan, Zhang Haoye, He Taiwen, Han Yifeng, Cui Ganqu, Hu Jinyi, Liu Zhiyuan, Zheng Hai-Tao, Sun Maosong, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In CVPR, 2024.
- [63] Zhai Xiaohua, Mustafa Basil, Kolesnikov Alexander, and Beyer Lucas. Sigmoid loss for language image pre-training. In ICCV, 2023.
- [64] Zhang Boqiang, Li Kehan, Cheng Zesen, Hu Zhiqiang, Yuan Yuqian, Chen Guanzheng, Leng Sicong, Jiang Yuming, Zhang Hang, Li Xin, Jin Peng, Zhang Wenqi, Wang Fan, Bing Lidong, and Zhao Deli. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.
- [65] Zhang Jiacheng, Jiao Yang, Chen Shaoxiang, Chen Jingjing, and Jiang Yu-Gang. Eventhallusion: Diagnosing event hallucinations in video llms. arXiv preprint arXiv:2409.16597, 2024.
- [66] Zhang Yuanhan, Li Bo, Liu Haotian, Lee Yong Jae, Gui Liangke, Fu Di, Feng Jiashi, Liu Ziwei, and Li Chunyuan. LLaVA-NeXT: A strong zero-shot video understanding model, 2024.
- [67] Zhao Zhiyuan, Wang Bin, Ouyang Linke, Dong Xiaoyi, Wang Jiaqi, and He Conghui. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.
- [68] Zhou Luowei, Xu Chenliang, and Corso Jason J. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
- [69] Zhou Yiyang, Cui Chenhang, Rafailov Rafael, Finn Chelsea, and Yao Huaxiu. Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411, 2024.
- [70] Zhu Bin, Lin Bin, Ning Munan, Yan Yang, Cui Jiaxi, Wang HongFa, Pang Yatian, Jiang Wenhao, Zhang Junwu, Li Zongwei, Zhang Wancai, Li Zhifeng, Liu Wei, and Yuan Li. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In ICLR, 2024.
- [71] Zhu Deyao, Chen Jun, Shen Xiaoqian, Li Xiang, and Elhoseiny Mohamed. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.