Abstract
The rapid growth of short-video platforms has reshaped how individuals access health information, but it has also fueled the spread of misinformation and disinformation. Dry eye, a prevalent ocular surface disorder, provides a representative case for examining these challenges. Reliable and scalable methods are urgently needed to identify and mitigate misinformation risks in online health content. We proposed a framework employing Video Large Language Models (VideoLLMs) for automated evaluation of science popularization videos. Three representative VideoLLMs (VideoLLaMA3, QwenVL, and InternVL) were benchmarked using three established instruments: Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT-A/V), Global Quality Score (GQS), and Video Information and Quality Index (VIQI). A dataset of 185 Chinese-language videos on dry eye was collected from TikTok and independently annotated by two ophthalmologists. Agreement between VideoLLM-generated scores and expert ratings was quantified using the Intraclass Correlation Coefficient (ICC). Across most metrics, VideoLLMs demonstrated poor agreement with expert annotations (ICC < 0.40), except for the actionability dimension of PEMAT-A/V, where QwenVL and InternVL achieved ICCs of 0.50 and 0.43 with the experts, respectively. This work establishes the first benchmark of VideoLLMs for evaluating ophthalmic science popularization videos and reveals substantial limitations in the performance of current models, with agreement levels falling well short of practical acceptability. Rather than demonstrating readiness for deployment, our open-source framework serves as a reference tool for systematically assessing model behavior, highlighting existing gaps, and motivating further methodological improvements before VideoLLMs can be considered for automated evaluation or governance of medical video content.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-026-39444-0.
Keywords: Dry eye disease, Science popularization videos, Automatic quality assessment, Video large language model, Digital health misinformation
Subject terms: Computational biology and bioinformatics, Diseases, Health care, Mathematics and computing, Medical research
Introduction
Dry eye is a prevalent and multifactorial ocular surface disorder characterized by tear film instability, ocular discomfort, and visual disturbance, with potential damage to the ocular surface1,2. It imposes a substantial burden on quality of life and is associated with decreased work productivity, increased healthcare utilization, and significant psychosocial distress3,4. Recent epidemiological studies indicate that the global prevalence of dry eye ranges from 5 to 50%, affecting hundreds of millions of individuals worldwide5, with higher susceptibility among older adults, women6, and those with extensive screen exposure7,8. Effective patient education is crucial for early detection, symptom management, and adherence to preventive strategies, thereby limiting disease progression and preserving visual function9,10.
With the rapid proliferation of the Internet, an increasing number of patients and the general public seek health information online11. Short-video platforms such as TikTok have emerged as popular sources for health-related knowledge dissemination. However, the open nature of content creation facilitates the rapid spread of unverified or misleading information, far outpacing the capacity of ophthalmic professionals to conduct timely quality assessments12,13. Inaccurate or incomplete information may lead to delayed medical consultation, inappropriate self-treatment, and exacerbation of ocular surface damage.
Recent advances in Large Language Models (LLMs) offer new opportunities to address this quality control gap. LLMs exhibit remarkable abilities in language comprehension, knowledge retrieval, and reasoning, with successful applications across diverse domains including healthcare14,15. Their capacity for automated content analysis enables large-scale, cost-effective, and rapid assessment of online health content. Nevertheless, LLM outputs remain probabilistic in nature, and variability in their assessments raises concerns about reproducibility and reliability in medical contexts16,17. This underscores the urgent need for systematic evaluation and methodological refinement to ensure trustworthy application of LLMs in ophthalmic health content quality assessment.
In this study, we propose a novel framework that leverages Video Large Language Models (VideoLLMs)—a class of LLMs capable of directly processing and interpreting video content—for the automated quality assessment of science popularization videos. The overall schema of our study is presented in Fig. 1. Specifically, we applied the framework to a dataset of online medical science popularization videos on dry eye and evaluated the performance of three mainstream VideoLLMs across widely recognized video content assessment instruments. Comparative analysis with human ratings revealed that general-purpose VideoLLMs exhibited only weak agreement with expert assessments, underscoring the current limitations of off-the-shelf models in specialized medical content evaluation. This work aims to raise awareness of the quality of online ophthalmic educational content and contribute to standardizing the production and dissemination of high-quality eye health information.
Fig. 1.
Flow chart of data collection, annotation, and analysis in this study.
Methods
Data collection
Data collection was conducted on March 30, 2025, using the TikTok app (Mainland China, version 33.7.0). We searched with the keyword “dry eye disease” in Chinese under the platform’s default “comprehensive ranking” mode and retrieved the top 200 videos from the results page as the initial sample set. To minimize potential bias from the platform’s recommendation algorithm, the search was conducted using a newly registered account with no viewing history. This initial sample was rigorously screened according to pre-established inclusion and exclusion criteria.
The inclusion criterion required that the core content of the video clearly involve science popularization of dry eye, including its diagnosis, treatment plans, nursing measures, or relevant product information. The exclusion criteria were as follows: (1) the video was unrelated to dry eye; (2) the video was not in Chinese; (3) duplicate content (including multiple uploads by the same creator or identical content posted by different creators, in which case only the first occurrence was retained); (4) non-original content (e.g., videos reposted from other accounts or platforms); (5) incomplete account information (such as missing account name, verification status, or essential creator details); (6) videos primarily aimed at marketing products related to dry eye (including but not limited to product marketing live streams and their recordings).
Data annotation
After screening, a total of 185 valid video samples were retained for analysis. Two professional ophthalmologists from Shenzhen Eye Hospital independently evaluated these videos using three assessment instruments: the Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT-A/V)18,19, the Global Quality Score (GQS)20, and the Video Information and Quality Index (VIQI)21. The specific scoring criteria for each instrument are presented in Tables 1, 2 and 3, respectively.
Table 1.
PEMAT-A/V scoring standard. “Extremely short material” refers to videos no longer than 1 min, or multimedia materials containing 6 or fewer slides or screenshots.
| Domain | Item | Scoring |
|---|---|---|
| **Understandability** | | |
| Content | 1. The material clearly demonstrates its purpose | Disagree = 0, Agree = 1 |
| Word choice and style | 2. The material uses common, everyday language | Disagree = 0, Agree = 1 |
| | 3. Medical terms are used only to familiarize the audience with them; when used, medical terms are defined | Disagree = 0, Agree = 1 |
| | 4. The material uses the active voice | Disagree = 0, Agree = 1 |
| Organization | 5. The material segments or “chunks” information into short sections | Disagree = 0, Agree = 1, Extremely short material = N/A |
| | 6. Each section of the material has an informative header | Disagree = 0, Agree = 1, Extremely short material = N/A |
| | 7. The material presents information in a logical order | Disagree = 0, Agree = 1 |
| | 8. The material provides a summary | Disagree = 0, Agree = 1, Extremely short material = N/A |
| Layout and design | 9. The material uses visual cues (such as arrows, boxes, bullet points, bolding, larger fonts, or highlighting) to draw attention to key points | Disagree = 0, Agree = 1, Video = N/A |
| | 10. Text on the screen is easy to read | Disagree = 0, Agree = 1, No text or all text with voiceover = N/A |
| | 11. The material allows the user to hear the words clearly (e.g., not spoken too fast, no background noise) | Disagree = 0, Agree = 1, No voiceover = N/A |
| Use of visual aids | 12. The material uses clear, uncluttered illustrations and photographs | Disagree = 0, Agree = 1, No visual aids = N/A |
| | 13. The material uses simple tables with concise, clear row and column headings | Disagree = 0, Agree = 1, No tables = N/A |
| **Actionability** | | |
| | 14. The material clearly identifies at least one action the user can take | Disagree = 0, Agree = 1 |
| | 15. The material addresses the user directly when describing actions | Disagree = 0, Agree = 1 |
| | 16. The material breaks down any action into manageable, explicit steps | Disagree = 0, Agree = 1 |
| | 17. The material explains how to use charts, graphs, tables, or diagrams to take action | Disagree = 0, Agree = 1, No charts, graphs, tables, or diagrams = N/A |
Table 2.
GQS scoring standard.
| Level 1: Poor quality, poor flow of the site, most information missing, not at all useful for patients |
| Level 2: Generally poor quality and poor flow, some information listed but many important topics missing, of very limited use to patients |
| Level 3: Moderate quality, suboptimal flow, some important information is adequately discussed but others poorly discussed, somewhat useful for patients |
| Level 4: Good quality and generally good flow, most of the relevant information is listed, but some topics not covered, useful for patients |
| Level 5: Excellent quality and excellent flow, very useful for patients |
Table 3.
VIQI scoring standard.
| VIQI 1 (Information flow): Evaluate whether the presentation of information in the video is coherent and logically clear, whether the information is presented in a reasonable order, and whether the audience can easily follow the video content |
| VIQI 2 (Information Accuracy): Evaluate whether the information provided in the video is accurate, based on reliable medical evidence or authoritative sources, and whether there are any errors or misleading content |
| VIQI 3 (Quality): Evaluate the overall quality of the video, including the use of auxiliary tools such as static images, animations, community member interviews, video subtitles, and report summaries to enhance the presentation of information. Each tool used accounts for 1 point |
| VIQI 4 (Precision): Evaluate the degree of consistency between video titles and content, that is, whether the video content matches the information promised by the title and whether there are any discrepancies between the title and the content |
The PEMAT-A/V evaluates patient education materials in terms of understandability and actionability. It comprises 17 items divided into two sections: understandability (items 1–13) and actionability (items 14–17). Each item offers three response options: Agree (= 1), Disagree (= 0), and Not Applicable (N/A). Note that N/A does not indicate missing data, but rather serves as a specific response option for cases such as very short videos or the absence of certain visual elements. Each section score is expressed as a percentage, calculated by summing the number of “Agree” responses and dividing by the total number of applicable items (those answered “Agree” or “Disagree”). A higher score indicates that the corresponding video has greater understandability or actionability.
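To make this scoring arithmetic concrete, the following minimal Python sketch computes a PEMAT-A/V section score from item responses. The function name and data layout are our own illustration, not part of the instrument.

```python
# Minimal sketch of the PEMAT-A/V section scoring described above.
# Responses: 1 = Agree, 0 = Disagree, None = N/A (excluded from the denominator).
def pemat_score(responses: list[int | None]) -> float:
    """Return the PEMAT percentage score for one section."""
    applicable = [r for r in responses if r is not None]
    if not applicable:
        raise ValueError("No applicable items; the score is undefined.")
    return 100.0 * sum(applicable) / len(applicable)

# Example: understandability items 1-13 for a video without tables (item 13 = N/A)
understandability = [1, 1, 0, 1, 1, 0, 1, 1, None, 1, 1, 1, None]
print(f"Understandability: {pemat_score(understandability):.1f}%")  # 9/11 -> 81.8%
```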
The GQS assesses the quality of health information using a 5-point scale ranging from 1 (very poor) to 5 (excellent), based on informational quality, audience engagement, and overall usefulness. VIQI also uses a 5-point Likert scale, ranging from 1 (poor quality) to 5 (excellent quality), to evaluate four aspects of video content: information flow, information accuracy, overall quality (one point each for the use of still images, animation, community interviews, video subtitles, and a report summary), and precision (the alignment between the video title and the content).
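Similarly, the VIQI overall-quality item reduces to counting the supporting elements present, as in the brief sketch below; the element names and dictionary layout are assumptions made for illustration.

```python
# Illustrative computation of VIQI 3 (overall quality): one point per supporting
# element used in the video. The record layout is an assumption for this sketch.
def viqi_quality(elements: dict[str, bool]) -> int:
    """Count the supporting elements present (one point each)."""
    return sum(elements.values())

video_elements = {
    "still_images": True,
    "animation": False,
    "community_interviews": False,
    "subtitles": True,
    "report_summary": False,
}
print(viqi_quality(video_elements))  # 2
```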
VideoLLM framework and prompt design
VideoLLMs are well-suited for our evaluation task, as they can integrate visual, audio, and textual information for comprehensive video analysis. To reduce potential bias from relying on a single model, we selected three representative models for evaluation: VideoLLaMA3-7B22, Qwen2.5-VL-7B-Instruct23, and Sa2VA_InternVL2.5_8B24, hereafter referred to as VideoLLaMA, QwenVL, and InternVL, respectively. We uniformly adopted a sampling rate of 1 frame per second (1 fps), the default setting across the three models. Only visual information was used as model input in this study; audio information was not incorporated.
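As a concrete illustration of this preprocessing step, the sketch below samples frames at 1 fps with OpenCV before passing them to a model. The `model.generate(...)` call at the end is a hypothetical placeholder, since each of the three VideoLLMs exposes its own inference interface.

```python
# Sketch of uniform 1 fps frame sampling prior to VideoLLM inference, assuming
# OpenCV is installed. The model call at the end is a hypothetical placeholder.
import cv2

def sample_frames(video_path: str, fps_target: float = 1.0) -> list:
    """Return frames sampled at roughly `fps_target` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps_target
    step = max(int(round(native_fps / fps_target)), 1)  # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

frames = sample_frames("dry_eye_video.mp4")   # hypothetical file name
# answer = model.generate(frames, prompt)     # placeholder for the model-specific API
```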
For each assessment instrument (PEMAT-A/V, GQS, and VIQI), we designed structured prompts strictly following the scoring descriptions. These prompts, together with the corresponding videos, were input into our framework to obtain model-generated scores. The full content of prompts for all assessment instruments is detailed in Supplementary Section A.
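For illustration, a simplified GQS prompt can be assembled directly from the level descriptions in Table 2, as in the sketch below; the exact prompts used in our experiments are those given in Supplementary Section A.

```python
# Simplified illustration of a structured GQS prompt built from Table 2.
# The actual prompts used in the study are provided in Supplementary Section A.
GQS_LEVELS = {
    1: "Poor quality, poor flow, most information missing, not useful for patients",
    2: "Generally poor quality and flow, many important topics missing, very limited use",
    3: "Moderate quality, suboptimal flow, some topics adequately discussed, somewhat useful",
    4: "Good quality and flow, most relevant information listed, some topics not covered",
    5: "Excellent quality and flow, very useful for patients",
}

def build_gqs_prompt() -> str:
    criteria = "\n".join(f"Level {k}: {v}" for k, v in GQS_LEVELS.items())
    return (
        "You are evaluating a health education video about dry eye.\n"
        "Rate its global quality on the following 5-point scale:\n"
        f"{criteria}\n"
        "Respond with a single integer from 1 to 5."
    )
```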
Statistical analysis
To comprehensively evaluate the agreement between the model-generated and expert-provided scores, we used the Intraclass Correlation Coefficient (ICC), a reliability index that quantifies the level of agreement among multiple raters or measurement methods and is particularly suitable for repeated measures of continuous variables25. An ICC value between 0.75 and 1.00 indicates excellent agreement, 0.60–0.74 indicates good agreement, 0.40–0.59 indicates moderate agreement, and below 0.40 indicates poor agreement26. Following the guideline from Koo and Li27, we adopted a two-way mixed-effects model, ICC(3,1), to calculate absolute agreement:
$$\mathrm{ICC}(3,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}\,(MS_C - MS_E)}$$

where $MS_R$ represents the mean square for the subjects being rated, $MS_C$ represents the mean square for the raters, $MS_E$ represents the mean square error, $n$ is the number of subjects, and $k$ is the number of raters. Absolute agreement was preferred over consistency because our objective was to assess whether the model’s ratings were identical to those of the expert ophthalmologists, rather than merely exhibiting a systematic correlation. Furthermore, the validity of this analysis was ensured by verifying that the data met the necessary statistical assumptions, specifically the approximate normality of the residuals. All statistical analyses were performed using Python 3.10 and the SciPy toolkit.
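For transparency, the sketch below computes this absolute-agreement ICC from the two-way ANOVA mean squares using NumPy; the toy ratings are fabricated solely to demonstrate the calculation.

```python
# Minimal NumPy sketch of the absolute-agreement ICC defined above, computed from
# two-way ANOVA mean squares. The toy ratings are fabricated for illustration only.
import numpy as np

def icc_absolute(ratings: np.ndarray) -> float:
    """ratings: (n subjects x k raters) matrix of scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_r = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
    ms_c = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    ss_e = ((ratings - grand) ** 2).sum() - (n - 1) * ms_r - (k - 1) * ms_c
    ms_e = ss_e / ((n - 1) * (k - 1))                                 # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k / n * (ms_c - ms_e))

# Toy example: five videos each rated by an expert and a model
scores = np.array([[4, 3], [5, 5], [2, 3], [3, 3], [4, 4]], dtype=float)
print(round(icc_absolute(scores), 3))  # 0.8
```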
Ethics statement
In accordance with local legislation and institutional requirements, the experiment protocol of this study was approved by the ethics committee at Shenzhen Eye Hospital, Shenzhen Eye Medical Center, Southern Medical University (2025KYYJ023-01). The online science popularization videos were collected, accessed, and analyzed in accordance with the platform’s terms of use and all relevant institutional/national regulations. As no human participant was involved in this study, the informed consent was waived.
Results
Characteristics of human expert assessments
Based on the assessment conducted by the team of professional ophthalmologists at Shenzhen Eye Hospital, the collected science popularization videos on dry eye exhibited notable variability in overall quality. According to the PEMAT-A/V results, the videos achieved relatively high scores on the understandability dimension (mean = 80.64), indicating that most videos conveyed content clearly and followed a coherent logical structure. In contrast, performance on the actionability dimension was considerably lower (mean = 54.46), suggesting a lack of practical guidance for viewers. The mean GQS score was 3.57 (out of 5), reflecting a moderate overall quality level. Further assessment with the VIQI instrument revealed that the videos performed reasonably well in terms of information flow (mean = 3.68), informational accuracy (mean = 3.49), and informational precision (mean = 3.65). However, the dimension of overall quality received the lowest score (mean = 2.63), indicating limited integration of supporting elements such as animations, interviews, captions, or summary reports.
In addition, inter-rater reliability between the two raters (hereafter referred to as Researcher A and Researcher B) was assessed using ICC(3,1). Detailed ICC results for each evaluation dimension are provided in the Supplementary Section B. The ICC values for the PEMAT-A/V dimensions of understandability and actionability were 0.70 and 0.79, respectively; for the GQS, the ICC was 0.82. Regarding the VIQI instrument, the ICCs across the four dimensions were 0.65, 0.62, 0.68, and 0.69, respectively. All corresponding p-values were less than 0.001. According to commonly accepted benchmarks, these results indicate good agreement for all evaluated dimensions, with excellent agreement observed for two measures. The distributions of assessment results from two raters are presented in Figs. 2 and 3. Collectively, these findings support the reliability and robustness of the expert-generated evaluation data used in this study.
Fig. 2.
Comparison of human and model-generated scores for PEMAT-A/V. p-values correspond to the significance of the Intraclass Correlation Coefficients (ICC), testing the null hypothesis that the ICC is zero. All statistical tests were two-sided and p-values are uncorrected for multiple comparisons.
Fig. 3.
Comparison of human and model-generated scores for GQS and VIQI. p-values correspond to the significance of the Intraclass Correlation Coefficients (ICC), testing the null hypothesis that the ICC is zero. All statistical tests were two-sided and p-values are uncorrected for multiple comparisons.
Model performance analysis
We applied three VideoLLMs—VideoLLaMA3, QwenVL, and InternVL—within the proposed framework to generate model-based scores across all assessment instruments. These model-derived scores were then compared with expert ophthalmologists’ ratings using ICC(3,1) as a measure of agreement. The detailed results are summarized in Table 4, while Figs. 2 and 3 illustrate the score distributions and the significance levels of the ICC values. Overall, with the exception of the actionability dimension of PEMAT-A/V, none of the evaluated measures exceeded an ICC of 0.40, indicating poor agreement. This suggests that the current generation of VideoLLMs demonstrates limited performance in assessing the quality of science popularization videos on dry eye.
Table 4.
Intraclass correlation coefficient between human expert annotations and AI video understanding models in multidimensional ratings. Higher values indicate stronger agreement; values above 0.40 (moderate agreement) are highlighted in bold. U: understandability; A: actionability.
| Model | Researcher | PEMAT-A/V (U) | PEMAT-A/V (A) | GQS | VIQI I | VIQI II | VIQI III | VIQI IV |
|---|---|---|---|---|---|---|---|---|
| VideoLLaMA | A | 0.17 | 0.39 | 0.04 | 0.13 | 0.19 | 0.04 | 0.01 |
| VideoLLaMA | B | 0.24 | 0.34 | 0.00 | 0.18 | 0.18 | 0.03 | 0.11 |
| QwenVL | A | 0.05 | **0.50** | 0.27 | 0.08 | 0.10 | -0.13 | 0.03 |
| QwenVL | B | 0.11 | **0.44** | 0.27 | 0.09 | 0.03 | -0.05 | 0.03 |
| InternVL | A | 0.05 | 0.34 | 0.11 | 0.07 | 0.12 | -0.08 | 0.03 |
| InternVL | B | 0.16 | **0.43** | 0.11 | 0.13 | 0.27 | -0.10 | 0.15 |
For PEMAT-A/V, VideoLLaMA exhibited the strongest alignment with human experts on the understandability dimension, achieving statistical significance with both Researcher A and Researcher B (p < 0.05). InternVL also reached statistical significance with Researcher B, while QwenVL did not reach significance with either rater. On the actionability dimension, QwenVL and InternVL achieved moderate agreement with human raters, with ICC values of 0.50 and 0.43, respectively, while VideoLLaMA reached an ICC of 0.39. As shown in Fig. 2, an interesting trend emerged: VideoLLaMA’s mean scores consistently fell between those of Researcher A and Researcher B, whereas QwenVL tended to assign higher ratings than both experts and InternVL produced consistently lower scores. This suggests systematic tendencies of QwenVL toward overestimation and InternVL toward underestimation relative to human judgments.
For the Global Quality Score, only QwenVL demonstrated statistically significant agreement with both experts (p < 0.001), while VideoLLaMA and InternVL did not show comparable agreement. For the four VIQI dimensions, the results were mixed. In terms of VIQI I (information flow), VideoLLaMA achieved significant agreement with both raters (p < 0.05), while InternVL showed significant alignment with Researcher B only, and QwenVL did not align with either rater. For VIQI II (information accuracy), both VideoLLaMA and InternVL presented significant agreement with both experts (p < 0.05), whereas QwenVL again failed to achieve significance. In terms of VIQI III (quality), none of the models demonstrated significant agreement, and ICC values for InternVL and QwenVL were even negative. Finally, for VIQI IV (precision), only InternVL showed significant agreement with Researcher B (p < 0.05), but the ICC value was very low (0.15). This again highlights that, in the task of assessing the quality of science popularization videos, the performance of current VideoLLMs remains well below levels acceptable for practical use.
Discussion
Principal findings
This study represents the first systematic attempt to benchmark VideoLLMs for automated quality assessment of science popularization videos on dry eye. For a more comprehensive assessment, we adopted three video quality assessment instruments (PEMAT-A/V, GQS, and VIQI) that have been widely used in previous studies on medical video quality28–31. The proposed VideoLLM-based framework successfully adapted these instruments for automated evaluation, offering a reference methodology for applying multimodal LLMs to medical video quality assessment. However, agreement between model-generated scores and expert assessments was generally limited (ICC < 0.40). An exception was observed in the actionability dimension of PEMAT-A/V, where QwenVL and InternVL achieved moderate agreement with human raters (ICC = 0.50 and 0.43, respectively), while VideoLLaMA reached an ICC of 0.39.
To better understand the reasons underlying the observed findings, we explored potential explanations. All three VideoLLMs evaluated in this study were general-purpose models primarily trained on downstream tasks such as object recognition, anomaly detection, scene understanding, and temporal reasoning. None of these tasks directly align with the requirements of science popularization video quality assessment, which is likely the most critical factor contributing to their suboptimal performance. Furthermore, the training data for general-purpose VideoLLMs predominantly comprise natural and dynamic scenes in which information is conveyed through visual cues such as object movements, changes in shape, or interactions among entities. In contrast, science popularization videos are typically presented in relatively static settings, relying heavily on speech, subtitles, charts, and other static visual elements to deliver information. Current VideoLLMs focus primarily on visual modality processing and often lack robust audio processing capabilities; thus, when videos do not contain subtitles, a substantial portion of the informational content is inaccessible to the model. Even if optical character recognition modules are adopted to recognize the content of subtitles, most VideoLLMs perform frame sampling to reduce computational load, which can further lead to information loss. Moreover, the linguistic style of Chinese popular science videos often employs metaphors or analogies (e.g., describing dry eyes as “as parched as a desert”) to enhance audience engagement. While such figurative expressions are valuable in health communication, they may be misinterpreted by VideoLLMs as inaccurate or misleading statements, thereby reducing alignment with expert evaluations. Additionally, for the VIQI precision metric, which evaluates the consistency between a video’s title and its content, model performance was particularly poor. This limitation likely stems from the fact that VideoLLMs do not typically retrieve or process metadata such as video titles, thereby hindering their ability to make accurate judgments on this dimension. These interpretations are offered as plausible contributing factors based on the known characteristics of current general-purpose VideoLLMs. Note that they were not empirically validated as mechanisms within the scope of this study.
Implications
Despite the current limitations of VideoLLMs in reliably assessing video quality, the increasing epidemiological burden of dry eye disease highlights the urgent need for high-quality and accessible online health information, thereby motivating continued efforts to develop reliable and scalable approaches for content evaluation. The increasing trend in the prevalence of dry eye32–34 imposes a significant and multifaceted burden on patients and healthcare systems worldwide. A systematic review across Europe, North America, and Asia highlighted the substantial economic and humanistic costs of dry eye, emphasizing its growing public health impact35. Evidence from the United States confirmed that dry eye leads to considerable patient-reported burden, including impaired daily functioning and diminished well-being36. Similarly, studies have shown that dry eye substantially reduces quality of life37, with particularly severe consequences in individuals suffering from psychiatric or neurological comorbidities38. The disorder is also strongly associated with reduced work productivity: surveys from Saudi Arabia and international cohorts consistently report absenteeism, presenteeism, and diminished job performance attributable to dry eye39. Collectively, these findings point to a rising prevalence of dry eye and highlight the inability of traditional healthcare resources to fully mitigate its socioeconomic and psychosocial consequences.
Against this backdrop, the Internet has emerged as a critical channel for the dissemination of health information. Systematic reviews of YouTube and other platforms indicate that online media can enhance access to educational resources, foster patient engagement, and support health research11,40,41. At the same time, studies caution that health-related content on social media often suffers from poor quality, misinformation, or lack of regulation, which can negatively influence patient decision-making and delay appropriate care12,40,42. Therefore, to maximize the positive role of online science popularization videos while minimizing potential harms, it is essential to develop automated evaluation systems that can assist in reviewing video content. We hope that closer integration of advanced VideoLLMs with ophthalmic clinical practice will enable proactive quality control on real-world platforms. For example, TikTok and similar short-video platforms could incorporate AI-based pre-screening of medical content, or prioritize flagging videos with low actionability, such as those describing symptoms of dry eye without offering care recommendations; a hypothetical screening rule of this kind is sketched below. Such measures would help ensure the accuracy and reliability of health information, ultimately supporting more effective patient education and public health interventions.
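As a purely hypothetical illustration of such pre-screening, the rule below flags videos whose model-estimated PEMAT-A/V actionability falls below a chosen threshold; the cutoff value and record fields are assumptions, not an existing platform mechanism.

```python
# Hypothetical pre-screening rule: flag videos whose model-estimated PEMAT-A/V
# actionability percentage falls below a threshold. The threshold and record
# layout are illustrative assumptions, not an existing platform mechanism.
ACTIONABILITY_THRESHOLD = 50.0  # percent; assumed cutoff for human review

def flag_for_review(video_scores: list[dict]) -> list[str]:
    """Return IDs of videos whose estimated actionability is below the threshold."""
    return [
        v["video_id"]
        for v in video_scores
        if v["pemat_actionability"] < ACTIONABILITY_THRESHOLD
    ]

batch = [
    {"video_id": "v001", "pemat_actionability": 75.0},
    {"video_id": "v002", "pemat_actionability": 25.0},  # symptoms only, no guidance
]
print(flag_for_review(batch))  # ['v002']
```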
Contributions, limitations and future plans
This work makes several notable contributions. To the best of our knowledge, it represents the first study to apply VideoLLMs for the automated quality assessment of online medical science popularization videos, thereby bridging the gap between cutting-edge artificial intelligence technologies and digital health governance. By establishing an open-source framework, we provide a foundation for future research and create opportunities for a broader community of scholars and practitioners to advance this field. Moreover, we developed and released a manually annotated dataset of Chinese-language videos on dry eye that updates previously published collections43. Our findings are consistent with prior evidence indicating that the overall quality of online ophthalmic educational videos remains suboptimal.
Nevertheless, several limitations should be acknowledged. First, our experimental results demonstrate the limitations of current VideoLLMs in assessing the quality of medical science popularization videos; given their current performance, these models can serve only as a reference for future research and are not suitable for professional or educational guidance. Second, the scope of this study was restricted to videos related to dry eye. Other common ophthalmic conditions, such as keratitis, glaucoma, and cataract, were not included, which limits the generalizability of our findings across broader ophthalmic domains. Third, our analysis relied on a single frame sampling rate (1 fps) for video question answering, without exploring alternative sampling strategies, which may introduce sampling bias and limit the robustness of the findings. In addition, our dataset was limited to Chinese-language videos. Given that English, Spanish, and other languages are widely used in medical communication, the absence of multilingual data constrains the cross-cultural generalizability of our findings. Furthermore, the current version of our VideoLLM-based framework does not fully leverage audio information (e.g., speech features) or metadata (e.g., video titles, creator profiles). This exclusion stems from the architecture of the employed VideoLLMs, which process video content solely through visual information, bypassing audio tracks and metadata. Since instruments such as PEMAT-A/V, GQS, and VIQI incorporate verbal and contextual information in their evaluation criteria, the absence of these modalities constitutes a methodological limitation. This is most evident in VIQI Item 4 (Precision), which mandates an evaluation of the consistency between video titles and their visual content. Without access to the title from metadata, the model is structurally incapable of assessing this alignment or detecting discrepancies. Consequently, this inability to verify title-content consistency likely impairs model performance, resulting in the attenuated ICC values observed.
Future research will address these limitations in several ways. We plan to expand the dataset to encompass multiple ophthalmic diseases as well as multiple languages, in collaboration with international experts. Such an effort will not only strengthen the generalizability of our findings but also support the development of globally applicable evaluation systems. In addition, we aim to enhance our framework by incorporating audio analysis and metadata retrieval modules, as well as by designing refined prompting strategies tailored to medical education contexts. These improvements are expected to yield more reliable evaluations of online medical science popularization videos.
Conclusion
In this study, we proposed and evaluated a novel framework that applies VideoLLMs to the automated assessment of online science popularization videos on dry eye. Our results suggest that established quality assessment instruments (PEMAT-A/V, GQS, and VIQI) can be methodologically adapted for use with multimodal AI models, while also revealing the current limitations of general-purpose VideoLLMs in reliably approximating expert evaluations. By releasing an updated and annotated dataset of Chinese-language educational videos on dry eye and providing an open-source framework, we lay the groundwork for future advances in this emerging field. However, our results also reflect the limited generalizability of these findings, as contemporary VideoLLMs remain inadequate for fully automated assessment. While these efforts have the potential to standardize online medical science communication and improve the accessibility, reliability, and impact of digital health education, the current capabilities of VideoLLMs should be interpreted with caution.
Supplementary Information
Below is the link to the electronic supplementary material.
Author contributions
SZ: Formal analysis, Investigation, Methodology, Software, Writing – original draft. MH: Data curation, Investigation, Methodology, Writing – review & editing. JW: Data curation, Investigation, Methodology, Writing – review & editing. WY: Validation, Visualization. HF: Validation, Visualization. HY: Conceptualization, Supervision, Visualization, Writing – original draft. YX: Resources, Writing – review & editing.
Funding
This work was supported by the National Natural Science Foundation of China [grant numbers: 62571192, 82571272] and Basic and Applied Basic Research Foundation of Guangdong Province [grant number: 2025A1515011627].
Data availability
For academic use, the human quality assessment results can be obtained upon request from the corresponding author (Hanyi Yu, [yhydtc1@gmail.com](mailto:yhydtc1@gmail.com)). The collected video links are provided in the supplementary file. As a precaution against link failure, we have also uploaded the collected videos to a cloud-based data-sharing platform: https://data.mendeley.com/datasets/b4ny54tp4y/1.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Shiqi Zhou, Mingxue Huang and Jiawen Wei share first authorship.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Craig, J. P. et al. TFOS DEWS II definition and classification report. Ocul. Surf. 15, 276–283 (2017).
- 2. Golden, M. I., Meyer, J. J., Zeppieri, M. & Patel, B. C. Dry Eye Syndrome. In StatPearls (StatPearls Publishing, Treasure Island (FL), 2025).
- 3. Nichols, K. K. et al. Impact of dry eye disease on work productivity, and patients’ satisfaction with over-the-counter dry eye treatments. Invest. Ophthalmol. Vis. Sci. 57, 2975–2982 (2016).
- 4. Morthen, M. K. et al. The physical and mental burden of dry eye disease: A large population-based study investigating the relationship with health-related quality of life and its determinants. Ocul. Surf. 21, 107–117 (2021).
- 5. Stapleton, F. et al. TFOS DEWS II epidemiology report. Ocul. Surf. 15, 334–365 (2017).
- 6. Qian, L. & Wei, W. Identified risk factors for dry eye syndrome: A systematic review and meta-analysis. PLoS ONE 17, e0271267 (2022).
- 7. Mehra, D. & Galor, A. Digital screen use and dry eye: A review. Asia-Pac. J. Ophthalmol. 9, 491–497 (2020).
- 8. Cheng, W., Wu, H., Wang, Z. & Liang, L. Association between long-term green space exposure and dry eye in China. Asia-Pac. J. Ophthalmol. 14, 100165 (2025).
- 9. Verjee, M. A., Brissette, A. R. & Starr, C. E. Dry eye disease: Early recognition with guidance on management and treatment for primary care family physicians. Ophthalmol. Ther. 9, 877–888 (2020).
- 10. Rabie, E. A. E. G. A., ElRazkey, J. Y. & Ahmed, H. A. Empowering vision: The impact of nursing-led educational program on patients with dry eye syndrome. BMC Nurs. 23, 693 (2024).
- 11. Chen, J. & Wang, Y. Social media use for health purposes: Systematic review. J. Med. Internet Res. 23, e17917 (2021).
- 12. Do Nascimento, I. J. B. et al. Infodemics and health misinformation: A systematic review of reviews. Bull. World Health Organ. 100, 544 (2022).
- 13. Lan, S.-H., Mahmoud, S. & Franson, K. L. A narrative review on the impact of online health misinformation on patients’ behavior and communication. Am. J. Health Behav. 48, 564–572 (2024).
- 14. Patel, S. B. & Lam, K. ChatGPT: The future of discharge summaries? Lancet Digit. Health 5, e107–e108 (2023).
- 15. Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2, e0000198 (2023).
- 16. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
- 17. Bélisle-Pipon, J.-C. Why we need to be careful with LLMs in medicine. Front. Med. (Lausanne) 11, 1495582 (2024).
- 18. Shoemaker, S. J., Wolf, M. S. & Brach, C. Development of the Patient Education Materials Assessment Tool (PEMAT): A new measure of understandability and actionability for print and audiovisual patient information. Patient Educ. Couns. 96, 395–403 (2014).
- 19. PEMAT Tool for Audiovisual Materials (PEMAT-A/V). https://www.ahrq.gov/health-literacy/patient-education/pemat-av.html.
- 20. Bernard, A. et al. A systematic review of patient inflammatory bowel disease information resources on the World Wide Web. Am. J. Gastroenterol. 102, 2070–2077 (2007).
- 21. Lena, Y. & Dindaroğlu, F. Lingual orthodontic treatment: A YouTube™ video analysis. Angle Orthod. 88, 208–214 (2018).
- 22. Zhang, B. et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. Preprint at 10.48550/arXiv.2501.13106 (2025).
- 23. Bai, S. et al. Qwen2.5-VL technical report. Preprint at 10.48550/arXiv.2502.13923 (2025).
- 24. Yuan, H. et al. Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. Preprint at 10.48550/arXiv.2501.04001 (2025).
- 25. Shrout, P. E. & Fleiss, J. L. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
- 26. Cicchetti, D. V. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol. Assess. 6, 284–290 (1994).
- 27. Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
- 28. Morra, S. et al. YouTube™ as a source of information on bladder pain syndrome: A contemporary analysis. Neurourol. Urodyn. 41, 237–245 (2022).
- 29. Sezici, Y. L., Gediz, M. & Dindaroğlu, F. Is YouTube an adequate patient resource about orthodontic retention? A cross-sectional analysis of content and quality. Am. J. Orthod. Dentofac. Orthop. 161, e72–e79 (2022).
- 30. Albayrak, E. & Büyükçavuş, M. H. Does YouTube offer high-quality information? Evaluation of patient experience videos after orthognathic surgery. Angle Orthod. 93, 409–416 (2023).
- 31. Arikan, H. & Erol, E. Quality and reliability evaluation of YouTube® exercises content for temporomandibular disorders. BMC Oral Health 25, 301 (2025).
- 32. Dana, R. et al. Estimated prevalence and incidence of dry eye disease based on coding analysis of a large, all-age United States health care system. Am. J. Ophthalmol. 202, 47–54 (2019).
- 33. Siffel, C. et al. Burden of dry eye disease in Germany: A retrospective observational study using German claims data. Acta Ophthalmol. 98, e504–e512 (2020).
- 34. Gu, Q. et al. Trends in health service use for dry eye disease from 2017 to 2021: A real-world analysis of 369,755 outpatient visits. Transl. Vis. Sci. Technol. 13, 17 (2024).
- 35. McDonald, M., Patel, D. A., Keith, M. S. & Snedecor, S. J. Economic and humanistic burden of dry eye disease in Europe, North America, and Asia: A systematic literature review. Ocul. Surf. 14, 144–167 (2016).
- 36. Dana, R., Meunier, J., Markowitz, J. T., Joseph, C. & Siffel, C. Patient-reported burden of dry eye disease in the United States: Results of an online cross-sectional survey. Am. J. Ophthalmol. 216, 7–17 (2020).
- 37. Friedman, N. J. Impact of dry eye disease and treatment on quality of life. Curr. Opin. Ophthalmol. 21, 310–316 (2010).
- 38. Han, S. B., Yang, H. K., Hyon, J. Y. & Wee, W. R. Association of dry eye disease with psychiatric or neurological disorders in elderly patients. Clin. Interv. Aging 12, 785–792 (2017).
- 39. Binyousef, F. H. et al. Impact of dry eye disease on work productivity among Saudi workers in Saudi Arabia. Clin. Ophthalmol. 15, 2675–2681 (2021).
- 40. Madathil, K. C., Rivera-Rodriguez, A. J., Greenstein, J. S. & Gramopadhye, A. K. Healthcare information on YouTube: A systematic review. Health Inform. J. 21, 173–194 (2015).
- 41. Bour, C. et al. The use of social media for health research purposes: Scoping review. J. Med. Internet Res. 23, e25736 (2021).
- 42. Suarez-Lledo, V. & Alvarez-Galvez, J. Prevalence of health misinformation on social media: Systematic review. J. Med. Internet Res. 23, e17187 (2021).
- 43. Huang, M. et al. Assessing the quality of educational short videos on dry eye care: A cross-sectional study. Front. Public Health 13, 1542278 (2025).