Abstract
Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.
Subject terms: Interdisciplinary studies, Data acquisition, Databases
Background & Summary
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language understanding, reasoning, and generation across general domains1–4. However, their effective application to scientific domains remains hindered by the unique characteristics of scientific knowledge, such as dense technical terminology, implicit assumptions, complex interdependencies, and multimodal data representations, requiring deeper, more structured context understanding5,6. While LLMs benefit from in-context learning and retrieval-augmented paradigms that allow them to leverage external knowledge without fine-tuning7–9, these advantages are contingent on the model’s ability to process long, noisy inputs and accurately identify, integrate, and reason over relevant information10,11. In scientific contexts, where precision and traceability are critical, models must go beyond simple fact retrieval to perform nuanced tasks such as detecting missing or ambiguous information, reconciling conflicting evidence, and drawing context-grounded inferences.
To assess these capabilities, several benchmarks have been developed for evaluating context understanding in long or domain-specific settings. LongICLBench11 evaluates in-context learning across tasks like relation extraction and NER, while LongBench12 and its successor LongBench-V213 offer cross-domain evaluations in law, finance, and code, focusing on identification and inference over lengthy texts. RepoQA10 targets codebase-level understanding, emphasizing real-world software documentation navigation. In the scientific domain, datasets like ChemLit-QA14 provide question-answer-context triplets in chemistry, and CHEMRAG-BENCH15 evaluates retrieval-augmented generation by dynamically sourcing chemical contexts. Despite these efforts, existing benchmarks often focus narrowly on single domains (e.g., chemistry), rely solely on unstructured text, and emphasize direct question answering, offering limited coverage of the full spectrum of scientific reasoning skills. As summarized in Table 1, current datasets fall short in three key dimensions: (1) disciplinary scope, with most confined to one or a few domains; (2) data modality diversity, typically limited to plain text; and (3) evaluative depth, often omitting critical competencies such as information-absence detection and multi-source integration. None comprehensively support structured tables, knowledge graphs, and unstructured text within a unified framework across multiple scientific disciplines.
Table 1.
Comparison of SciCUEval with existing benchmark datasets.
| Datasets | Contexts | Domains | Data Modalities | Question Types | Evaluation Competencies | # Instances |
|---|---|---|---|---|---|---|
| LongICLBench11 | ✓ | General | Text | QA | Identification | 2,618 |
| LongBench12 | ✓ | General, Code | Text | QA | Identification, Integration | 4,750 |
| LongBench V213 | ✓ | General, Law, Finance | Text | MCQ | Identification, Integration, Inference | 503 |
| RGB40 | ✓ | General | Text | QA | Identification, Detec., Integration, Inference | 1,000 |
| ChemLit-QA14 | ✓ | Chemistry | Text | QA | Identification, Detec., Inference | 1,054 |
| CHEMRAG-BENCH15 | × | Chemistry | Text | QA, MCQ | Identification, Inference | 1,932 |
| SciCUEval | ✓ | Comprehensive Science | Text, Table, KG | QA, MCQ, T/F, CC | Identification, Detec., Integration, Inference | 11,343 |
Question Types: QA (Question Answering), MCQ (Multiple Choice Question), T/F (True/False Question), and CC (Cloze Completion).
To bridge this gap, we introduce SciCUEval, a comprehensive benchmark designed to rigorously evaluate LLMs’ scientific context understanding under realistic conditions. As shown in Fig. 1, SciCUEval spans ten sub-datasets across biology, chemistry, physics, biomedicine, and materials science, sourced from high-quality scientific literature and databases. It incorporates three core data modalities (unstructured text, structured tables, and semi-structured knowledge graphs) reflecting the heterogeneous nature of real-world scientific data16–18. The dataset supports four question types: open-ended QA, multiple-choice (MCQ), true/false (T/F), and cloze completion (CC), enabling evaluation across four essential competencies: (1) Relevant Information Identification, (2) Information-Absence Detection, (3) Multi-source Information Integration, and (4) Context-Aware Inference.
Fig. 1.
Overview of the SciCUEval dataset. It spans five scientific domains, supports three data modalities (structured tables, knowledge graphs, and unstructured text), and includes four question types. Data are collected from high-quality scientific sources. The dataset enables evaluation across four key competencies: (1) relevant information identification, (2) information-absence detection, (3) multi-source information integration, and (4) context-aware inference.
By cohesively unifying breadth of domain coverage, diversity of data modalities, and depth of reasoning evaluation, SciCUEval offers a uniquely holistic testbed for scientific context understanding in LLMs. With over 10K carefully curated instances, it provides a robust and scalable basis for evaluating both generalization and specialization in scientific LLMs. The contributions of this paper are summarized as follows:
Establishing a scientific context understanding benchmark: We establish a benchmark to evaluate context understanding capabilities of LLMs in scientific domains, serving as a standardized evaluation suite for assessing LLMs’ capabilities in identifying, detecting, integrating, and reasoning over scientific contexts.
Constructing a diverse set of domain-specific context understanding datasets: We construct ten sub-datasets across multiple disciplines, encompassing various data modalities and a wide range of question types to ensure comprehensive evaluation.
Extensive evaluation and analysis of LLMs: We systematically evaluate and analyze the performance of various state-of-the-art LLMs on SciCUEval, highlighting their strengths and limitations, and offering insights for improvement.
Methods
This section presents the dataset construction process in SciCUEval, which involves formulating evaluation competencies, collecting scientific data, generating questions and answers, and conducting rigorous verification.
Evaluation Competencies
We formulate four capabilities essential for evaluating LLMs in scientific contexts:
Relevant Information Identification: LLMs must effectively distinguish between relevant information and extraneous noise within complex scientific contexts. In real-world scenarios, scientific data often contains contextually related but non-essential information. Given a scientific question q and a context C = {c1, c2, …, cn}, where exactly one entry c* ∈ C contains all and only the information necessary to answer q, and all other entries in C\{c*} are irrelevant, a model demonstrates Relevant Information Identification if its answer depends solely on c* and remains unchanged regardless of the presence or content of the other entries.
Information-absence Detection: The ability to abstain from responding when all contextual data is irrelevant or unreliable. Scientific queries often require precise evidence. Given a scientific question q and a context C in which none of the entries contain sufficient or reliable evidence to support a factually correct answer to q, a model demonstrates Information-absence Detection if it refrains from generating an answer and instead outputs a rejection statement (e.g., “I cannot answer the question due to insufficient information in the data.”).
Multi-source Information Integration: Scientific queries often require synthesizing data from multiple sources. Given a scientific question q and a context C, if answering q requires combining evidence from two or more distinct entries ci, cj, ⋯ ∈ C such that no single entry alone suffices to derive the correct answer, a model demonstrates Multi-source Information Integration if it aggregates and compares information across these contextual segments to generate a precise and contextually grounded answer.
Context-aware Inference: The capability to perform logical inference based on context. Given a scientific question q and a context C, if the correct answer to q is not explicitly stated in any entry of C but can be logically inferred through scientific reasoning based on the relationships, patterns, or implicit knowledge contained in C, a model demonstrates Context-aware Inference if it derives a factually correct and domain-consistent conclusion that follows from C under the principles of scientific logic.
These four competencies form the foundation of SciCUEval and provide a systematic evaluation framework for context understanding in scientific domains.
Source Data Collection
Scientific Domains
To evaluate LLMs in scientific contexts comprehensively, we curate data from diverse scientific domains, including Biology, Chemistry, Physics, Biomedicine, and Materials Science. These disciplines are fundamental to modern science, encompassing a wide range of knowledge from theoretical principles to experimental data, ensuring a broad and representative assessment of long-context understanding capabilities in scientific applications.
Data Modalities
To support a broad evaluation of scientific context understanding capabilities, we consider three distinct data modalities: (1) Unstructured Text, (2) Structured Tables, and (3) Semi-structured Knowledge Graphs and Ontologies. Each modality presents unique challenges, enabling a holistic assessment of LLMs’ retrieval, synthesis, reasoning, and integration capabilities in scientific domains. Specifically, unstructured text corpora consist of scientific literature, allowing LLMs to retrieve, synthesize, and infer domain knowledge from textual sources. We collect thousands of recent research papers and experimental protocols from open-access repositories such as arXiv. Structured tables contain numerical and categorical data, testing LLMs’ capacity to interpret structured knowledge, recognize contextual dependencies, and perform quantitative reasoning. We collect nuclear data from IAEA19, material properties from the Materials Project20, and molecular and protein properties from PubChem21. Knowledge graphs and ontologies encode scientific knowledge as interconnected entities and relational networks. Although the two are distinct in their formal definitions, we integrate both to rigorously assess LLMs’ abilities in handling complex relational inference, hierarchical knowledge traversal, and cross-domain knowledge synthesis. We collect well-established scientific semi-structured databases, including Gene Ontology22 for gene-function relationships, HIPPIE23 for protein-protein interactions, PharmKG24 for drug-target interactions, and PrimeKG25 for clinical entity relationships.
Data Generation
Building on the collected source data, we construct corresponding datasets tailored to assess the proposed four competencies. The data generation pipeline is illustrated in Fig. 2, which involves (1) question generation, (2) noise injection, and (3) quality control.
Fig. 2.
Illustration of data generation pipeline in SciCUEval, mainly consisting of question generation, noise injection, and quality control.
Question Generation
We first sample a subset of texts, table rows, or KG triples from the full databases:

{d1, d2, …, dm} = ϕ(D),  (1)

where D denotes the original large-scale scientific data source, ϕ represents the sampling operation (e.g., random sampling), and each di corresponds to a single atomic data record or a small cluster of closely related entries that collectively convey coherent information. Given each sampled data unit di, we then apply a question generation process to produce a corresponding question-answer pair (qi, ai). This process is formulated as:

(qi, ai) = fLLM(p ⊕ di),  (2)

where p is a carefully designed textual prompt, ⊕ denotes simple string concatenation, and fLLM denotes a large language model that processes this combined input to generate both a question qi and its correct answer ai. The LLM is guided by the prompt to produce questions that are not only grounded in di but also exhibit semantic diversity and contextual relevance.
For each evaluation competency, we craft the prompt to ensure the questions are reasonable and aligned with the required competencies. The detailed prompts are provided in Supplementary Information.
These generated questions span various formats, including open-ended QA, multiple-choice, cloze completion, and true/false questions, offering a robust assessment of context understanding abilities.
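For illustration, a minimal sketch of the generation step in Eq. (2) is shown below. It assumes an OpenAI-compatible chat API; the prompt wording and the JSON output format are illustrative placeholders rather than the exact prompts used for SciCUEval (those are given in the Supplementary Information).

```python
# Minimal sketch of (q_i, a_i) = f_LLM(p ⊕ d_i), assuming an OpenAI-compatible
# chat API. The prompt text below is illustrative, not the paper's actual prompt.
import json
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = (  # illustrative prompt p targeting one competency
    "Based only on the scientific record below, write one question and its "
    "correct answer. Return JSON with keys 'question' and 'answer'.\n\nRecord:\n"
)

def generate_qa(data_unit: str, model: str = "gpt-4o") -> dict:
    # p ⊕ d_i: concatenate the prompt with the sampled data unit
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GEN_PROMPT + data_unit}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```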
Noise Injection
Following question generation, we extract noisy information from the source data and inject it into the context. Specifically, we inject semantically similar yet unrelated entries into the context using an embedding-based similarity search. Formally, each sample before noise injection is denoted as xi = (qi, ai, di). To select distractor entries, we first compute embeddings for all candidate entries in the dataset using Sentence-BERT26:
ej = fS-BERT(dj),  (3)

where fS-BERT denotes the embedding function and ej is the embedding vector for entry dj. We then employ the cosine similarity to efficiently retrieve the Top-k entries most similar to the selected entry di:

Ni = Top-k_{dj ≠ di} sim(ei, ej),  (4)

where sim(·, ·) denotes the cosine similarity between embedding vectors. The final sample after noise injection is represented as (qi, ai, di, Ni), where Ni contains the k selected distractor entries used to augment the context. We sample k ∈ [200, 300] for structured tables and KGs, and set k = 5 for unstructured text. To inject a more challenging and realistic noise distribution, we also employ a hybrid strategy combining embedding similarity with N-gram overlap filtering. This ensures that the injected noise is semantically related (confusing to the model) but lexically distinct (factually unrelated). First, we compute embeddings using Sentence-BERT and retrieve the Top-K candidates based on cosine similarity:

Ccand = Top-K_{dj ≠ di} sim(ei, ej).  (5)

For each candidate c ∈ Ccand, we compute the normalized N-gram overlap for n ∈ {1, 2, 3}:

overlapn(c, di) = |ngramsn(c) ∩ ngramsn(di)| / min(|ngramsn(c)|, |ngramsn(di)|).  (6)

We define the maximum overlap score as:

smax(c, di) = max_{n ∈ {1, 2, 3}} overlapn(c, di).  (7)

We accept a candidate c as a valid noise entry only if the maximum overlap is below a threshold τ (set to 0.1):

smax(c, di) < τ.  (8)
From the filtered set, we randomly select k entries to form the final noisy context. To validate this approach, we conducted an ablation study (Supplementary Table 3) comparing this hybrid method against a cosine-only baseline. Through this approach, the injected noise closely mimics the type of confusing or misleading information that LLMs may encounter in practice, ensuring the benchmark dataset remains both challenging and realistic.
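A minimal sketch of this hybrid selection procedure (Eqs. (3)-(8)) is given below, assuming the entries are plain strings. The Sentence-BERT checkpoint name, the Top-K value, and the min-based n-gram normalization are illustrative assumptions rather than the exact configuration used for every sub-dataset.

```python
# Minimal sketch of hybrid noise injection: semantic Top-K retrieval followed by
# n-gram overlap filtering (tau = 0.1), then random selection of k distractors.
import random
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def ngrams(text: str, n: int) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def max_ngram_overlap(candidate: str, source: str, ns=(1, 2, 3)) -> float:
    scores = []
    for n in ns:
        c_ngrams, s_ngrams = ngrams(candidate, n), ngrams(source, n)
        if not c_ngrams or not s_ngrams:
            continue
        scores.append(len(c_ngrams & s_ngrams) / min(len(c_ngrams), len(s_ngrams)))
    return max(scores) if scores else 0.0

def inject_noise(source_entry, candidate_entries, k=5, top_K=50, tau=0.1):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed S-BERT checkpoint
    src_emb = model.encode([source_entry])
    cand_embs = model.encode(candidate_entries)
    sims = cosine_similarity(src_emb, cand_embs)[0]
    # Top-K most similar candidates (semantically confusing)
    ranked = sorted(range(len(candidate_entries)), key=lambda i: -sims[i])[:top_K]
    # Keep only lexically distinct candidates (low n-gram overlap with the source)
    filtered = [candidate_entries[i] for i in ranked
                if max_ngram_overlap(candidate_entries[i], source_entry) < tau]
    # k = 5 for text; k in [200, 300] would be used for tables and KGs
    return random.sample(filtered, min(k, len(filtered)))
```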
Quality Control
To maintain the rigor of the constructed dataset, we implement a two-stage verification process to ensure data quality:
(1) LLM as a Judge: We use advanced LLMs (e.g., GPT-4o) as automated evaluators to verify that each generated answer is both extractable and logically deducible from the relevant context, ensuring factual consistency and relevance.
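A minimal sketch of this automated check is shown below, assuming an OpenAI-compatible chat API; the judging prompt wording is illustrative rather than the exact prompt used for SciCUEval.

```python
# Minimal sketch of the LLM-as-a-judge verification step (illustrative prompt).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Given the context, question, and answer below, reply with 'Yes' if the answer "
    "can be extracted or logically deduced from the context, otherwise reply 'No'.\n"
)

def judge_instance(context: str, question: str, answer: str, model: str = "gpt-4o") -> bool:
    content = f"{JUDGE_PROMPT}\nContext:\n{context}\n\nQuestion: {question}\nAnswer: {answer}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")  # keep only instances judged as supported
```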
(2) Human Expert Evaluation: To further ensure the quality and accuracy of the generated data, we subjected the data that passed the initial LLM validation to manual review by five PhD-level researchers with strong STEM backgrounds. These experts were tasked with thoroughly evaluating each instance based on the following three criteria:
Whether the question effectively tests the intended competency, ensuring that it is aligned with the targeted skill or knowledge domain and accurately reflects the underlying construct it aims to assess.
Whether the question is expressed clearly and logically, such that its wording is unambiguous, coherent, and easily understood by both human evaluators and automated systems, thereby minimizing potential misinterpretations
Whether the given contexts fully support the given answer and is factually correct, which requires that the answer not only directly derives from or can be logically inferred based on the supporting materials, but also adheres to facts and scientific evidence. Together, these criteria are designed to ensure the evaluation process’s validity, clarity, and reliability.
Only instances that received “Yes” for all three criteria were accepted. After the experts reviewed all instances, 90.83% were found to meet the required high-quality standards. Reviewers were compensated based on the number of questions reviewed, at a rate of $30 per 100 questions, totaling $3,300 for roughly 11K questions. The entire review process took one week.
Data Records
Dataset Overview
Based on the data collection, generation, and quality control processes described above, we construct the final SciCUEval dataset, encompassing ten distinct sub-datasets (two unstructured text datasets, four structured table datasets, and four knowledge graph datasets), covering diverse scientific fields. Each sub-dataset contains approximately a thousand high-quality questions, leading to a total of 11,343 questions across the entire dataset. An overview of the dataset composition is presented in Table 2, summarizing the scientific data source, modality, and question distribution for each sub-dataset.
Table 2.
Statistics of the SciCUEval dataset, which comprises ten sub-datasets derived from diverse scientific data.
| Sub-dataset | Domain | Source | Modality | # Info. Ident. | # Abs. Detec. | # Info. Integ. | # Con. Infer. | # Total |
|---|---|---|---|---|---|---|---|---|
| MatText | Materials | arXiv | Text | 216 | 146 | 222 | 356 | 940 |
| BioText | Biology | Biorxiv | Text | 236 | 97 | 318 | 317 | 968 |
| MatTab | Materials | Material Project | Table | 299 | 150 | 287 | 200 | 936 |
| IaeaTab | Physics | IAEA | Table | 442 | 222 | 286 | 180 | 1130 |
| ProtTab | Biology | Pubchem | Table | 496 | 249 | 327 | 180 | 1252 |
| MolTab | Chemistry | Pubchem | Table | 516 | 259 | 350 | 180 | 1305 |
| GoKG | Biology | Gene Ontology | KG | 507 | 254 | 239 | 180 | 1180 |
| HipKG | Biology | HIPPIE | KG | 470 | 236 | 319 | 140 | 1165 |
| PhaKG | Biomedicine | PharmKG | KG | 512 | 256 | 281 | 168 | 1217 |
| PriKG | Biomedicine | PrimeKG | KG | 410 | 205 | 382 | 253 | 1250 |
Dataset Format and Structure
The SciCUEval dataset is publicly available on figshare (10.6084/m9.figshare.29924687)27. It is released under the CC BY license, allowing broad reuse with attribution.
Each entry is formatted as a JSON object and includes the following standardized attributes:
question: A string containing the text of the question to be answered.
answer: A list containing the correct answer(s) to the question.
Source: A string specifying the name of the data source from which the question is derived.
URL: A string providing the link to the original source database.
database type: A string describing the modality of the database (e.g., Table, Knowledge Graph, Text).
domain: A string indicating the scientific domain to which the problem belongs.
Correct Context: A list of strings providing the relevant context entries necessary to determine the correct answer.
Context: A list of strings containing additional background information or dataset excerpts that may include both relevant and irrelevant entries.
id: A string uniquely identifying the problem instance in the dataset.
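For reference, a minimal sketch of loading and validating records against this schema is shown below; the file name and the assumption that each file holds a single JSON array of objects are illustrative, so adapt them to the actual figshare layout.

```python
# Minimal sketch of loading SciCUEval records and checking the attributes above.
import json

REQUIRED_KEYS = {
    "question", "answer", "Source", "URL", "database type",
    "domain", "Correct Context", "Context", "id",
}

def load_records(path: str) -> list:
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)  # assumed: one JSON array of question objects per file
    for record in records:
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Record {record.get('id')} is missing keys: {missing}")
    return records

# Example (hypothetical file name):
# records = load_records("MatTab.json")
```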
Technical Validation
In this section, we evaluate the performance of various LLMs on SciCUEval, and provide a thorough analysis of their capabilities in understanding scientific contexts.
Experimental Setup
Models
We select 18 advanced LLMs of different scales, including 3 proprietary models (GPT-4o2, Claude-3.5-Sonnet28, and GPT-4o-mini), 11 open-source general-purpose models (DeepSeek-V329, DeepSeek-R130, Qwen2.5-7B-Instruct4, Qwen3-8B (with explicit thinking)31, Llama3.1-8B-Instruct, Llama3.1-70B-Instruct3, Llama-4-Maverick-17B-128E-Instruct, Llama-4-Scout-17B-16E-Instruct32, Ministral-8B-Instruct33, GLM4-9B-Chat34, and Gemma2-9B-it35), and 4 open-source scientific-domain models (SciGLM-6B36, LlaSMol-Mistral-7B37, ChemLLM-7B-Chat38, and ChemDFM-v1.5-8B39). Llama3.1-70B-it, Llama4-Scout, and Llama4-Maverick are accessed via the NVIDIA NIM APIs. DeepSeek-R1, DeepSeek-V3, and the proprietary models are accessed via their official APIs. The remaining open-source models are deployed locally on a server equipped with two NVIDIA GeForce RTX 3090 GPUs. Detailed information on these models is shown in Table 3.
Table 3.
Overview of the LLMs assessed in our experimental framework.
| Model Name | Creator | Domain | #Parameters | Access | URL |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | General | undisclosed | Official API | https://chat.openai.com |
| GPT-4o-mini | OpenAI | General | undisclosed | Official API | https://chat.openai.com |
| Claude-3.5-Sonnet | Anthropic | General | undisclosed | Official API | https://claude.ai |
| DeepSeek-V3 | DeepSeek | General | 671B | Official API | https://www.deepseek.com |
| DeepSeek-R1 | DeepSeek | General | 671B | Official API | https://www.deepseek.com |
| Llama3.1-70B-it | Meta | General | 70B | NVIDIA NIM API | https://llama.meta.com/llama3 |
| Llama3.1-8B-it | Meta | General | 8B | Weights | https://llama.meta.com/llama3 |
| Llama4-Maverick | Meta | General | 400B(17B × 128 Experts) | NVIDIA NIM API | https://www.llama.com/models/llama-4/ |
| Llama4-Scout | Meta | General | 109B(17B × 16 Experts) | NVIDIA NIM API | https://www.llama.com/models/llama-4/ |
| Qwen2.5-7B-it | Alibaba | General | 7B | Weights | https://qwenlm.github.io/ |
| Qwen3-8B | Alibaba | General | 8B | Weights | https://qwenlm.github.io/ |
| GLM4-9B-Chat | Tsinghua&Zhipu | General | 9B | Weights | https://huggingface.co/THUDM/glm-4-9b-chat |
| Gemma2-9B-it | General | 9B | Weights | https://ai.google.dev/gemma | |
| Ministral-8B-it | Mistral | General | 8B | Weights | https://mistral.ai |
| ChemDFM-v1.5-8B | SJTU | Chemistry | 8B | Weights | https://github.com/OpenDFM/ChemDFM |
| SciGLM-6B | Tsinghua | Science | 6B | Weights | https://github.com/THUDM/SciGLM |
| LlaSMol-Mistral-7B | OSU | Chemistry | 7B | Weights | https://huggingface.co/osunlp/LlaSMol-Mistral-7B |
| ChemLLM-7B-chat | ShanghaiAILab | Chemistry | 7B | Weights | https://huggingface.co/AI4Chem/ChemLLM-7B-Chat |
Settings
To ensure a fair evaluation across all models, we adopt a unified prompting template that standardizes input formatting. Specifically, each input consists of a system prompt that specifies the question type and answer-format requirements, the context, and a question designed to assess one of the four core competencies in SciCUEval. Given that each question in SciCUEval has a deterministic answer, we adopt accuracy as the evaluation metric for all question types across the tasks of relevant information identification, multi-source information integration, and context-aware inference. For the task of information-absence detection, we use the rejection rate as the evaluation metric.
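A minimal sketch of these two metrics under this setup is given below; the exact-match normalization and the keyword-based refusal detection are simplifying assumptions, not the released evaluation scripts.

```python
# Minimal sketch of accuracy and rejection rate, assuming string predictions and
# gold answers given as lists of acceptable strings (as in the released schema).
def is_rejection(prediction: str) -> bool:
    # Heuristic: treat explicit refusals as rejections.
    keywords = ("cannot answer", "insufficient information")
    return any(k in prediction.lower() for k in keywords)

def accuracy(predictions, gold_answers) -> float:
    # Fraction of predictions matching any acceptable gold answer (exact match).
    correct = sum(
        any(p.strip().lower() == a.strip().lower() for a in answers)
        for p, answers in zip(predictions, gold_answers)
    )
    return correct / len(predictions)

def rejection_rate(predictions) -> float:
    # Share of information-absence items on which the model abstains.
    return sum(is_rejection(p) for p in predictions) / len(predictions)
```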
Overall Results
Table 4 shows the performance of 18 LLMs on SciCUEval across ten sub-datasets. The results highlight several important trends. First, models with explicit reasoning capabilities demonstrate clear advantages. The reasoning-augmented open-source model DeepSeek-R1 achieves the highest overall accuracy, outperforming both proprietary models (e.g., GPT-4o) and other general-purpose open-source models. Qwen3-8B with explicit thinking also performs strongly, ranking second among open-source models. This indicates that incorporating structured reasoning pathways, even without domain-specific pretraining, can significantly enhance performance in scientific tasks. Second, proprietary models such as GPT-4o and Claude-3.5-Sonnet maintain competitive performance, especially in unstructured text-based domains (e.g., BioText, MatText), benefiting from their superior language understanding and generalization capabilities (Table 5). Third, scientific-domain LLMs such as ChemDFM-v1.5-8B and SciGLM-6B exhibit substantially lower performance across all datasets. Although designed for scientific domains, these models tend to lack general reasoning capacity and robustness across modalities (Table 6). Fourth, there is a strong positive correlation between model size and effectiveness. Large-scale models (e.g., GPT-4o, Llama4-Maverick, and Llama3.1-70B) consistently outperform their smaller counterparts (e.g., GPT-4o-mini, Llama4-Scout, and Llama3.1-8B) across most domains.
Table 4.
Performance of LLMs across ten sub-datasets on SciCUEval.
| Models | MatTab | IaeaTab | MolTab | ProtTab | PhaKG | PriKG | HipKG | GoKG | BioText | MatText | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 68.79 | 56.55 | 55.79 | 52.64 | 55.71 | 54.80 | 68.50 | 74.32 | 79.03 | 64.57 | 61.52 |
| GPT-4o-mini | 40.71 | 38.85 | 46.67 | 44.57 | 40.59 | 52.64 | 65.20 | 73.14 | 79.24 | 65.00 | 54.57 |
| Claude-3.5-Sonnet | 48.48 | 42.03 | 67.91 | 52.22 | 50.94 | 45.96 | 75.78 | 84.07 | 58.06 | 61.49 | 59.20 |
| DeepSeek-R1 | 73.71 | 71.89 | 74.69 | 72.44 | 58.66 | 58.20 | 69.66 | 79.18 | 74.79 | 63.09 | 69.72 |
| Qwen3-8B | 63.14 | 59.20 | 70.80 | 69.33 | 55.16 | 54.48 | 74.68 | 73.98 | 69.73 | 55.11 | 64.69 |
| DeepSeek-V3 | 56.62 | 54.07 | 59.85 | 52.08 | 52.18 | 51.92 | 63.42 | 72.29 | 66.74 | 45.31 | 57.50 |
| Llama4-Maverick | 46.47 | 47.79 | 48.20 | 43.61 | 48.32 | 49.28 | 64.81 | 72.71 | 63.02 | 54.15 | 53.65 |
| Llama4-Scout | 48.93 | 47.70 | 46.90 | 46.17 | 39.77 | 48.08 | 59.57 | 66.27 | 61.88 | 48.51 | 51.16 |
| Llama3.1-70B-it | 38.25 | 39.73 | 44.44 | 41.29 | 44.70 | 44.00 | 59.31 | 70.17 | 66.53 | 51.91 | 49.80 |
| Qwen2.5-7B-it | 28.10 | 32.65 | 43.30 | 39.46 | 36.15 | 45.60 | 53.99 | 62.46 | 68.18 | 59.68 | 46.62 |
| GLM4-9B-Chat | 31.41 | 25.84 | 47.82 | 43.45 | 36.03 | 44.56 | 57.94 | 60.51 | 67.77 | 50.96 | 46.46 |
| Llama3.1-8B-it | 28.85 | 34.34 | 42.76 | 39.78 | 38.29 | 46.56 | 52.62 | 59.32 | 64.26 | 49.36 | 45.50 |
| Gemma2-9B-it | 32.91 | 32.21 | 42.91 | 37.22 | 37.39 | 50.48 | 56.57 | 57.29 | 37.77 | 29.67 | 42.21 |
| Ministral-8B-it | 23.08 | 19.12 | 35.56 | 37.38 | 22.76 | 37.92 | 48.51 | 52.88 | 48.14 | 45.32 | 37.58 |
| ChemDFM-v1.5-8B | 33.65 | 31.15 | 35.56 | 36.82 | 40.43 | 30.72 | 49.70 | 56.44 | 26.11 | 18.91 | 36.80 |
| SciGLM-6B | 11.86 | 11.50 | 17.70 | 14.94 | 19.56 | 20.88 | 21.46 | 28.31 | 44.17 | 31.35 | 21.58 |
| LlaSMol-Mistral-7B | 13.35 | 12.83 | 16.55 | 14.70 | 21.54 | 19.84 | 22.83 | 29.92 | 33.13 | 20.98 | 20.42 |
| ChemLLM-7B-chat | 3.42 | 6.02 | 8.81 | 8.15 | 13.45 | 5.92 | 5.15 | 15.51 | 39.94 | 22.67 | 12.16 |
Underline results indicate the best results among all models. Bold results indicate the best results in each category.
Table 5.
Evaluation results of LLMs across four competencies on SciCUEval.
| Models | Info. Ident. | Abs. Detec. | Info. Integ. | Con. Infer. | Overall |
|---|---|---|---|---|---|
| GPT-4o | 89.72 | 19.51 | 54.90 | 65.97 | 61.52 |
| GPT-4o-mini | 77.81 | 14.71 | 47.54 | 57.68 | 54.57 |
| Claude-3.5-Sonnet | 82.95 | 49.10 | 50.85 | 47.29 | 59.20 |
| DeepSeek-R1 | 94.13 | 11.75 | 72.78 | 79.44 | 69.72 |
| Qwen3-8B | 88.29 | 43.57 | 53.87 | 64.38 | 64.69 |
| DeepSeek-V3 | 90.80 | 6.05 | 49.80 | 60.53 | 57.50 |
| Llama4-Maverick | 77.61 | 7.16 | 54.16 | 58.52 | 53.65 |
| Llama4-Scout | 71.38 | 25.42 | 43.65 | 54.95 | 51.16 |
| Llama3.1-70B-it | 81.05 | 6.87 | 45.44 | 47.76 | 49.80 |
| Qwen2.5-7B-it | 69.92 | 9.02 | 42.95 | 50.34 | 46.62 |
| GLM4-9B | 71.48 | 2.53 | 50.78 | 43.82 | 46.46 |
| Llama3.1-8B-it | 75.34 | 5.88 | 41.47 | 39.99 | 45.50 |
| Gemma2-9B-it | 66.97 | 2.38 | 28.74 | 48.22 | 42.20 |
| Ministral-8B-it | 56.80 | 4.76 | 31.32 | 39.62 | 37.58 |
| ChemDFM-v1.5-8B | 45.49 | 19.31 | 22.23 | 46.40 | 36.80 |
| SciGLM-6B | 33.24 | 9.01 | 18.00 | 29.48 | 21.58 |
| LlaSMol-Mistral-7B | 31.96 | 6.83 | 14.63 | 26.59 | 20.42 |
| ChemLLM-7B-Chat | 20.29 | 4.09 | 16.85 | 7.57 | 12.16 |
Underline results indicate the best results among all models. Bold results indicate the best results in each category.
Table 6.
Evaluation results of LLMs across three modalities on SciCUEval.
| Models | Text | Table | KG | Overall |
|---|---|---|---|---|
| GPT-4o | 71.91 | 55.91 | 63.13 | 61.52 |
| GPT-4o-mini | 72.22 | 42.98 | 55.84 | 54.57 |
| Claude-3.5-Sonnet | 59.75 | 53.41 | 64.99 | 59.20 |
| DeepSeek-R1 | 69.00 | 73.19 | 66.64 | 69.72 |
| Qwen3-8B | 62.53 | 66.02 | 64.27 | 64.69 |
| DeepSeek-V3 | 56.18 | 55.68 | 59.77 | 57.50 |
| Llama4-Maverick | 58.65 | 46.51 | 58.54 | 53.65 |
| Llama4-Scout | 55.29 | 47.31 | 53.22 | 51.16 |
| Llama3.1-70B-it | 59.33 | 41.19 | 54.30 | 49.80 |
| Qwen2.5-7B-it | 69.93 | 36.58 | 49.38 | 46.24 |
| GLM4-9B-Chat | 59.49 | 37.94 | 49.46 | 46.46 |
| Llama3.1-8B-it | 56.92 | 37.08 | 49.06 | 45.50 |
| Gemma2-9B-it | 34.27 | 36.73 | 50.23 | 42.21 |
| Ministral-8B-it | 46.75 | 29.50 | 41.24 | 37.58 |
| ChemDFM-v1.5-8B | 22.92 | 34.44 | 44.05 | 36.80 |
| SciGLM-6B | 38.48 | 14.25 | 22.51 | 21.58 |
| LlaSMol-Mistral-7B | 27.74 | 14.49 | 23.45 | 20.42 |
| ChemLLM-7B-chat | 32.28 | 6.86 | 10.01 | 12.16 |
Underline results indicate the best results among all models. Bold results indicate the best results in each category.
Evaluation Results of Four Competencies
Relevant Information Identification
This competency measures a model’s ability to locate and select the correct pieces of information from the provided context. As shown in Fig. 3, DeepSeek-R1 leads all evaluated models, suggesting that explicit reasoning mechanisms effectively enhance factual grounding. DeepSeek-V3, GPT-4o, and Qwen3-8B also exhibit strong performance, showing the advantages of in-context retrieval capabilities. In contrast, scientific-domain LLMs exhibit notably weaker performance in identifying relevant contexts across diverse scenarios.
Fig. 3.
Performance of LLMs across four competencies on SciCUEval.
Information-absence Detection
This metric evaluates whether a model appropriately withholds an answer when the required information is absent. While Claude-3.5-Sonnet and Qwen3-8B demonstrate relatively high accuracy, most models struggle significantly in this competency, with scores below 20%. To investigate this, we conducted an ablation study on prompting strategies (Supplementary Table 4). The results reveal a stark contrast: while models possess the intrinsic capability to detect information absence (achieving high rejection rates when explicitly instructed to refuse), they exhibit a severe tendency to hallucinate under standard prompting conditions. This highlights the critical risk of “overconfidence” in current models, which may pose safety risks in scientific domains where users may not always provide defensive instructions.
Multi-source Information Integration
This task assesses a model’s ability to synthesize information from multiple entries to construct a complete answer. DeepSeek-R1 achieves the highest performance, followed by GPT-4o and Llama4-Maverick, suggesting that these models are better equipped to combine multiple data points into coherent and accurate answers. Among smaller open-source models, GLM4-9B shows a competitive score, even surpassing DeepSeek-V3 in this competency. However, scientific LLMs significantly lag behind, indicating that while these domain-specific models are adept at handling scientific text, they face challenges in effectively synthesizing information from diverse sources.
Context-aware Inference
This capability reflects a model’s ability to reason over contextually relevant information. DeepSeek-R1 achieves the highest performance, and GPT-4o and Qwen3-8B also perform well, indicating that large-scale models and those enhanced with explicit thinking benefit significantly in contextual reasoning tasks. In contrast, models like Claude-3.5-Sonnet and DeepSeek-V3 show moderate capabilities but fall behind on deeper inference. Scientific-domain models such as ChemLLM-7B-Chat and SciGLM perform poorly, indicating limited general reasoning capabilities despite domain specialization.
Evaluation Results of Three Modalities
Figure 4 shows the performance of LLMs across three modalities: Text, Table, and KG, highlighting their strengths and weaknesses in handling diverse scientific data formats. Overall, LLMs tend to perform best on the text modality, reflecting their strong natural language understanding and generation capabilities. Notably, some smaller models even exceed their average overall performance on text, indicating a relative maturity in handling unstructured text data. In the table modality, reasoning-augmented models demonstrate a clear advantage, suggesting that explicit reasoning mechanisms and the ability to process structured data significantly benefit table understanding. In contrast, general LLMs show weaker performance on tables, implying challenges in leveraging tabular structure with traditional language modeling approaches. Similarly, for KG data, models with reasoning enhancements again lead, reflecting their ability to leverage relational and graph-structured information effectively. Additionally, domain-specific scientific models consistently underperform across all three modalities compared to general-purpose or reasoning-augmented models.
Fig. 4.
Performance of LLMs across three modalities on SciCUEval.
Findings
Our experimental results highlight three key discrepancies in the performance of LLMs on scientific context understanding tasks, underscoring fundamental challenges that require further advancements.
Competency Discrepancy
The evaluation results reveal notable disparities across the four core competencies. While top-performing models exhibit relatively strong capabilities in identifying relevant information, they struggle with information-absence detection–the ability to abstain from answering when faced with unreliable or insufficient evidence. This suggests that models prioritize generating responses over ensuring accuracy, increasing the risk of hallucinations in scientific applications where factual correctness is critical. To address this, models should incorporate uncertainty quantification techniques, such as confidence-based rejection mechanisms and calibrated probability outputs, to enhance their ability to detect and reject misleading retrievals. Furthermore, reinforcement learning with human feedback and verification-based prompting strategies could help improve the model’s reliability in rejecting incorrect information.
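As a simple illustration of such a confidence-based rejection mechanism, the sketch below abstains whenever an answer's average token probability falls under a threshold; the confidence estimate (geometric-mean token probability) and the threshold value are illustrative assumptions, not a calibrated design.

```python
# Minimal sketch of confidence-based abstention from average token log-probabilities.
import math

REJECTION = "I cannot answer the question due to insufficient information in the data."

def answer_with_abstention(answer: str, token_logprobs, threshold: float = 0.5) -> str:
    # Geometric-mean token probability as a crude confidence estimate
    # (the threshold of 0.5 is an illustrative choice, not a tuned value).
    if not token_logprobs:
        return REJECTION
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return answer if confidence >= threshold else REJECTION
```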
Modality Discrepancy
LLMs exhibit relatively better performance on unstructured text compared to structured tables and KGs. This suggests that existing models rely heavily on linguistic patterns and semantic context rather than structured inference and multi-modal data integration. The weaker performance on tables and KGs indicates a bottleneck in structured data comprehension, where models struggle to extract, synthesize, and infer information effectively from structured and semi-structured data. To bridge this gap, models need improved cross-modal alignment, integrating structured data reasoning into their training paradigm. Techniques such as joint pretraining on text, tables, and graphs could enhance structured data understanding.
Specialized vs. General Model Discrepancy
Although scientific LLMs are explicitly designed for knowledge-intensive tasks, our evaluation shows that they often fail to outperform general-purpose models on our dataset. This suggests that current specialized models lack sufficient reasoning depth, robustness, and flexibility to fully leverage domain knowledge in complex scientific contexts. Their narrower training scope may limit generalization across diverse data modalities and reasoning challenges. To improve their contextual understanding, scientific models should incorporate targeted fine-tuning using curated scientific evidence and adopt domain-aware prompt engineering strategies. These approaches can help balance deep specialization with the adaptability required to tackle a broad range of scientific tasks, enhancing their effectiveness across diverse scenarios.
Acknowledgements
This work is funded by New Generation Artificial Intelligence - National Science and Technology Major Project (2025ZD0122801, H.C.), NSFC62301480 (K.D.), NSFC62302433 (Q.Z.), NSFCU23A20496 (Q.Z.), and Ant Group Research Fund (K.D.). The AI-driven experiments, simulations and model training were performed on the robotic AI-Scientist platform of Chinese Academy of Sciences.
Author contributions
J.Y. and Y.T. contributed equally to this work. J.Y., Y.T., and K.D. conceived the study and designed the method. J.Y. and Y.T. implemented the method, conducted the experiments, and performed the result analyses. K.F. and M.R. provided guidance and assistance in dataset construction. J.Y., Y.T., Q.Z., K.D., and H.C. wrote and revised the manuscript. K.D. and H.C. supervised the entire project. All authors wrote the paper, reviewed it, and approved the final paper.
Data availability
The complete SciCUEval dataset used in this study has been uploaded to figshare (10.6084/m9.figshare.29924687)27.
Code availability
The SciCUEval evaluation scripts for this study have been uploaded to GitHub (https://github.com/HICAI-ZJU/SciCUEval).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Keyan Ding, Email: dingkeyan@zju.edu.cn.
Huajun Chen, Email: huajunsir@zju.edu.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41597-026-06594-9.
References
- 1.Bai, J. et al. Qwen technical report. arXiv:2309.16609 (2023).
- 2.OpenAI, et al. GPT-4o System Card. arXiv:2410.21276 (2024).
- 3.Dubey, A. et al. The Llama 3 herd of models. arXiv:2407.21783 (2024).
- 4.Qwen, et al. Qwen2.5 Technical Report. arXiv:2412.15115 (2025).
- 5.Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv:1903.10676 (2019).
- 6.Mann, B. et al. Language models are few-shot learners. arXiv:2005.14165 (2020).
- 7.Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
- 8.Mialon, G. et al. Augmented language models: A survey. arXiv:2302.07842 (2023).
- 9.Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv:2402.13178 (2024).
- 10.Liu, J. et al. RepoQA: Evaluating long context code understanding. arXiv:2406.06025 (2024).
- 11.Li, T., Zhang, G., Do, Q. D., Yue, X. & Chen, W. Long-context LLMs struggle with long in-context learning. arXiv:2404.02060 (2024).
- 12.Bai, Y. et al. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv:2308.14508 (2023).
- 13.Bai, Y. et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv:2412.15204 (2024).
- 14.Wellawatte, G. P. et al. ChemLit-QA: A human evaluated dataset for chemistry RAG tasks. Machine Learning: Science and Technology 6, 020601 (2025).
- 15.Zhong, X. et al. Benchmarking Retrieval-Augmented Generation for Chemistry. arXiv:2505.07671 (2025).
- 16.Fang, X. et al. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding-A Survey. arXiv:2402.17944 (2024).
- 17.He, X. et al. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. Advances in Neural Information Processing Systems 37, 132876–132907 (2024).
- 18.Talmor, A. & Berant, J. The Web as a knowledge-base for answering complex questions. arXiv:1803.06643 (2018).
- 19.International Atomic Energy Agency (IAEA). Nuclear Data Services - ENSDF Query Form. Available at: https://www-nds.iaea.org/relnsd/NdsEnsdf/QueryForm.html.
- 20.The Materials Project. Available at: https://next-gen.materialsproject.org.
- 21.National Center for Biotechnology Information (NCBI). Available at: https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72.
- 22.Gene Ontology Consortium. Available at: https://geneontology.org/docs/download-ontology/.
- 23.Alanis-Lobato, G., Andrade-Navarro, M. A. & Schaefer, M. H. HIPPIE v2.0: Enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Research gkw985 (2016).
- 24.Zheng, S. et al. PharmKG: A dedicated knowledge graph benchmark for biomedical data mining. Briefings in Bioinformatics 22, bbaa344 (2021).
- 25.Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Scientific Data 10(1), 67 (2023).
- 26.Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv:1908.10084 (2019).
- 27.Ding, K. & Tang, Y. SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models. figshare 10.6084/m9.figshare.29924687 (2025).
- 28.Anthropic AI. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card (2024).
- 29.DeepSeek-AI. et al. DeepSeek-V3 Technical Report. arXiv:2412.19437 (2024).
- 30.Guo, D. et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 (2025).
- 31.Yang, A. et al. Qwen3 Technical Report. arXiv:2505.09388 (2025).
- 32.Meta. Llama 4: Leading Intelligence. Available at: https://www.llama.com/models/llama-4/. Accessed: 2025-05-19.
- 33.Jiang, A. Q. et al. Mistral 7B. arXiv:2310.06825 (2023).
- 34.Team GLM. et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793 (2024).
- 35.Gemma Team et al. Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118 (2024).
- 36.Zhang, D. et al. SciGLM: Training scientific language models with self-reflective instruction annotation and tuning. arXiv:2401.07950 (2024).
- 37.Yu, B., Baker, F. N., Chen, Z., Ning, X. & Sun, H. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv:2402.09391 (2024).
- 38.Zhang, D. et al. ChemLLM: A Chemical Large Language Model. arXiv:2402.06852 (2024).
- 39.Zhao, Z. et al. ChemDFM: Dialogue Foundation Model for Chemistry. arXiv:2401.14818 (2024).
- 40.Chen, J., Lin, H., Han, X. & Sun, L. Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence 38(16), 17754–17762 (2024).