Abstract
Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.
Subject terms: Interdisciplinary studies, Data acquisition, Databases
Background & Summary
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language understanding, reasoning, and generation across general domains1–4. However, their effective application to scientific domains remains hindered by the unique characteristics of scientific knowledge, such as dense technical terminology, implicit assumptions, complex interdependencies, and multimodal data representations, requiring deeper, more structured context understanding5,6. While LLMs benefit from in-context learning and retrieval-augmented paradigms that allow them to leverage external knowledge without fine-tuning7–9, these advantages are contingent on the model’s ability to process long, noisy inputs and accurately identify, integrate, and reason over relevant information10,11. In scientific contexts, where precision and traceability are critical, models must go beyond simple fact retrieval to perform nuanced tasks such as detecting missing or ambiguous information, reconciling conflicting evidence, and drawing context-grounded inferences.
To assess these capabilities, several benchmarks have been developed for evaluating context understanding in long or domain-specific settings. LongICLBench11 evaluates in-context learning across tasks like relation extraction and NER, while LongBench12 and its successor LongBench-V213 offer cross-domain evaluations in law, finance, and code, focusing on identification and inference over lengthy texts. RepoQA10 targets codebase-level understanding, emphasizing real-world software documentation navigation. In the scientific domain, datasets like ChemLit-QA14 provide question-answer-context triplets in chemistry, and CHEMRAG-BENCH15 evaluates retrieval-augmented generation by dynamically sourcing chemical contexts. Despite these efforts, existing benchmarks often focus narrowly on single domains (e.g., chemistry), rely solely on unstructured text, and emphasize direct question answering, offering limited coverage of the full spectrum of scientific reasoning skills. As summarized in Table 1, current datasets fall short in three key dimensions: (1) disciplinary scope, with most confined to one or a few domains; (2) data modality diversity, typically limited to plain text; and (3) evaluative depth, often omitting critical competencies such as information-absence detection and multi-source integration. None comprehensively support structured tables, knowledge graphs, and unstructured text within a unified framework across multiple scientific disciplines.
Table 1.
Comparison of SciCUEval with existing benchmark datasets.
| Datasets | Contexts | Domains | Data Modalities | Question Types | Evaluation Competencies | # Instances |
|---|---|---|---|---|---|---|
| LongICLBench11 | ✓ | General | Text | QA | Identification | 2,618 |
| LongBench12 | ✓ | General, Code | Text | QA | Identification, Integration | 4,750 |
| LongBench V213 | ✓ | General, Law, Finance | Text | MCQ | Identification, Integration, Inference | 503 |
| RGB40 | ✓ | General | Text | QA | Identification, Detec., Integration, Inference | 1,000 |
| ChemLit-QA14 | ✓ | Chemistry | Text | QA | Identification, Detec., Inference | 1,054 |
| CHEMRAG-BENCH15 | × | Chemistry | Text | QA, MCQ | Identification, Inference | 1,932 |
| SciCUEval | ✓ | Comprehensive Science | Text, Table, KG | QA, MCQ, T/F, CC | Identification, Detec., Integration, Inference | 11,343 |
Question Types: QA (Question Answering), MCQ (Multiple Choice Question), T/F (True/False Question), and CC (Cloze Completion).
To bridge this gap, we introduce SciCUEval, a comprehensive benchmark designed to rigorously evaluate LLMs’ scientific context understanding under realistic conditions. As shown in Fig. 1, SciCUEval spans ten sub-datasets across biology, chemistry, physics, biomedicine, and materials science, sourced from high-quality scientific literature and databases. It incorporates three core data modalities (unstructured text, structured tables, and semi-structured knowledge graphs) reflecting the heterogeneous nature of real-world scientific data16–18. The dataset supports four question types: open-ended QA, multiple-choice (MCQ), true/false (T/F), and cloze completion (CC), enabling evaluation across four essential competencies: (1) Relevant Information Identification, (2) Information-Absence Detection, (3) Multi-source Information Integration, and (4) Context-Aware Inference.
Fig. 1.
Overview of the SciCUEval dataset. It spans five scientific domains, supports three data modalities (structured tables, knowledge graphs, and unstructured text), and includes four question types. Data are collected from high-quality scientific sources. The dataset enables evaluation across four key competencies: (1) relevant information identification, (2) information-absence detection, (3) multi-source information integration, and (4) context-aware inference.
By cohesively unifying breadth of domain coverage, diversity of data modalities, and depth of reasoning evaluation, SciCUEval offers a uniquely holistic testbed for scientific context understanding in LLMs. With over 10K carefully curated instances, it provides a robust and scalable basis for evaluating both generalization and specialization in scientific LLMs. The contributions of this paper are summarized as follows:
Establishing a scientific context understanding benchmark: We establish a benchmark to evaluate context understanding capabilities of LLMs in scientific domains, serving as a standardized evaluation suite for assessing LLMs’ capabilities in identifying, detecting, integrating, and reasoning over scientific contexts.
Constructing a diverse set of domain-specific context understanding datasets: We construct ten sub-datasets across multiple disciplines, encompassing various data modalities and a wide range of question types to ensure comprehensive evaluation.
Extensive evaluation and analysis of LLMs: We systematically evaluate and analyze the performance of various state-of-the-art LLMs on SciCUEval, highlighting their strengths and limitations, and offering insights for improvement.
Methods
This section presents the dataset construction process in SciCUEval, which involves formulating evaluation competencies, collecting scientific data, generating questions and answers, and conducting rigorous verification.
Evaluation Competencies
We formulate four capabilities essential for evaluating LLMs in scientific contexts:
Relevant Information Identification: LLMs must effectively distinguish between relevant information and extraneous noise within complex scientific contexts. In real-world scenarios, scientific data often contains contextually related but non-essential information. Given a scientific question q and a context C = {c1, c2, …, cn}, where exactly one entry c* ∈ C contains all and only the information necessary to answer q, and all other entries in C\{c*} are irrelevant, a model demonstrates Relevant Information Identification if its answer depends solely on c* and remains unchanged regardless of the presence or content of the other entries.
Information-absence Detection: The ability to abstain from responding when all contextual data is irrelevant or unreliable. Scientific queries often require precise evidence. Given a scientific question q and a context C in which none of the entries contain sufficient or reliable evidence to support a factually correct answer to q, a model demonstrates Information-absence Detection if it refrains from generating an answer and instead outputs a rejection statement (e.g., “I cannot answer the question due to insufficient information in the data.”).
Multi-source Information Integration: Scientific queries often require synthesizing data from multiple sources. Given a scientific question q and a context C, if answering q requires combining evidence from two or more distinct entries ci, cj, ⋯ ∈ C such that no single entry alone suffices to derive the correct answer, a model demonstrates Multi-source Information Integration if it aggregates and compares information across these contextual segments to generate a precise and contextually grounded answer.
Context-aware Inference: The capability to perform logical inference based on context. Given a scientific question q and a context C, if the correct answer to q is not explicitly stated in any entry of C but can be logically inferred through scientific reasoning based on the relationships, patterns, or implicit knowledge contained in C, a model demonstrates Context-aware Inference if it derives a factually correct and domain-consistent conclusion that follows from C under the principles of scientific logic.
These four competencies form the foundation of SciCUEval and provide a systematic evaluation framework for context understanding in scientific domains.
Source Data Collection
Scientific Domains
To evaluate LLMs in scientific contexts comprehensively, we curate data from diverse scientific domains, including Biology, Chemistry, Physics, Biomedicine, and Materials Science. These disciplines are fundamental to modern science, encompassing a wide range of knowledge from theoretical principles to experimental data, ensuring a broad and representative assessment of long-context understanding capabilities in scientific applications.
Data Modalities
To support a broad evaluation of scientific context understanding capabilities, we consider three distinct data modalities: (1) Unstructured Text, (2) Structured Tables, and (3) Semi-structured Knowledge Graphs and Ontologies. Each modality presents unique challenges, enabling a holistic assessment of LLMs’ retrieval, synthesis, reasoning, and integration capabilities in scientific domains. Specifically, unstructured text corpora consist of scientific literature, allowing LLMs to retrieve, synthesize, and infer domain knowledge from textual sources. We collect thousands of recent research papers and experimental protocols from open-access repositories such as arXiv. Structured tables contain numerical and categorical data, testing LLMs’ capacity to interpret structured knowledge, recognize contextual dependencies, and perform quantitative reasoning. We collect nuclear data from IAEA19, material properties from the Materials Project20, and molecular and protein properties from PubChem21. Knowledge graphs and ontologies encode scientific knowledge as interconnected entities and relational networks. Although the two are distinct in their formal definitions, we integrate both to rigorously assess LLMs’ abilities in handling complex relational inference, hierarchical knowledge traversal, and cross-domain knowledge synthesis. We collect well-established scientific semi-structured databases, including Gene Ontology22 for gene-function relationships, HIPPIE23 for protein-protein interactions, PharmKG24 for drug-target interactions, and PrimeKG25 for clinical entity relationships.
Data Generation
Building on the collected source data, we construct corresponding datasets tailored to assess the proposed four competencies. The data generation pipeline is illustrated in Fig. 2, which involves (1) question generation, (2) noise injection, and (3) quality control.
Fig. 2.
Illustration of data generation pipeline in SciCUEval, mainly consisting of question generation, noise injection, and quality control.
Question Generation
We first sample a subset of texts, table rows, or KG triples from the full databases:

{d1, d2, …, dm} = ϕ(D),  (1)

where D denotes the original large-scale scientific data source, ϕ represents the sampling operation (e.g., random sampling), and each di corresponds to a single atomic data record or a small cluster of closely related entries that collectively convey coherent information. Given each sampled data unit di, we then apply a question generation process to produce a corresponding question-answer pair (qi, ai). This process is formulated as:

(qi, ai) = fLLM(p ⊕ di),  (2)

where p is a carefully designed textual prompt, ⊕ denotes simple string concatenation, and fLLM denotes a large language model that processes this combined input to generate both a question qi and its correct answer ai. The LLM is guided by the prompt to produce questions that are not only grounded in di but also exhibit semantic diversity and contextual relevance.
For each evaluation competency, we craft the prompt to ensure the questions are reasonable and aligned with the required competencies. The detailed prompts are provided in Supplementary Information.
These generated questions span various formats, including open-ended QA, multiple-choice, cloze completion, and true/false questions, offering a robust assessment of context understanding abilities.
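For illustration, a minimal sketch of the generation step in Eq. (2) is shown below. It assumes an OpenAI-compatible chat API; the prompt wording and the JSON output format are illustrative placeholders rather than the exact prompts used for SciCUEval (those are given in the Supplementary Information).

```python
# Minimal sketch of (q_i, a_i) = f_LLM(p ⊕ d_i), assuming an OpenAI-compatible
# chat API. The prompt text below is illustrative, not the paper's actual prompt.
import json
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = (  # illustrative prompt p targeting one competency
    "Based only on the scientific record below, write one question and its "
    "correct answer. Return JSON with keys 'question' and 'answer'.\n\nRecord:\n"
)

def generate_qa(data_unit: str, model: str = "gpt-4o") -> dict:
    # p ⊕ d_i: concatenate the prompt with the sampled data unit
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GEN_PROMPT + data_unit}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```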
Noise Injection
Following question generation, we extract noisy information from the source data and inject it into the context. Specifically, we inject semantically similar yet unrelated entries into the context using an embedding-based similarity search. Formally, each sample before noise injection is denoted as xi = (qi, ai, di). To select distractor entries, we first compute embeddings for all candidate entries in the dataset using Sentence-BERT26:
ej = fS-BERT(dj),  (3)

where fS-BERT denotes the embedding function and ej is the embedding vector for entry dj. We then employ the cosine similarity to efficiently retrieve the Top-k entries most similar to the selected entry di:

Ni = Top-k_{dj ≠ di} sim(ei, ej),  (4)

where sim(·, ·) denotes the cosine similarity between embedding vectors. The final sample after noise injection is represented as (qi, ai, di, Ni), where Ni contains the k selected distractor entries used to augment the context. We sample k ∈ [200, 300] for structured tables and KGs, and set k = 5 for unstructured text. To inject a more challenging and realistic noise distribution, we also employ a hybrid strategy combining embedding similarity with N-gram overlap filtering. This ensures that the injected noise is semantically related (confusing to the model) but lexically distinct (factually unrelated). First, we compute embeddings using Sentence-BERT and retrieve the Top-K candidates based on cosine similarity:

Ccand = Top-K_{dj ≠ di} sim(ei, ej).  (5)

For each candidate c ∈ Ccand, we compute the normalized N-gram overlap for n ∈ {1, 2, 3}:

overlapn(c, di) = |ngramsn(c) ∩ ngramsn(di)| / min(|ngramsn(c)|, |ngramsn(di)|).  (6)

We define the maximum overlap score as:

smax(c, di) = max_{n ∈ {1, 2, 3}} overlapn(c, di).  (7)

We accept a candidate c as a valid noise entry only if the maximum overlap is below a threshold τ (set to 0.1):

smax(c, di) < τ.  (8)
From the filtered set, we randomly select k entries to form the final noisy context. To validate this approach, we conducted an ablation study (Supplementary Table 3) comparing this hybrid method against a cosine-only baseline. Through this approach, the injected noise closely mimics the type of confusing or misleading information that LLMs may encounter in practice, ensuring the benchmark dataset remains both challenging and realistic.
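A minimal sketch of this hybrid selection procedure (Eqs. (3)-(8)) is given below, assuming the entries are plain strings. The Sentence-BERT checkpoint name, the Top-K value, and the min-based n-gram normalization are illustrative assumptions rather than the exact configuration used for every sub-dataset.

```python
# Minimal sketch of hybrid noise injection: semantic Top-K retrieval followed by
# n-gram overlap filtering (tau = 0.1), then random selection of k distractors.
import random
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def ngrams(text: str, n: int) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def max_ngram_overlap(candidate: str, source: str, ns=(1, 2, 3)) -> float:
    scores = []
    for n in ns:
        c_ngrams, s_ngrams = ngrams(candidate, n), ngrams(source, n)
        if not c_ngrams or not s_ngrams:
            continue
        scores.append(len(c_ngrams & s_ngrams) / min(len(c_ngrams), len(s_ngrams)))
    return max(scores) if scores else 0.0

def inject_noise(source_entry, candidate_entries, k=5, top_K=50, tau=0.1):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed S-BERT checkpoint
    src_emb = model.encode([source_entry])
    cand_embs = model.encode(candidate_entries)
    sims = cosine_similarity(src_emb, cand_embs)[0]
    # Top-K most similar candidates (semantically confusing)
    ranked = sorted(range(len(candidate_entries)), key=lambda i: -sims[i])[:top_K]
    # Keep only lexically distinct candidates (low n-gram overlap with the source)
    filtered = [candidate_entries[i] for i in ranked
                if max_ngram_overlap(candidate_entries[i], source_entry) < tau]
    # k = 5 for text; k in [200, 300] would be used for tables and KGs
    return random.sample(filtered, min(k, len(filtered)))
```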
Quality Control
To maintain the rigor of the constructed dataset, we implement a two-stage verification process to ensure data quality:
(1) LLM as a Judge: We use advanced LLMs (e.g., GPT-4o) as automated evaluators to verify that each generated answer is both extractable and logically deducible from the relevant context, ensuring factual consistency and relevance.
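A minimal sketch of this automated check is shown below, assuming an OpenAI-compatible chat API; the judging prompt wording is illustrative rather than the exact prompt used for SciCUEval.

```python
# Minimal sketch of the LLM-as-a-judge verification step (illustrative prompt).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Given the context, question, and answer below, reply with 'Yes' if the answer "
    "can be extracted or logically deduced from the context, otherwise reply 'No'.\n"
)

def judge_instance(context: str, question: str, answer: str, model: str = "gpt-4o") -> bool:
    content = f"{JUDGE_PROMPT}\nContext:\n{context}\n\nQuestion: {question}\nAnswer: {answer}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")  # keep only instances judged as supported
```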
(2) Human Expert Evaluation: To further ensure the quality and accuracy of the generated data, we subjected the data that passed the initial LLM validation to manual review by five PhD-level researchers with strong STEM backgrounds. These experts were tasked with thoroughly evaluating each instance based on the following three criteria:
Whether the question effectively tests the intended competency, ensuring that it is aligned with the targeted skill or knowledge domain and accurately reflects the underlying construct it aims to assess.
Whether the question is expressed clearly and logically, such that its wording is unambiguous, coherent, and easily understood by both human evaluators and automated systems, thereby minimizing potential misinterpretations
Whether the given contexts fully support the given answer and is factually correct, which requires that the answer not only directly derives from or can be logically inferred based on the supporting materials, but also adheres to facts and scientific evidence. Together, these criteria are designed to ensure the evaluation process’s validity, clarity, and reliability.
Only instances that received “Yes” for all three criteria were accepted. After the experts reviewed all instances, 90.83% were found to meet the required high-quality standards. Reviewers were compensated based on the number of questions reviewed, at a rate of $30 per 100 questions, totaling $3,300 for roughly 11K questions. The entire review process took one week.
Data Records
Dataset Overview
Based on the data collection, generation, and quality control processes described above, we construct the final SciCUEval dataset, encompassing ten distinct sub-datasets (two unstructured text datasets, four structured table datasets, and four knowledge graph datasets), covering diverse scientific fields. Each sub-dataset contains approximately a thousand high-quality questions, leading to a total of 11,343 questions across the entire dataset. An overview of the dataset composition is presented in Table 2, summarizing the scientific data source, modality, and question distribution for each sub-dataset.
Table 2.
Statistics of the SciCUEval dataset, which comprises ten sub-datasets derived from diverse scientific data.
| Sub-dataset | Domain | Source | Modality | # Info. Ident. | # Abs. Detec. | # Info. Integ. | # Con. Infer. | # Total |
|---|---|---|---|---|---|---|---|---|
| MatText | Materials | arXiv | Text | 216 | 146 | 222 | 356 | 940 |
| BioText | Biology | Biorxiv | Text | 236 | 97 | 318 | 317 | 968 |
| MatTab | Materials | Material Project | Table | 299 | 150 | 287 | 200 | 936 |
| IaeaTab | Physics | IAEA | Table | 442 | 222 | 286 | 180 | 1130 |
| ProtTab | Biology | Pubchem | Table | 496 | 249 | 327 | 180 | 1252 |
| MolTab | Chemistry | Pubchem | Table | 516 | 259 | 350 | 180 | 1305 |
| GoKG | Biology | Gene Ontology | KG | 507 | 254 | 239 | 180 | 1180 |
| HipKG | Biology | HIPPIE | KG | 470 | 236 | 319 | 140 | 1165 |
| PhaKG | Biomedicine | PharmKG | KG | 512 | 256 | 281 | 168 | 1217 |
| PriKG | Biomedicine | PrimeKG | KG | 410 | 205 | 382 | 253 | 1250 |
Dataset Format and Structure
The SciCUEval dataset is publicly available on figshare (10.6084/m9.figshare.29924687)27. It is released under the CC BY license, allowing broad reuse with attribution.
Each entry is formatted as a JSON object and includes the following standardized attributes:
question: A string containing the text of the question to be answered.
answer: A list containing the correct answer(s) to the question.
Source: A string specifying the name of the data source from which the question is derived.
URL: A string providing the link to the original source database.
database type: A string describing the modality of the database (e.g., Table, Knowledge Graph, Text).
domain: A string indicating the scientific domain to which the problem belongs.
Correct Context: A list of strings providing the relevant context entries necessary to determine the correct answer.
Context: A list of strings containing additional background information or dataset excerpts that may include both relevant and irrelevant entries.
id: A string uniquely identifying the problem instance in the dataset.
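For reference, a minimal sketch of loading and validating records against this schema is shown below; the file name and the assumption that each file holds a single JSON array of objects are illustrative, so adapt them to the actual figshare layout.

```python
# Minimal sketch of loading SciCUEval records and checking the attributes above.
import json

REQUIRED_KEYS = {
    "question", "answer", "Source", "URL", "database type",
    "domain", "Correct Context", "Context", "id",
}

def load_records(path: str) -> list:
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)  # assumed: one JSON array of question objects per file
    for record in records:
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Record {record.get('id')} is missing keys: {missing}")
    return records

# Example (hypothetical file name):
# records = load_records("MatTab.json")
```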
Technical Validation
In this section, we evaluate the performance of various LLMs on SciCUEval, and provide a thorough analysis of their capabilities in understanding scientific contexts.
Experimental Setup
Models
We select 18 advanced LLMs of different scales, including 3 proprietary models (GPT-4o2, Claude-3.5-Sonnet28, and GPT-4o-mini), 11 open-source general-purpose models (DeepSeek-V329, DeepSeek-R130, Qwen2.5-7B-Instruct4, Qwen3-8B (with explicit thinking)31, Llama3.1-8B-Instruct, Llama3.1-70B-Instruct3, Llama-4-Maverick-17B-128E-Instruct, Llama-4-Scout-17B-16E-Instruct32, Ministral-8B-Instruct33, GLM4-9B-Chat34, and Gemma2-9B-it35), and 4 open-source scientific-domain models (SciGLM-6B36, LlaSMol-Mistral-7B37, ChemLLM-7B-Chat38, and ChemDFM-v1.5-8B39). Llama3.1-70B-it, Llama4-Scout, and Llama4-Maverick are accessed via the NVIDIA NIM APIs. DeepSeek-R1, DeepSeek-V3, and the proprietary models are accessed via their official APIs. The remaining open-source models are deployed locally on a server equipped with two NVIDIA GeForce RTX 3090 GPUs. Detailed information on these models is shown in Table 3.
Table 3.
Overview of the LLMs assessed in our experimental framework.
| Model Name | Creator | Domain | #Parameters | Access | URL |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | General | undisclosed | Official API | https://chat.openai.com |
| GPT-4o-mini | OpenAI | General | undisclosed | Official API | https://chat.openai.com |
| Claude-3.5-Sonnet | Anthropic | General | undisclosed | Official API | https://claude.ai |
| DeepSeek-V3 | DeepSeek | General | 671B | Official API | https://www.deepseek.com |
| DeepSeek-R1 | DeepSeek | General | 671B | Official API | https://www.deepseek.com |
| Llama3.1-70B-it | Meta | General | 70B | NVIDIA NIM API | https://llama.meta.com/llama3 |
| Llama3.1-8B-it | Meta | General | 8B | Weights | https://llama.meta.com/llama3 |
| Llama4-Maverick | Meta | General | 400B(17B × 128 Experts) | NVIDIA NIM API | https://www.llama.com/models/llama-4/ |
| Llama4-Scout | Meta | General | 109B(17B × 16 Experts) | NVIDIA NIM API | https://www.llama.com/models/llama-4/ |
| Qwen2.5-7B-it | Alibaba | General | 7B | Weights | https://qwenlm.github.io/ |
| Qwen3-8B | Alibaba | General | 8B | Weights | https://qwenlm.github.io/ |
| GLM4-9B-Chat | Tsinghua&Zhipu | General | 9B | Weights | https://huggingface.co/THUDM/glm-4-9b-chat |
| Gemma2-9B-it | General | 9B | Weights | https://ai.google.dev/gemma | |
| Ministral-8B-it | Mistral | General | 8B | Weights | https://mistral.ai |
| ChemDFM-v1.5-8B | SJTU | Chemistry | 8B | Weights | https://github.com/OpenDFM/ChemDFM |
| SciGLM-6B | Tsinghua | Science | 6B | Weights | https://github.com/THUDM/SciGLM |
| LlaSMol-Mistral-7B | OSU | Chemistry | 7B | Weights | https://huggingface.co/osunlp/LlaSMol-Mistral-7B |
| ChemLLM-7B-chat | ShanghaiAILab | Chemistry | 7B | Weights | https://huggingface.co/AI4Chem/ChemLLM-7B-Chat |
Settings
To ensure a fair evaluation across all models, we adopt a unified prompting template that standardizes input formatting. Specifically, each input consists of a system prompt that specifies the question type and answer-format requirements, the context, and a question designed to assess one of the four core competencies in SciCUEval. Given that each question in SciCUEval has a deterministic answer, we adopt accuracy as the evaluation metric for all question types across the tasks of relevant information identification, multi-source information integration, and context-aware inference. For the task of information-absence detection, we use the rejection rate as the evaluation metric.
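A minimal sketch of these two metrics under this setup is given below; the exact-match normalization and the keyword-based refusal detection are simplifying assumptions, not the released evaluation scripts.

```python
# Minimal sketch of accuracy and rejection rate, assuming string predictions and
# gold answers given as lists of acceptable strings (as in the released schema).
def is_rejection(prediction: str) -> bool:
    # Heuristic: treat explicit refusals as rejections.
    keywords = ("cannot answer", "insufficient information")
    return any(k in prediction.lower() for k in keywords)

def accuracy(predictions, gold_answers) -> float:
    # Fraction of predictions matching any acceptable gold answer (exact match).
    correct = sum(
        any(p.strip().lower() == a.strip().lower() for a in answers)
        for p, answers in zip(predictions, gold_answers)
    )
    return correct / len(predictions)

def rejection_rate(predictions) -> float:
    # Share of information-absence items on which the model abstains.
    return sum(is_rejection(p) for p in predictions) / len(predictions)
```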
Overall Results
Table 4 shows the performance of 18 LLMs on SciCUEval across ten sub-datasets. The results highlight several important trends. First, models with explicit reasoning capabilities demonstrate clear advantages. The reasoning-augmented open-source model DeepSeek-R1 achieves the highest overall accuracy, outperforming both proprietary models (e.g., GPT-4o) and other general-purpose open-source models. Qwen3-8B with explicit thinking also performs strongly, ranking second among open-source models. This indicates that incorporating structured reasoning pathways, even without domain-specific pretraining, can significantly enhance performance in scientific tasks. Second, proprietary models such as GPT-4o and Claude-3.5-Sonnet maintain competitive performance, especially in unstructured text-based domains (e.g., BioText, MatText), benefiting from their superior language understanding and generalization capabilities (Table 5). Third, scientific-domain LLMs such as ChemDFM-v1.5-8B and SciGLM-6B exhibit substantially lower performance across all datasets. Although designed for scientific domains, these models tend to lack general reasoning capacity and robustness across modalities (Table 6). Fourth, there is a strong positive correlation between model size and effectiveness. Large-scale models (e.g., GPT-4o, Llama4-Maverick, and Llama3.1-70B) consistently outperform their smaller counterparts (e.g., GPT-4o-mini, Llama4-Scout, and Llama3.1-8B) across most domains.
Table 4.
Performance of LLMs across ten sub-datasets on SciCUEval.
| Models | MatTab | IaeaTab | MolTab | ProtTab | PhaKG | PriKG | HipKG | GoKG | BioText | MatText | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 68.79 | 56.55 | 55.79 | 52.64 | 55.71 | 54.80 | 68.50 | 74.32 | 79.03 | 64.57 | 61.52 |
| GPT-4o-mini | 40.71 | 38.85 | 46.67 | 44.57 | 40.59 | 52.64 | 65.20 | 73.14 | 79.24 | 65.00 | 54.57 |
| Claude-3.5-Sonnet | 48.48 | 42.03 | 67.91 | 52.22 | 50.94 | 45.96 | 75.78 | 84.07 | 58.06 | 61.49 | 59.20 |
| DeepSeek-R1 | 73.71 | 71.89 | 74.69 | 72.44 | 58.66 | 58.20 | 69.66 | 79.18 | 74.79 | 63.09 | 69.72 |
| Qwen3-8B | 63.14 | 59.20 | 70.80 | 69.33 | 55.16 | 54.48 | 74.68 | 73.98 | 69.73 | 55.11 | 64.69 |
| DeepSeek-V3 | 56.62 | 54.07 | 59.85 | 52.08 | 52.18 | 51.92 | 63.42 | 72.29 | 66.74 | 45.31 | 57.50 |
| Llama4-Maverick | 46.47 | 47.79 | 48.20 | 43.61 | 48.32 | 49.28 | 64.81 | 72.71 | 63.02 | 54.15 | 53.65 |
| Llama4-Scout | 48.93 | 47.70 | 46.90 | 46.17 | 39.77 | 48.08 | 59.57 | 66.27 | 61.88 | 48.51 | 51.16 |
| Llama3.1-70B-it | 38.25 | 39.73 | 44.44 | 41.29 | 44.70 | 44.00 | 59.31 | 70.17 | 66.53 | 51.91 | 49.80 |
| Qwen2.5-7B-it | 28.10 | 32.65 | 43.30 | 39.46 | 36.15 | 45.60 | 53.99 | 62.46 | 68.18 | 59.68 | 46.62 |
| GLM4-9B-Chat | 31.41 | 25.84 | 47.82 | 43.45 | 36.03 | 44.56 | 57.94 | 60.51 | 67.77 | 50.96 | 46.46 |
| Llama3.1-8B-it | 28.85 | 34.34 | 42.76 | 39.78 | 38.29 | 46.56 | 52.62 | 59.32 | 64.26 | 49.36 | 45.50 |
| Gemma2-9B-it | 32.91 | 32.21 | 42.91 | 37.22 | 37.39 | 50.48 | 56.57 | 57.29 | 37.77 | 29.67 | 42.21 |
| Ministral-8B-it | 23.08 | 19.12 | 35.56 | 37.38 | 22.76 | 37.92 | 48.51 | 52.88 | 48.14 | 45.32 | 37.58 |
| ChemDFM-v1.5-8B | 33.65 | 31.15 | 35.56 | 36.82 | 40.43 | 30.72 | 49.70 | 56.44 | 26.11 | 18.91 | 36.80 |
| SciGLM-6B | 11.86 | 11.50 | 17.70 | 14.94 | 19.56 | 20.88 | 21.46 | 28.31 | 44.17 | 31.35 | 21.58 |
| LlaSMol-Mistral-7B | 13.35 | 12.83 | 16.55 | 14.70 | 21.54 | 19.84 | 22.83 | 29.92 | 33.13 | 20.98 | 20.42 |
| ChemLLM-7B-chat | 3.42 | 6.02 | 8.81 | 8.15 | 13.45 | 5.92 | 5.15 | 15.51 | 39.94 | 22.67 | 12.16 |
Underline results indicate the best results among all models. Bold results indicate the best results in each category.
Table 5.
Evaluation results of LLMs across four competencies on SciCUEval.
| Models | Info. Ident. | Abs. Detec. | Info. Integ. | Con. Infer. | Overall |
|---|---|---|---|---|---|
| GPT-4o | 89.72 | 19.51 | 54.90 | 65.97 | 61.52 |
| GPT-4o-mini | 77.81 | 14.71 | 47.54 | 57.68 | 54.57 |
| Claude-3.5-Sonnet | 82.95 | 49.10 | 50.85 | 47.29 | 59.20 |
| DeepSeek-R1 | 94.13 | 11.75 | 72.78 | 79.44 | 69.72 |
| Qwen3-8B | 88.29 | 43.57 | 53.87 | 64.38 | 64.69 |
| DeepSeek-V3 | 90.80 | 6.05 | 49.80 | 60.53 | 57.50 |
| Llama4-Maverick | 77.61 | 7.16 | 54.16 | 58.52 | 53.65 |
| Llama4-Scout | 71.38 | 25.42 | 43.65 | 54.95 | 51.16 |
| Llama3.1-70B-it | 81.05 | 6.87 | 45.44 | 47.76 | 49.80 |
| Qwen2.5-7B-it | 69.92 | 9.02 | 42.95 | 50.34 | 46.62 |
| GLM4-9B | 71.48 | 2.53 | 50.78 | 43.82 | 46.46 |
| Llama3.1-8B-it | 75.34 | 5.88 | 41.47 | 39.99 | 45.50 |
| Gemma2-9B-it | 66.97 | 2.38 | 28.74 | 48.22 | 42.20 |
| Ministral-8B-it | 56.80 | 4.76 | 31.32 | 39.62 | 37.58 |
| ChemDFM-v1.5-8B | 45.49 | 19.31 | 22.23 | 46.40 | 36.80 |
| SciGLM-6B | 33.24 | 9.01 | 18.00 | 29.48 | 21.58 |
| LlaSMol-Mistral-7B | 31.96 | 6.83 | 14.63 | 26.59 | 20.42 |
| ChemLLM-7B-Chat | 20.29 | 4.09 | 16.85 | 7.57 | 12.16 |
Underline results indicate the best results among all models. Bold results indicate the best results in each category.
Table 6.
Evaluation results of LLMs across three modalities on SciCUEval.
| Models | Text | Table | KG | Overall |
|---|---|---|---|---|
| GPT-4o | 71.91 | 55.91 | 63.13 | 61.52 |
| GPT-4o-mini | 72.22 | 42.98 | 55.84 | 54.57 |
| Claude-3.5-Sonnet | 59.75 | 53.41 | 64.99 | 59.20 |
| DeepSeek-R1 | 69.00 | 73.19 | 66.64 | 69.72 |
| Qwen3-8B | 62.53 | 66.02 | 64.27 | 64.69 |
| DeepSeek-V3 | 56.18 | 55.68 | 59.77 | 57.50 |
| Llama4-Maverick | 58.65 | 46.51 | 58.54 | 53.65 |
| Llama4-Scout | 55.29 | 47.31 | 53.22 | 51.16 |
| Llama3.1-70B-it | 59.33 | 41.19 | 54.30 | 49.80 |
| Qwen2.5-7B-it | 69.93 | 36.58 | 49.38 | 46.24 |
| GLM4-9B-Chat | 59.49 | 37.94 | 49.46 | 46.46 |
| Llama3.1-8B-it | 56.92 | 37.08 | 49.06 | 45.50 |
| Gemma2-9B-it | 34.27 | 36.73 | 50.23 | 42.21 |
| Ministral-8B-it | 46.75 | 29.50 | 41.24 | 37.58 |
| ChemDFM-v1.5-8B | 22.92 | 34.44 | 44.05 | 36.80 |
| SciGLM-6B | 38.48 | 14.25 | 22.51 | 21.58 |
| LlaSMol-Mistral-7B | 27.74 | 14.49 | 23.45 | 20.42 |
| ChemLLM-7B-chat | 32.28 | 6.86 | 10.01 | 12.16 |
Underline results indicate the best results among all models. Bold results indicate the best results in each category.
Evaluation Results of Four Competencies
Relevant Information Identification
This competency measures a model’s ability to locate and select the correct pieces of information from the provided context. As shown in Fig. 3, DeepSeek-R1 leads all evaluated models, suggesting that explicit reasoning mechanisms effectively enhance factual grounding. DeepSeek-V3, GPT-4o, and Qwen3-8B also exhibit strong performance, showing the advantages of in-context retrieval capabilities. In contrast, scientific-domain LLMs exhibit notably weaker performance in identifying relevant contexts across diverse scenarios.
Fig. 3.
Performance of LLMs across four competencies on SciCUEval.
Information-absence Detection
This metric evaluates whether a model appropriately withholds an answer when the required information is absent. While Claude-3.5-Sonnet and Qwen3-8B demonstrate relatively high accuracy, most models struggle significantly in this competency, with scores below 20%. To investigate this, we conducted an ablation study on prompting strategies (Supplementary Table 4). The results reveal a stark contrast: while models possess the intrinsic capability to detect information absence (achieving high rejection rates when explicitly instructed to refuse), they exhibit a severe tendency to hallucinate under standard prompting conditions. This highlights the critical risk of “overconfidence” in current models, which may pose safety risks in scientific domains where users may not always provide defensive instructions.
Multi-source Information Integration
This task assesses a model’s ability to synthesize information from multiple entries to construct a complete answer. DeepSeek-R1 achieves the highest performance, followed by GPT-4o and Llama4-Maverick, suggesting that these models are better equipped to combine multiple data points into coherent and accurate answers. Among smaller open-source models, GLM4-9B shows a competitive score, even surpassing DeepSeek-V3 in this competency. However, scientific LLMs significantly lag behind, indicating that while these domain-specific models are adept at handling scientific text, they face challenges in effectively synthesizing information from diverse sources.
Context-aware Inference
This capability reflects a model’s ability to reason over contextually relevant information. DeepSeek-R1 achieves the highest performance, and GPT-4o and Qwen3-8B also perform well, indicating that large-scale models and those enhanced with explicit thinking benefit significantly in contextual reasoning tasks. In contrast, models like Claude-3.5-Sonnet and DeepSeek-V3 show moderate capabilities but fall behind on deeper inference. Scientific-domain models such as ChemLLM-7B-Chat and SciGLM perform poorly, indicating limited general reasoning capabilities despite domain specialization.
Evaluation Results of Three Modalities
Figure 4 shows the performance of LLMs across three modalities: Text, Table, and KG, highlighting their strengths and weaknesses in handling diverse scientific data formats. Overall, LLMs tend to perform best on the text modality, reflecting their strong natural language understanding and generation capabilities. Notably, some smaller models even exceed their average overall performance on text, indicating a relative maturity in handling unstructured text data. In the table modality, reasoning-augmented models demonstrate a clear advantage, suggesting that explicit reasoning mechanisms and the ability to process structured data significantly benefit table understanding. In contrast, general LLMs show weaker performance on tables, implying challenges in leveraging tabular structure with traditional language modeling approaches. Similarly, for KG data, models with reasoning enhancements again lead, reflecting their ability to leverage relational and graph-structured information effectively. Additionally, domain-specific scientific models consistently underperform across all three modalities compared to general-purpose or reasoning-augmented models.
Fig. 4.
Performance of LLMs across three modalities on SciCUEval.
Findings
Our experimental results highlight three key discrepancies in the performance of LLMs on scientific context understanding tasks, underscoring fundamental challenges that require further advancements.
Competency Discrepancy
The evaluation results reveal notable disparities across the four core competencies. While top-performing models exhibit relatively strong capabilities in identifying relevant information, they struggle with information-absence detection–the ability to abstain from answering when faced with unreliable or insufficient evidence. This suggests that models prioritize generating responses over ensuring accuracy, increasing the risk of hallucinations in scientific applications where factual correctness is critical. To address this, models should incorporate uncertainty quantification techniques, such as confidence-based rejection mechanisms and calibrated probability outputs, to enhance their ability to detect and reject misleading retrievals. Furthermore, reinforcement learning with human feedback and verification-based prompting strategies could help improve the model’s reliability in rejecting incorrect information.
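As a simple illustration of such a confidence-based rejection mechanism, the sketch below abstains whenever an answer's average token probability falls under a threshold; the confidence estimate (geometric-mean token probability) and the threshold value are illustrative assumptions, not a calibrated design.

```python
# Minimal sketch of confidence-based abstention from average token log-probabilities.
import math

REJECTION = "I cannot answer the question due to insufficient information in the data."

def answer_with_abstention(answer: str, token_logprobs, threshold: float = 0.5) -> str:
    # Geometric-mean token probability as a crude confidence estimate
    # (the threshold of 0.5 is an illustrative choice, not a tuned value).
    if not token_logprobs:
        return REJECTION
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return answer if confidence >= threshold else REJECTION
```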
Modality Discrepancy
LLMs exhibit relatively better performance on unstructured text compared to structured tables and KGs. This suggests that existing models rely heavily on linguistic patterns and semantic context rather than structured inference and multi-modal data integration. The weaker performance on tables and KGs indicates a bottleneck in structured data comprehension, where models struggle to extract, synthesize, and infer information effectively from structured and semi-structured data. To bridge this gap, models need improved cross-modal alignment, integrating structured data reasoning into their training paradigm. Techniques such as joint pretraining on text, tables, and graphs could enhance structured data understanding.
Specialized vs. General Model Discrepancy
Although scientific LLMs are explicitly designed for knowledge-intensive tasks, our evaluation shows that they often fail to outperform general-purpose models on our dataset. This suggests that current specialized models lack sufficient reasoning depth, robustness, and flexibility to fully leverage domain knowledge in complex scientific contexts. Their narrower training scope may limit generalization across diverse data modalities and reasoning challenges. To improve their contextual understanding, scientific models should incorporate targeted fine-tuning using curated scientific evidence and adopt domain-aware prompt engineering strategies. These approaches can help balance deep specialization with the adaptability required to tackle a broad range of scientific tasks, enhancing their effectiveness across diverse scenarios.
Acknowledgements
This work is funded by New Generation Artificial Intelligence - National Science and Technology Major Project (2025ZD0122801, H.C.), NSFC62301480 (K.D.), NSFC62302433 (Q.Z.), NSFCU23A20496 (Q.Z.), and Ant Group Research Fund (K.D.). The AI-driven experiments, simulations and model training were performed on the robotic AI-Scientist platform of Chinese Academy of Sciences.
Author contributions
J.Y. and Y.T. contributed equally to this work. J.Y., Y.T., and K.D. conceived the study and designed the method. J.Y. and Y.T. implemented the method, conducted the experiments, and performed the result analyses. K.F. and M.R. provided guidance and assistance in dataset construction. J.Y., Y.T., Q.Z., K.D., and H.C. wrote and revised the manuscript. K.D. and H.C. supervised the entire project. All authors wrote the paper, reviewed it, and approved the final paper.
Data availability
The complete SciCUEval dataset used in this study has been uploaded to figshare (10.6084/m9.figshare.29924687)27.
Code availability
The SciCUEval evaluation scripts for this study have been uploaded to GitHub (https://github.com/HICAI-ZJU/SciCUEval).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Keyan Ding, Email: dingkeyan@zju.edu.cn.
Huajun Chen, Email: huajunsir@zju.edu.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41597-026-06594-9.
References
- 1.Bai, J. et al. Qwen technical report. arXiv:2309.16609 (2023).
- 2.OpenAI, et al. GPT-4o System Card. arXiv:2410.21276 (2024).
- 3.Dubey, A. et al. The Llama 3 herd of models. arXiv:2407.21783 (2024).
- 4.Qwen, et al. Qwen2.5 Technical Report. arXiv:2412.15115 (2025).
- 5.Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv:1903.10676 (2019).
- 6.Mann, B. et al. Language models are few-shot learners. arXiv:2005.14165 (2020).
- 7.Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
- 8.Mialon, G. et al. Augmented language models: A survey. arXiv:2302.07842 (2023).
- 9.Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv:2402.13178 (2024).
- 10.Liu, J. et al. RepoQA: Evaluating long context code understanding. arXiv:2406.06025 (2024).
- 11.Li, T., Zhang, G., Do, Q. D., Yue, X. & Chen, W. Long-context LLMs struggle with long in-context learning. arXiv:2404.02060 (2024).
- 12.Bai, Y. et al. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv:2308.14508 (2023).
- 13.Bai, Y. et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv:2412.15204 (2024).
- 14.Wellawatte, G. P. et al. ChemLit-QA: A human evaluated dataset for chemistry RAG tasks. Machine Learning: Science and Technology 6, 020601 (2025).
- 15.Zhong, X. et al. Benchmarking Retrieval-Augmented Generation for Chemistry. arXiv:2505.07671 (2025).
- 16.Fang, X. et al. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding-A Survey. arXiv:2402.17944 (2024).
- 17.He, X. et al. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. Advances in Neural Information Processing Systems 37, 132876–132907 (2024).
- 18.Talmor, A. & Berant, J. The Web as a knowledge-base for answering complex questions. arXiv:1803.06643 (2018).
- 19.International Atomic Energy Agency (IAEA). Nuclear Data Services - ENSDF Query Form. Available at: https://www-nds.iaea.org/relnsd/NdsEnsdf/QueryForm.html.
- 20.The Materials Project. Available at: https://next-gen.materialsproject.org.
- 21.National Center for Biotechnology Information (NCBI). Available at: https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72.
- 22.Gene Ontology Consortium. Available at: https://geneontology.org/docs/download-ontology/.
- 23.Alanis-Lobato, G., Andrade-Navarro, M. A. & Schaefer, M. H. HIPPIE v2.0: Enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Research gkw985 (2016).
- 24.Zheng, S. et al. PharmKG: A dedicated knowledge graph benchmark for biomedical data mining. Briefings in Bioinformatics 22, bbaa344 (2021).
- 25.Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Scientific Data 10(1), 67 (2023).
- 26.Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv:1908.10084 (2019).
- 27.Ding, K. & Tang, Y. SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models. figshare 10.6084/m9.figshare.29924687 (2025).
- 28.Anthropic AI. The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card (2024).
- 29.DeepSeek-AI. et al. DeepSeek-V3 Technical Report. arXiv:2412.19437 (2024).
- 30.Guo, D. et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 (2025).
- 31.Yang, A. et al. Qwen3 Technical Report. arXiv:2505.09388 (2025).
- 32.Meta. Llama 4: Leading Intelligence. Available at: https://www.llama.com/models/llama-4/. Accessed: 2025-05-19.
- 33.Jiang, A. Q. et al. Mistral 7B. arXiv:2310.06825 (2023).
- 34.Team GLM. et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793 (2024).
- 35.Gemma Team et al. Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118 (2024).
- 36.Zhang, D. et al. SciGLM: Training scientific language models with self-reflective instruction annotation and tuning. arXiv:2401.07950 (2024).
- 37.Yu, B., Baker, F. N., Chen, Z., Ning, X. & Sun, H. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv:2402.09391 (2024).
- 38.Zhang, D. et al. ChemLLM: A Chemical Large Language Model. arXiv:2402.06852 (2024).
- 39.Zhao, Z. et al. ChemDFM: Dialogue Foundation Model for Chemistry. arXiv:2401.14818 (2024).
- 40.Chen, J., Lin, H., Han, X. & Sun, L. Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence 38(16), 17754–17762 (2024).