Abstract
Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design. For instance, these models excel at accurately extracting key findings from papers or generating coherent experimental procedures. However, existing evaluation benchmarks primarily assess performance using rich contextual inputs. We introduce LiveIdeaBench, a comprehensive benchmark evaluating LLMs’ scientific idea generation by assessing divergent thinking capabilities using single-keyword prompts. Drawing from Guilford’s creativity theory, our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five dimensions: originality, feasibility, fluency, flexibility, and clarity. Through experimentation with over 40 leading models across 1180 keywords spanning 22 scientific domains, we reveal that the scientific idea generation capabilities measured by our benchmark are poorly predicted by standard metrics of general intelligence. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to models such as claude-3.7-sonnet: thinking, despite significant gaps in their general intelligence scores. These findings highlight the need for specialized evaluation benchmarks for scientific idea generation and suggest that enhancing these idea generation capabilities in LLMs may require different training strategies than those used for improving general problem-solving abilities. Such strategies could potentially enable a wider range of AI tools tailored for different stages of the scientific process.
Subject terms: Computer science, Research data, Scientific data
The authors introduce a comprehensive benchmark to evaluate LLMs’ scientific idea generation by assessing divergent thinking capabilities using single-keyword prompts.
Introduction
The advancement of scientific knowledge relies heavily on creative thinking and the generation of novel hypotheses. The ability to envision new possibilities and formulate testable explanations is crucial for scientific progress. In recent years, LLMs have demonstrated remarkable capabilities in various scientific tasks, from literature analysis to experimental design, suggesting their potential as powerful tools for augmenting scientific discovery1–3. Concurrently, machine learning techniques are being applied to forecast future research directions by analyzing the structure and evolution of scientific knowledge networks4. As Rafner et al.5 note, these generative AI systems now perform comparably to humans on some creativity tests and could potentially enhance the creative capabilities of knowledge workers worldwide.
To understand the potential of AI in this domain, early theoretical work on human creativity offers valuable insights for our investigation. In their seminal 1962 work, Getzels and Jackson6 examined the relationship between intelligence and creativity, particularly among “gifted” students. Their study yielded two crucial findings: first, high IQ does not necessarily equate to high creativity; and second, creativity and intelligence function as relatively independent traits. These observations led to the threshold theory6,7, proposing that intelligence is necessary but not sufficient for creativity, with its influence diminishing beyond a certain threshold.
However, subsequent research has yielded mixed evidence for this theory. While Gralewski et al.8 and Jauk et al.9 found some empirical support under certain conditions, some studies found no support10,11, challenging the threshold concept. More recent work by Weiss et al.12 concluded that the relationship is likely linear with no evidence supporting a threshold. These varying findings suggest that the development of idea generation capabilities might be underpinned by different cognitive mechanisms than general intelligence, a pattern potentially relevant to LLM development as well. The threshold debate also distinguishes creative potential from creative achievement: some research suggests that, whether or not a threshold applies to divergent thinking ability, intelligence may remain related to creative achievement9,13 or even become more important for translating potential into real-world creative achievements across the entire range14,15. These early studies primarily focused on the relationship between general cognitive abilities and creative potential, often measured via divergent thinking tasks. While much of this debate centers on general creativity, the specific relationship between intelligence and scientific creativity, which heavily relies on deep domain knowledge and rigorous methodology alongside ideation, remains an area requiring nuanced investigation. Nevertheless, the general finding that creativity and intelligence are distinct capabilities provides a useful starting point for examining LLMs.
Another theoretical strand addresses creativity assessment methodologies. Guilford’s influential theoretical framework7,16 introduced the distinction between divergent thinking (generating varied responses to open-ended prompts) and convergent thinking (finding optimal solutions to well-defined problems). Guilford identified four key aspects of divergent thinking: the ability to generate a large number of ideas (fluency), the capacity to think across different categories (flexibility), the generation of novel ideas (originality), and the development of detailed and refined ideas (elaboration). More recent research by Cortes et al.17 has noted that traditional creativity assessments typically involve elements of both divergent and convergent thinking rather than isolating either process. Their findings indicate that purportedly divergent and convergent tasks show limited correlation, questioning whether they reflect different components of the same higher-level construct (creativity). This work highlights the complexity of measuring creative processes and informs our approach to evaluating idea generation in LLMs.
Broader theoretical frameworks further enrich this picture. For instance, Margaret Boden18 distinguishes between psychological creativity (P-creativity, novel to the individual) and historical creativity (H-creativity, novel to humanity), while Teresa Amabile’s19 componential theory highlights the interplay of domain-relevant skills, creativity-relevant processes (including divergent thinking), and intrinsic task motivation. Melvin Rhodes’20 “4 P’s” (Person, Process, Press, Product) provides a holistic view, reminding us that creativity involves the individual creator, the cognitive processes, the environmental influences, and the resulting outcome. While our approach primarily assesses aspects of the creative “process” (specifically, idea generation fluency and diversity under minimal constraints), understanding these broader theories helps contextualize its scope and limitations within the larger landscape of scientific creativity, which ultimately demands valuable and impactful products.
These theoretical frameworks from human creativity research have informed modern approaches to evaluating LLMs’ idea generation capabilities. Contemporary creativity research has expanded assessment methodologies beyond traditional paper-based tests, with Rafner et al.21 developing CREA, a comprehensive game-based tool for measuring divergent and convergent thinking. Their validation studies across diverse populations have demonstrated the effectiveness of these novel assessment approaches. Benedek’s15 recent work clarifies that creative potential (measured through divergent thinking tasks) explains only limited variance in creative achievement, emphasizing complex interactions between potential, behavior, and environmental factors.
Recent research has systematically explored LLMs’ potential for tasks analogous to human creative processes through various methodological approaches. Rafner et al.5 provided a critical analysis of creativity in the age of generative AI, noting that while these systems now perform comparably to humans on some creativity tests, important qualitative differences remain. Their analysis emphasized the importance of integrating psychological science with computational approaches to develop more effective creativity support tools, highlighting both the promising capabilities and important limitations of current systems. Empirical assessments in creative contexts have shown promising results, exploring LLMs’ potential through various applications. For instance, Meincke et al.22 focused on product ideation, comparing the quality and novelty of ideas generated by GPT-4 against those generated by humans. Comprehensive human evaluation studies23 have compared LLM-generated research proposals with those from NLP researchers, revealing statistically significant advantages in novelty while maintaining comparable feasibility metrics. Complementing these approaches, Lee et al.24 conducted an empirical study investigating the impact of using ChatGPT on human creative performance, finding that it significantly enhanced the articulation and creativity of human-generated ideas.
Studies directly examining divergent thinking abilities in LLMs have employed various methodological approaches. Cropley25 analyzed ChatGPT on the Divergent Association Task and found that while GPT-4 demonstrated strong semantic distance capabilities in generating unrelated words, its inconsistency and predictability highlighted important differences from human creative processes. This research emphasized the distinction between divergent thinking abilities and true creativity, suggesting the importance of understanding both the capabilities and limitations of these systems. Marrone, Cropley, and Medeiros26 further examined how narrow AI impacts human creativity, identifying both supportive functions and important limitations in creative processes. However, Wenger and Kenett27 found that while LLMs performed on par with humans in individual creativity tests, they exhibited significantly lower response diversity at the group level, showing high homogeneity in creative outputs even across different model families. This creative homogeneity represents an important limitation in current systems. Confirming this potential downside, Doshi et al.28, in an experimental study on story writing, found that while access to generative AI ideas caused stories to be evaluated as more creative individually (especially for less creative writers), these AI-enabled stories were significantly more similar to each other than stories produced by humans alone, pointing to a potential reduction in collective novelty. These findings collectively underscore the importance of evaluating not just the quality but also the diversity and novelty of AI-generated creative content.
The assessment methodology for LLM creativity has evolved through several sophisticated frameworks designed for broad evaluation. For instance, Lu et al.29 evaluated text-to-code creativity using programming challenges combined with prompting strategies designed to force novel solutions by restricting previously used techniques, though potential reliance on static problem sets raises limitations. Building upon foundational creativity theories7, Zhao et al.30 proposed a comprehensive, albeit static, framework adapting established psychological test tasks to assess general creativity across multiple dimensions. These frameworks offer valuable, comprehensive assessments for their respective areas, often incorporating both divergent and convergent aspects or multiple task types with richer contextual inputs.
In the context of reasoning mechanics, Chain-of-thought (CoT) prompting has emerged as a powerful technique for enhancing LLMs’ reasoning capabilities. Following the initial introduction of CoT prompting31, several advanced variants have been developed32,33, enabling models to explore multiple reasoning paths simultaneously. Recent investigations into programming applications34 have demonstrated that incorporating brainstorming significantly improves LLMs’ performance. A common thread among these approaches is their implicit reliance on divergent thinking - the ability to generate multiple distinct solutions or paths from a single starting point.
Furthermore, recent years have seen significant advances in automated scientific systems aimed at aiding or accelerating discovery, each offering unique approaches. The AI Scientist framework35 demonstrated potential for end-to-end research automation, while Nova36 introduced iterative planning mechanisms for idea development. Systems like ResearchAgent37 and Scideator38 have further refined these approaches through knowledge graph integration. Pushing the boundaries further, the recent AI co-scientist framework from Google39 showcases a sophisticated multi-agent system designed to generate novel research hypotheses and detailed research proposals. The system employs specialized agents for generation, reflection, ranking, evolution, proximity analysis, and meta-review to iteratively refine scientific hypotheses through a self-improving cycle, with generated hypotheses later validated through real-world laboratory experiments conducted by human researchers. Similarly, Wang and colleagues’40 SCIMON system leverages literature retrieval and iterative novelty enhancement to generate scientific ideas, while Gu and Krenn41,42 demonstrated approaches using knowledge graphs and LLMs to suggest research directions. However, these specialized systems typically rely heavily on extensive knowledge bases, complex contextual inputs, or planning processes to generate and refine ideas or research plans. While highly effective for their intended applications, this dependence on rich context and complex architectures makes it challenging to isolate and evaluate their fundamental capabilities for initial idea generation from minimal cues, particularly their capacity for divergent thinking. This limitation is particularly evident in systems like SciPIP43 and IdeaSynth44, whose structure encourages integrating existing information rather than generating diverse possibilities from minimal input.
The challenges in evaluating creative output at scale necessitate automated assessment methods, and the emergence of LLMs as evaluation tools has opened new possibilities for meeting this need. Various evaluation approaches45–50 have demonstrated the feasibility of using LLMs to evaluate other models’ responses, offering advantages in terms of efficiency and cost-effectiveness compared to human evaluation51. Recent advances, such as the jury-based framework52 and reference-guided verdict method53, have shown promising results in reducing individual model biases and achieving high agreement with human judgments. The development of specialized evaluation models like Prometheus 254 further underscores the growing sophistication in this area. However, most existing approaches focus on evaluating responses against predetermined criteria or reference answers, making them better suited for convergent thinking tasks than divergent thinking assessment. Moreover, the evaluation of scientific creativity presents unique challenges that go beyond traditional metrics, requiring simultaneous assessment of originality, feasibility, clarity, fluency, and flexibility.
The landscape of artificial intelligence has undergone a dramatic transformation since the emergence of ChatGPT in late 2022, with LLMs demonstrating increasingly sophisticated capabilities in scientific contexts. As we approach late 2024, these models have surpassed human expert baselines on various standardized benchmarks across multiple dimensions, exemplified by AlphaGeometry55. Yet, these advances in general intelligence prompt a fundamental question: Does the potential for scientific idea generation in LLMs grow in tandem with their analytical capabilities? This question becomes particularly pertinent when evaluating LLMs’ potential contributions to scientific innovation. Current evaluation benchmarks predominantly rely on rich contextual inputs, such as research paper titles, abstracts, or complete articles. While these approaches effectively assess models’ ability to comprehend and synthesize existing knowledge, they are not designed to systematically evaluate a crucial aspect of scientific thinking - the capacity to generate novel ideas from limited information. This limitation is particularly significant, as many groundbreaking scientific discoveries originate from unexpected connections and conceptual leaps2.
In this work, we introduce LiveIdeaBench (Fig. 1), a novel evaluation benchmark designed to assess LLMs’ capabilities in divergent thinking for scientific idea generation under constrained conditions. Unlike existing benchmarks that predominantly evaluate convergent thinking by requiring models to derive single correct solutions from rich contextual information, LiveIdeaBench is explicitly grounded in Guilford’s seminal theory of divergent production7. Our methodology operationalizes key principles from this framework: we use single-keyword prompts as minimal context. This constraint encourages models to generate connections and concepts primarily from their internal knowledge representations, rather than synthesizing readily available information from detailed prompts (e.g., abstracts or complete articles), thus probing their capacity for less scaffolded, more internally-driven ideation relevant to initial brainstorming. This approach is analogous to the broad stimuli used in classic divergent thinking tasks (e.g., the “Utility Test” described by Guilford), to elicit the generation of multiple and varied potential ideas, fostering a “ready flow of ideas” rather than convergence towards a predefined outcome. This approach necessitates that models “produce their own answers”, a hallmark of divergent production assessment distinct from selection-based or highly constrained convergent thinking tasks.
Fig. 1. Overall design of the LiveIdeaBench benchmark.
a Over 1000 scientific keywords, representing diverse domains, are used in prompts for the Idea LLMs, encouraging divergent thinking and the generation of novel scientific ideas. b Sampled Judge LLMs evaluate the generated ideas across three primary dimensions: originality, feasibility and clarity, assigning numerical scores to each idea. c The evaluation panel comprises the top 10 state-of-the-art models selected from LiveBench, ensuring robust assessment through sampling and ensemble scoring. d Fluency scores are derived by analyzing the diversity and substantive differences among ideas generated from the same keyword (using a randomly sampled judge), while originality, feasibility, and clarity metrics are combined for integrated evaluation. e Following Guilford’s creativity theory, the evaluation methodology assesses five critical dimensions: originality, feasibility, clarity, fluency, and flexibility, with flexibility computed as the 30th percentile of the averaged scores across the other four dimensions. f The LiveIdeaBench benchmark provides a comprehensive dataset of generated ideas, evaluation metrics, and a dynamic leaderboard tracking the performance of over 40 models.
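The flexibility aggregation described in panel e can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the caption specifies only that flexibility is the 30th percentile of the scores averaged over the other four dimensions, so the per-keyword grouping, the linear-interpolation percentile rule, and the names `percentile` and `flexibility_score` are assumptions, and the toy scores are invented.

```python
from statistics import mean

# Dimension names follow the caption; everything else is illustrative.
OTHER_DIMENSIONS = ("originality", "feasibility", "clarity", "fluency")


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation
    between neighbouring order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def flexibility_score(per_keyword_scores, q=30.0):
    """Average the four other dimensions for each keyword (assumed grouping),
    then take a low percentile so that weak keywords pull the score down."""
    averaged = [
        mean([scores[d] for d in OTHER_DIMENSIONS])
        for scores in per_keyword_scores
    ]
    return percentile(averaged, q)


if __name__ == "__main__":
    # Invented scores for three hypothetical keywords.
    toy = [
        {"originality": 4, "feasibility": 4, "clarity": 4, "fluency": 4},
        {"originality": 6, "feasibility": 6, "clarity": 6, "fluency": 6},
        {"originality": 8, "feasibility": 8, "clarity": 8, "fluency": 8},
    ]
    print(flexibility_score(toy))
```

Using a low percentile rather than the mean rewards models whose quality holds up across many keywords, since a few strong keywords cannot compensate for a weak lower tail.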
By employing a specialized multi-model jury system, our evaluation directly measures core divergent production dimensions like fluency (quantity and diversity of ideas, reflecting the capacity to generate a “number of varied responses”) and originality (novelty of ideas), as conceptualized by Guilford, across 41 state-of-the-art models using over 1000 scientific keywords from 22 distinct domains. While acknowledging that comprehensive scientific creativity involves an interplay of both divergent and convergent processes, LiveIdeaBench focuses on and evaluates the foundational, generative phase: the ability to produce diverse scientific concepts from sparse cues. This capacity is critical for sparking innovation but is often overlooked by conventional benchmarks. Thus, our focus is on assessing this specific aspect of creative potential relevant to scientific discovery, rather than evaluating the entire scientific process, which also involves deep domain expertise and subsequent validation. We demonstrate that general intelligence metrics do not necessarily correlate with capacity for scientific ideation; notably, models such as QwQ-32B-preview achieve performance comparable to established systems like Claude-3.7-sonnet: thinking despite significant gaps in general benchmarks. We posit that such specialized evaluation serves as a crucial initial step towards developing effective human-AI hybrid intelligence systems56–58, providing the necessary insights to leverage LLMs as collaborative tools for scientific discovery.
Results
We assessed the performance of over 40 language models across multiple dimensions of scientific ideation using our comprehensive evaluation benchmark. A detailed description of the generated idea dataset is provided in Supplementary Note 2, and its constituent fields are listed in Supplementary Table S1. The evaluation results are visualized in Fig. 2. For detailed outcomes, including raw scores and examples, please refer to Supplementary Table S2; aggregated performance metrics and model rankings are presented in Supplementary Table S3.
Fig. 2. Performance comparison of models evaluated on LiveIdeaBench.
a Dimensional scores (originality, feasibility, fluency, clarity, and flexibility) and overall performance (red line, with 95% confidence intervals, n = 6987) for open-weight and proprietary models. The central line of the box represents the median, the box spans the 25th to 75th percentiles, and the whiskers extend to the minimum and maximum values within 1.5 times the interquartile range. b Multidimensional performance profiles of representative models across the five evaluation dimensions. c Word cloud visualization of scientific keywords. For detailed scores and 95% CIs for each model, see Supplementary Table S3.
Diverse model performance across scientific domains
Our quantitative assessment and comparative analysis reveal distinct variations in model capabilities across scientific disciplines (Fig. 3). While gemini-2.0-flash-thinking-exp, deepseek-r1, and claude-3.7-sonnet: thinking demonstrate high overall scores, the pattern of their relative strengths exhibits domain specificity. Importantly, although larger and more recent architectures generally achieve superior results, their advantage is not uniform across disciplines, suggesting that domain expertise and reasoning capabilities are not solely determined by model scale. For example, gemini-2.0-flash-thinking-exp excelled in anthropology ideation; claude-3.7-sonnet: thinking performed best in chemistry, medicine, and data science; while deepseek-r1 demonstrated stronger relative performance in physics.
Fig. 3. Model performance on LiveIdeaBench across scientific categories.
The heatmap displays average performance scores with 95% confidence intervals for model-discipline combinations. Scientific categories were classified using SciBERT82 through semantic similarity computation, following the framework of ref. 83. Higher scores (pink) indicate better idea generation ability within each discipline. Numbers in parentheses following each scientific category indicate the keyword count associated with that discipline. Categories are sorted by keyword count.
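Classification by semantic similarity can be illustrated with a nearest-category rule over embedding vectors. The sketch below is a simplified assumption rather than the procedure of the cited framework: the three-dimensional vectors are invented stand-ins for SciBERT embeddings, and `cosine_similarity` and `assign_category` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def assign_category(keyword_vec, category_vecs):
    """Return the category whose vector is most similar to the keyword vector."""
    return max(
        category_vecs,
        key=lambda name: cosine_similarity(keyword_vec, category_vecs[name]),
    )


if __name__ == "__main__":
    # Invented toy embeddings; real ones would come from an encoder model.
    categories = {
        "physics": [1.0, 0.0, 0.1],
        "chemistry": [0.0, 1.0, 0.1],
    }
    print(assign_category([0.9, 0.2, 0.0], categories))
```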
Distinction between general intelligence and scientific idea generation
The comparison between LiveIdeaBench and LiveBench metrics (see Fig. 4 and Supplementary Fig. S4) uncovers a notable disconnect between general intelligence and scientific idea generation capabilities, with contrasting patterns across models. While Claude-3.7-sonnet: thinking leads in general intelligence (LiveBench) and also shows high effectiveness on scientific ideation (LiveIdeaBench), its idea generation capabilities are matched by models like qwq-32b-preview (ranked 8/41 in LiveIdeaBench). Notably, qwq-32b-preview exhibits low general intelligence scores, representing a profile of low general intelligence but high scientific ideation capability. As Fig. 4 illustrates, other patterns exist, such as high general intelligence paired with lower scientific ideation capability (e.g., o3-mini-high). These contrasting profiles underscore that scientific idea generation capability, as measured by LiveIdeaBench, is poorly predicted by general intelligence, necessitating specialized benchmarks like LiveIdeaBench for evaluating this aspect of LLM potential. The correlation between the two benchmarks is positive and statistically significant but weak (r(41) = 0.357, p = 0.038, r² = 0.127, 95% CI = [0.022, 0.621]; see Supplementary Fig. S4), which suggests that fostering scientific idea generation capabilities in these systems may benefit from distinct development approaches compared to enhancing general problem-solving skills.
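To make the reported statistics concrete, the following sketch computes a Pearson correlation, its square, and a confidence interval. The coefficient of determination is simply the square of r (0.357² ≈ 0.127), so general intelligence scores account for roughly 13% of the variance in ideation scores. The text does not state how the interval was obtained; the Fisher z-transformation used here is an assumption, and the paired scores in the example are invented.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for r via the Fisher z-transformation.
    Requires n > 3 and |r| < 1."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    # Invented scores for five hypothetical models on two benchmarks.
    general = [1, 2, 3, 4, 5]
    ideation = [2, 1, 4, 3, 5]
    r = pearson_r(general, ideation)
    low, high = fisher_ci(r, len(general))
    print(f"r = {r:.3f}, r^2 = {r * r:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```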
Fig. 4. Comparison of LiveIdeaBench (upper half, assessing scientific ideation) and LiveBench (lower half, assessing general intelligence, denoted as LB) across various evaluation metrics for the same set of models.
The middle rows show contrasting trends between models in average performance. While claude-3.7-sonnet: thinking achieves the highest average score on LiveBench, its performance on scientific ideation tasks (LiveIdeaBench) is comparable to qwq-32b-preview, which ranks fourth from last on general intelligence metrics. The arrows highlight three representative patterns: claude-3.7-sonnet: thinking (left) exemplifies high general intelligence combined with high scientific ideation capability; o3-mini-high (middle) shows high general intelligence but low scientific ideation capability; and qwq-32b-preview (right) demonstrates low general intelligence but high scientific ideation capability. These contrasting patterns highlight that LLMs' scientific idea generation capability, as measured by LiveIdeaBench, is distinct from their general intelligence capabilities (e.g., reasoning, coding, and math), underscoring the necessity of LiveIdeaBench for evaluating scientific ideation potential.
Impact of model architecture and training strategies
We observe that Mistral-small, which ranks near the bottom in general intelligence, demonstrates remarkably high scientific divergent thinking capabilities. Even more interestingly, the earlier released Mistral-small model scored significantly higher in creativity than the newer same-size, same-family model Mistral-small-24b-instruct-2501. We also observe that models from the same family (for example, qwq-32b and qwq-32b-preview) score similarly in creativity while differing dramatically in general intelligence. These two pairs of examples strongly suggest that a model’s divergent thinking and convergent thinking abilities cannot reliably predict each other. This pattern holds true even when comparing models from the same family with different parameter counts. Focusing on mistral-large-2411 and mistral-small (Fig. 4), we see that mistral-small still slightly outperforms mistral-large-2411 in scientific divergent thinking, despite the latter having 123 billion parameters, far exceeding the former’s 24 billion. Parameter count alone therefore does not appear to be the primary determinant of scientific divergent thinking capability; other factors, such as training data or architectural nuances, likely play significant roles.
Trade-offs in scientific originality and feasibility
The Pareto front visualization (see Fig. 5) illustrates clear trade-offs between feasibility and originality. While Claude-3.7-sonnet: thinking achieves the highest originality with moderate feasibility, Nova-pro-v1 demonstrates the opposite pattern. Models like deepseek-r1, qwq-32b, and gemini-2.0-flash-exp exhibit balanced effectiveness between these two dimensions. Particularly, deepseek-r1 stands out for its exceptional all-around capabilities across all measured dimensions of scientific idea generation, demonstrating that balanced effectiveness is achievable despite common tendencies toward specialization. Additionally, while models like Claude-3.7-sonnet: thinking and several Gemini variants show strong fluency and flexibility, the mediocre results of o3-mini-high, known for logical reasoning, highlight the specific nature of the capabilities measured by this benchmark.
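The notion of a Pareto front used here can be stated precisely: a model lies on the front if no other model is at least as good on both feasibility and originality and strictly better on at least one. The sketch below is a minimal illustration of that definition; the function name `pareto_front`, the model labels, and the scores are all invented for the example.

```python
def pareto_front(points):
    """Return the names of non-dominated entries.

    `points` maps a name to a (feasibility, originality) pair, where higher
    is better on both axes.
    """
    front = []
    for name, (feas, orig) in points.items():
        dominated = any(
            other_feas >= feas
            and other_orig >= orig
            and (other_feas > feas or other_orig > orig)
            for other, (other_feas, other_orig) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


if __name__ == "__main__":
    # Invented scores for four hypothetical models.
    scores = {
        "model_a": (8.0, 5.0),  # feasible but less original
        "model_b": (5.0, 8.0),  # original but less feasible
        "model_c": (6.0, 6.0),  # balanced
        "model_d": (4.0, 4.0),  # dominated by model_c
    }
    print(pareto_front(scores))
```

Under this definition, both a highly original model with moderate feasibility and a highly feasible model with moderate originality can sit on the front, which is why the two extremes and the balanced models all appear along the same frontier.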
Fig. 5. Pareto front visualization of model performance on LiveIdeaBench.
This illustrates the trade-off between feasibility and originality across different language models, with bubble size, color gradient and edge width representing fluency, flexibility and clarity scores, respectively. The distribution reveals a clear Pareto frontier, where claude-3.7-sonnet: thinking achieves the highest originality but moderate feasibility, while nova-pro-v1 demonstrates the opposite pattern. Models such as deepseek-r1, qwq-32b, and gemini-2.0-flash-exp exhibit relatively balanced performance between these two dimensions. In terms of fluency and flexibility, claude-3.7-sonnet, claude-3.7-sonnet: thinking, gemini-2.0-flash-exp, gemini-2.0-pro-exp-02-05 and gemini-2.0-flash-thinking-exp show particularly strong performance. Notably, deepseek-r1 stands out for its exceptional all-round performance across all dimensions of scientific idea generation measured by this benchmark. Moreover, while o3-mini-high is renowned for its proficiency in logical and mathematical reasoning, it delivered a mediocre performance on this benchmark. For detailed scores and 95% CIs for each model, see Supplementary Table S3.
Independence of idea quality from length
As shown in Fig. 6, while the majority of models generally adhered well to the 100-word limit specified in the prompt (Fig. 6b), their mean originality, feasibility, and clarity scores demonstrated significant variations across different models (Fig. 6a, c–e). Further analysis of idea length (see Supplementary Figs. S1 and S2) reveals a statistically significant but very weak positive correlation with idea quality (ρ(286490) = 0.089, p < 0.0001, ρ² = 0.008, 95% CI = [0.085, 0.093]). Even for models specifically designed for reasoning, as detailed in Supplementary Note 8, the relationship between the length of their thoughts and the generated idea quality remains very weak. This observation lends further support to the notion that effective idea generation is not merely a function of extensive logical elaboration, distinguishing it from certain aspects of general reasoning.
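The length analysis can be illustrated with a rank correlation, assuming that ρ denotes Spearman's coefficient (the text does not name the estimator). Squaring the reported value gives 0.089² ≈ 0.008, meaning that length ranks account for under 1% of the variance in quality ranks. The sketch below computes Spearman's ρ as the Pearson correlation of average ranks; the helper names and the toy lengths and scores are invented.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented idea lengths (words) and quality scores.
    lengths = [50, 80, 95, 100, 120]
    quality = [5, 6, 6, 7, 9]
    rho = spearman_rho(lengths, quality)
    print(f"rho = {rho:.3f}, rho^2 = {rho * rho:.3f}")
```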
Fig. 6. Comprehensive analysis of idea quality, length, and component scores.
a Mean composite idea scores (average of Originality, Feasibility, and Clarity) with 95% confidence intervals for different language models. b Mean idea length in words across models with specific word counts labeled. Detailed breakdown of average performance on individual dimensions: Originality (c), Feasibility (d), and Clarity (e) scores with 95% confidence intervals (n = 6987). For the regression plot between clarity scores and originality, feasibility scores across all generated ideas, see Supplementary Fig. S5.
Discussion
LiveIdeaBench represents a comprehensive benchmark specifically designed to evaluate LLMs’ divergent thinking capabilities in scientific innovation. Existing LLM benchmarks predominantly focus on problem-solving tasks such as logical reasoning, mathematical computation, and code generation. These benchmarks inherently assess convergent thinking, that is, the ability to arrive at predetermined correct answers through structured problem-solving (e.g., selecting the right option in multiple choice questions, completing text with expected words, or fixing code to match specific requirements). This stands in contrast to divergent thinking, which involves generating diverse, novel solutions from minimal contextual input.
Furthermore, our benchmark incorporates mechanisms to address potential data contamination and overfitting issues that commonly plague static benchmarks. Traditional evaluation methods may encourage models to perform well on specific test cases without developing generalizable creative thinking abilities. Our approach employs a dynamic judge panel comprising multiple state-of-the-art models, randomly sampling multiple LLMs for evaluation and employing ensemble scoring methods. This design not only reduces individual model biases but also leverages the continually updated knowledge bases of judge models, mitigating the limitations associated with fixed benchmarks. This methodology aligns with recent advances in live benchmarking59,60, which similarly address data contamination and overfitting concerns through dynamic evaluation mechanisms.
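The judging scheme described above, in which a subset of the panel is drawn at random and the resulting scores are combined, can be sketched as follows. This is an illustrative outline under stated assumptions rather than the benchmark's code: the number of sampled judges, the use of a plain mean as the ensemble rule, the judge labels, and the stub `toy_score_fn` (standing in for a real call to a judge model) are all hypothetical.

```python
import random
from statistics import mean

# The three judged dimensions named in the benchmark description.
DIMENSIONS = ("originality", "feasibility", "clarity")


def sample_judges(panel, k, rng):
    """Draw k distinct judges from the panel without replacement."""
    return rng.sample(panel, k)


def ensemble_score(idea, judges, score_fn):
    """Average each dimension over the sampled judges.

    `score_fn(judge, idea)` must return a dict with one score per dimension;
    in practice it would wrap a request to the judge model.
    """
    collected = {dim: [] for dim in DIMENSIONS}
    for judge in judges:
        scores = score_fn(judge, idea)
        for dim in DIMENSIONS:
            collected[dim].append(scores[dim])
    return {dim: mean(values) for dim, values in collected.items()}


def toy_score_fn(judge, idea):
    """Deterministic stand-in for a judge model (hypothetical)."""
    base = int(judge.rsplit("_", 1)[1])
    return {"originality": base, "feasibility": 10 - base, "clarity": 5}


if __name__ == "__main__":
    panel = [f"judge_{i}" for i in range(10)]  # placeholder judge names
    rng = random.Random(0)  # seeded only to make the sketch repeatable
    judges = sample_judges(panel, 3, rng)
    print(judges, ensemble_score("an example idea", judges, toy_score_fn))
```

Averaging over a freshly sampled subset means that no single judge's preferences determine an idea's score, and the panel list can be swapped for newer models without changing the scoring code.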
Our analysis through LiveIdeaBench yields several notable insights into LLMs’ scientific idea generation capabilities. Most notably, we find that a model’s performance on these divergent thinking tasks is not strongly coupled with its performance on general intelligence benchmarks. This suggests that fostering scientific idea generation capabilities in these systems may benefit from distinct development approaches compared to enhancing general problem-solving skills, such as prioritizing training objectives that reward diverse conceptual associations rather than convergent accuracy. These insights are particularly vital for constructing hybrid human-AI frameworks, where specialized models can be tailored to augment specific stages of the human creative process. The varying strengths we observe across different model architectures, particularly in originality versus feasibility trade-offs, point to potential complementarity in scientific applications.
These findings, particularly the distinct performance profiles where some models excel in ideation despite lower general intelligence, indicate that LLMs’ divergent thinking capabilities operate largely independently from the convergent thinking abilities typically measured by problem-solving tasks. We believe this phenomenon is closely tied to variations in the relevance of pre-training data to scientific tasks, differences in post-training methodologies applied, and inherent architectural properties, rather than being determined by model scale alone. This observed difference between metrics underscores the importance of specialized evaluation benchmarks like LiveIdeaBench that specifically target these idea generation capabilities, rather than attempting to infer them from general intelligence assessments.
However, several limitations warrant consideration. The use of contemporary state-of-the-art models as judges makes temporal comparisons difficult. When the judge panel composition changes with model updates, direct performance comparisons across different evaluation periods become unreliable. While this limitation mirrors challenges faced by other dynamic benchmarks like LiveBench, it limits our ability to track longitudinal trends in model capabilities. Furthermore, our evaluation includes several proprietary, closed-source models (e.g., GPT-4o). Access to these models is typically via APIs, which may be subject to change, and the underlying models can be updated without public notice, potentially impacting exact reproducibility. While including these models is essential for a comprehensive assessment of the current state of the art, we acknowledge this inherent limitation regarding reproducibility compared to evaluations focused solely on open-weight models. Moreover, the LLM-as-a-judge approach itself introduces a potential source of bias. Aligned models can exhibit sycophancy61–63, a tendency to agree with the user or produce agreeable outputs, which may lead to inflated absolute scores and a compressed score range, further reinforcing the need to focus on relative rankings rather than absolute scores. Additionally, we observe an inherent tension between models’ safety constraints and creative evaluation. Some models exemplify this tension by declining to generate ideas for potentially sensitive keywords (e.g., “ecotoxicology”). While such safety measures are crucial, they can negatively impact creativity scores, potentially undervaluing models with stronger ethical constraints. The potential for hallucinations in the generated ideas themselves64 also underscores the necessity of human oversight in practical applications.
Furthermore, we recognize a fundamental challenge in the reliability of LLM-as-a-Judge approaches. When evaluating scientific ideas containing concepts outside the judge models’ knowledge boundaries, these models might misunderstand novel concepts and consequently misjudge their originality or feasibility. Our dynamic panel of state-of-the-art judge models likely provides broader and more current knowledge coverage than static panels, and the multi-model ensemble mitigates the issue to some extent, but the limitation persists. More comprehensive solutions could involve retrieval-augmented generation (RAG) approaches that incorporate up-to-date scientific literature. By augmenting LLM judges with the ability to retrieve and reference the latest scientific papers, such approaches could significantly enhance evaluation accuracy, particularly for highly specialized or cutting-edge scientific domains. Implementing effective RAG systems for this purpose presents its own challenges regarding retrieval relevance and integration with the judging process, but it represents a promising direction for future work that could further strengthen the reliability of LLM-based creativity assessments. To empirically assess this reliability, we conducted a human expert validation focused on the Partial Differential Equations (PDE) domain (see Supplementary Note 7). The results showed encouraging alignment between human experts and LLM judgments, particularly for originality, lending empirical support to the LLM-as-a-judge approach, at least within this specific domain.
Looking ahead, several research directions emerge. To address temporal comparability issues, normalized scoring mechanisms could maintain meaningful cross-temporal comparisons while preserving the advantages of dynamic evaluation. The comprehensive dataset generated through our evaluations offers opportunities for training scientific language models, mining ideation patterns, and investigating novel scientific ideas. Future work must also tackle the challenge of fairly evaluating creative potential while accounting for ethical constraints, possibly through domain-specific scoring adjustments or separate evaluation tracks for models with different safety priorities.
The implications of our findings extend beyond model evaluation to the broader landscape of AI-assisted scientific discovery. As LLMs demonstrate increasingly sophisticated idea generation capabilities, they hold promise as powerful tools for accelerating scientific innovation, from hypothesis generation to experimental design65,66. Our benchmark provides a foundation for understanding and improving these capabilities, potentially enabling more effective human-AI collaboration56–58 and informing the design of human-aware AI systems for science67 by leveraging specific model profiles to complement human cognitive blind spots in pushing the boundaries of scientific knowledge. The creative strengths we observed in different model architectures suggest that a diverse ecosystem of AI tools, each with complementary capabilities, could support different aspects of the scientific process, for instance, by pairing high-originality models for initial hypothesis generation with high-feasibility models for experimental implementation.
These findings and challenges point toward a broader research agenda: understanding how to nurture and evaluate machine creativity while maintaining essential safety guardrails, ultimately in the service of advancing scientific discovery. Through continued refinement, LiveIdeaBench aims to serve as a key tool in this evolving landscape of AI capability assessment and scientific innovation.
Finally, a practical consideration is the environmental cost associated with the extensive LLM usage required for comprehensive benchmarking68. Using the EcoLogits Calculator69 to estimate the carbon footprint of our benchmark evaluations, we calculate a total emission of approximately 3074 kg CO2eq for the full evaluation run reported here. While necessary for rigorous assessment, this highlights the significant energy demands of current AI systems. A detailed breakdown of the estimated emissions per model and per role (idea generator vs. judge) can be found in Supplementary Table S4.
Methods
Building upon Guilford’s foundational theory of divergent thinking, we develop a comprehensive evaluation methodology (see Fig. 1) that quantitatively assesses five fundamental dimensions in scientific idea generation. While Guilford’s original framework provides the theoretical underpinnings, we extend and operationalize these concepts specifically for evaluating LLMs’ scientific idea generation capabilities within our LiveIdeaBench benchmark. It is important to note that our methodology evaluates an essential but not exhaustive aspect of scientific creativity, focusing primarily on divergent thinking capabilities.
The human expert validation process in this study was conducted in full accordance with relevant ethical regulations and was approved by the Scientific and Ethical Committee of Renmin University of China (Approval ID: L20250110). All experts provided written informed consent on a voluntary basis without compensation. The study design does not involve sex or gender variables.
Dimensions of evaluation
Originality
Originality assessment focuses on the uniqueness and novelty of generated ideas. We implement this through our critic system, where judge LLMs evaluate each idea’s originality independently (see Fig. 1b and Supplementary Note 1.2). The final originality score for each model is computed as the mean evaluation across all scientific keywords and generated ideas, providing an absolute score that reflects the model’s capacity for novel ideation. To ensure assessment reliability, each generated idea is evaluated by a minimum of three randomly assigned critic LLMs from our panel. This multiple-evaluator approach mitigates potential assessment bias that could arise from relying on a single model’s judgment, thereby enhancing the objectivity and reliability of our evaluation benchmark.
Feasibility
In the context of scientific innovation, the practical implementability and scientific soundness of an idea are paramount. Therefore, our evaluation includes a distinct feasibility dimension, assessing whether a proposed idea is technically achievable and aligns with established scientific principles and constraints. This aligns with the instructions given to the LLMs, which noted feasibility as a key characteristic of a good scientific idea (see Supplementary Note 1.1). Similar to originality and clarity, feasibility scores are determined by our critic system and averaged across all keywords and ideas to produce an absolute metric (see Supplementary Note 1.2). This ensures our benchmark evaluates the practical viability crucial for scientific progress.
Clarity
Our evaluation benchmark incorporates a clarity dimension, directly informed by the prompt provided to the idea-generating LLMs, which notes that good scientific ideas should be clearly articulated (see Supplementary Note 1.1). This dimension assesses the quality of the idea’s expression, focusing on its coherence, logical flow, and comprehensibility, particularly given the constraint of the 100-word limit, which demands concise articulation. While conceptually related to the elaboration aspect in Guilford’s creativity theory, our assessment prioritizes effective and understandable communication within the specified format. Like originality and feasibility, clarity scores are determined by our critic system (see Fig. 1b and Supplementary Note 1.2), involving multiple judges per idea and averaging the results. Assessing clarity acknowledges that the potential impact of a scientific idea depends not only on its novelty and feasibility but also on how effectively it is communicated24.
Fluency
Fluency assessment examines the model’s capacity to generate diverse, non-redundant ideas using identical keywords (see Fig. 1d). Through our judge panel, we evaluate the distinctiveness of generated outputs using a letter-grade scoring system: D indicates academically identical ideas; C represents similar ideas addressing similar problems; B denotes different ideas addressing similar problems; and A signifies completely different ideas addressing different problems. To align with the 1–10 integer scale used for all evaluation dimensions, these four qualitative grades are mapped linearly to the integer scores 1 (for D), 4 (for C), 7 (for B), and 10 (for A), respectively. This mapping ensures consistent scaling across dimensions and maintains equal intervals between the assessed qualitative distinctness levels, enabling precise measurement of genuine idea diversity versus surface-level variations (see Supplementary Note 1.3 for prompts). While simpler diversity metrics examining syntax or semantics would require fewer computational resources, we chose LLM-as-a-Judge for its ability to better capture the nuanced differences between genuinely distinct scientific ideas versus superficial variations. For a benchmark specifically designed to evaluate scientific divergent thinking capabilities, this precision is essential.
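The grade-to-score mapping described above can be illustrated with a minimal sketch. The function and constant names are hypothetical and are not taken from the benchmark's released code; only the four grades and their integer scores come from the text.

```python
# Letter grades from the fluency judge, mapped linearly onto the 1-10 scale.
# Equal steps of 3 keep the intervals between adjacent grades identical.
GRADE_TO_SCORE = {"D": 1, "C": 4, "B": 7, "A": 10}


def fluency_score(grade: str) -> int:
    """Convert a judge's letter grade (A-D) into the integer fluency score."""
    key = grade.strip().upper()
    if key not in GRADE_TO_SCORE:
        raise ValueError(f"unexpected fluency grade: {grade!r}")
    return GRADE_TO_SCORE[key]
```

Normalizing case and whitespace before the lookup is an illustrative choice for handling free-text judge output, not a documented step of the protocol.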
Flexibility
Flexibility measurement evaluates the model’s ability to maintain consistent performance across different scientific domains and contexts. Rather than treating flexibility as an independent metric, we derive it from the distribution of the combined scores (averaging originality, feasibility, clarity, and fluency) across various keywords. Following the principle that a system’s overall effectiveness is constrained by its weakest performing components, we focus on the 30th percentile of this composite score distribution (see Fig. 1e). This percentile choice provides a robust measure of a model’s performance floor while avoiding extreme outliers, enabling us to assess whether its scientific creativity can genuinely generalize to less common or niche domains. The resulting metric identifies models that maintain reliable performance across diverse scientific contexts rather than those exhibiting domain-specific excellence, thus providing a conservative estimate of cross-domain capabilities.
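As an illustration of the flexibility computation, the sketch below averages the four dimension scores per keyword and takes the 30th percentile of the resulting distribution. The text does not state which percentile estimator is used, so linear interpolation between order statistics is an assumption here, and all function names are hypothetical.

```python
from statistics import mean


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics.

    The interpolation rule is an assumption; the source does not specify one.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def flexibility(per_keyword_scores, q=30):
    """per_keyword_scores maps keyword -> (originality, feasibility, clarity, fluency)."""
    composites = [mean(dims) for dims in per_keyword_scores.values()]
    return percentile(composites, q)
```

Because the metric reads off the lower part of the distribution, a model with a few very weak keywords is penalized more than it would be by a plain mean.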
Scientific keyword selection
The keyword set comprises 1180 high-impact scientific keywords (Fig. 2c) across 22 distinct scientific disciplines, selected based on current search engine engagement metrics. Unlike static benchmarks, LiveIdeaBench updates its keyword database monthly to maintain alignment with emerging scientific trends and research frontiers. This automated refresh mechanism ensures the benchmark consistently reflects contemporary scientific discourse and technological advancement, making it particularly valuable for evaluating LLMs’ ability to engage with cutting-edge scientific concepts rather than just established knowledge.
Model selection
LiveIdeaBench maintains a continuously evolving roster of evaluated models by automatically incorporating the top 41 performers from the most recent LiveBench evaluations59. This dynamic selection process ensures our benchmark always tests the latest advancements in language model capabilities. We implement a dual-role system where all models serve as idea generators, while the top 10 performers additionally function as our judge panel (critics), subject to the diversity constraints outlined in the Experimental Protocol. This approach creates a self-updating evaluation benchmark that evolves alongside rapid developments in AI, ensuring that both idea generation and assessment standards reflect current state-of-the-art capabilities. The automatic monthly refresh of both models and evaluation criteria through LiveBench integration helps prevent benchmark staleness and potential gaming of the system, maintaining LiveIdeaBench’s relevance as a contemporary measure of scientific creativity.
Our evaluation benchmark currently encompasses 41 state-of-the-art LLMs based on LiveBench’s March 2025 results. This includes models from major developers such as Anthropic (claude-3.7-sonnet:thinking, claude-3.7-sonnet, claude-3.5-sonnet, claude-3-opus, claude-3.5-haiku-20241022)70; OpenAI (o3-mini-high, gpt-4.5-preview, o1, o3-mini, o1-mini, gpt-4o-2024-11-20, gpt-4-turbo, gpt-4o-mini)71; Google (gemini-2.0-flash-thinking-exp, gemini-2.0-pro-exp-02-05, gemini-2.0-flash-exp, gemini-pro-1.5, gemini-2.0-flash-lite-001, gemma-2-27b-it)72; Qwen (qwq-32b, qwen-max, qwen2.5-dracarys2-72b, qwen-2.5-72b-instruct, qwen-2.5-coder-32b-instruct, qwq-32b-preview, qwen-2.5-7b-instruct)73,74; DeepSeek (deepseek-r1, deepseek-chat (v3), deepseek-r1-distill-llama-70b, deepseek-r1-distill-qwen-32b)75–77; Meta (llama-3.1-405b-instruct, llama-3.3-70b-instruct, llama-3.1-70b-instruct)78; Mistral (mistral-large-2411, mistral-small-24b-instruct-2501, mistral-small (v2409))79; Amazon (nova-pro-v1, nova-lite-v1)80; StepFun (step-2-16k-202411); xAI (grok-2-1212); and Microsoft (phi-4)81. This comprehensive set includes both proprietary and open-weight models, spanning diverse architectures, parameter scales, and training methodologies.
Experimental protocol
We implemented several methodological controls to ensure rigorous evaluation:
Model selection criteria for idea generation
To prevent redundancy in the pool of idea-generating models, we selected only the most recent version of models with multiple temporal variants (e.g., GPT-4o series). Models exhibiting API instability during the evaluation period were also excluded to maintain data quality consistency across all evaluated models.
Judge LLM panel formation, independence, and application
The judge panel, comprising the top 10 models from LiveBench, is formed while applying specific diversity constraints to mitigate potential correlated biases. To ensure broader representation, we limit the contribution from any single organization to a maximum of 20% (i.e., 2 models) of the panel; if the initial top 10 includes more than two models from one organization, the lower-ranked ones are replaced by the next highest-ranked eligible models from different organizations. Additionally, when considering model pairs with identical base models that differ primarily in “reasoning effort” (e.g., o3-mini vs. o3-mini-high), we select only one representative for the judge panel (prioritizing the variant with higher general intelligence scores on LiveBench) to avoid redundancy and the potential amplification of biases inherent to that specific base model. Furthermore, to prevent circular dependency during evaluation, we implement strict independence: when evaluating any specific model’s generated ideas, that model is explicitly excluded from serving on the judge panel for that evaluation round, ensuring independent assessment free from self-evaluation. This established panel is then used through two sampling procedures. For assessing originality, feasibility, and clarity, each individual generated idea is evaluated by a subset of 3 judges randomly sampled from the remaining eligible panel members. The final score for each of these dimensions is the average of the scores provided by these three judges, enhancing assessment robustness. For assessing fluency, which evaluates the diversity of ideas generated for the same keyword by a given model, the comparison is performed by a single judge randomly sampled from the eligible panel members for each keyword-model pair.
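The panel constraints and sampling rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` record, its `base` field (a shared identifier for variants that differ only in reasoning effort), and both function names are hypothetical. The input list is assumed to be sorted by LiveBench score, best first, so that the first variant encountered is the higher-scoring one.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    org: str
    base: str  # hypothetical: same value for variants differing only in reasoning effort


def form_panel(ranked, size=10, max_per_org=2):
    """Walk the ranking from best to worst and keep a model only if its
    organization is under the cap and its base model is not yet on the panel."""
    panel, per_org, bases = [], {}, set()
    for model in ranked:
        if len(panel) == size:
            break
        if per_org.get(model.org, 0) >= max_per_org or model.base in bases:
            continue  # skipped models are replaced by the next eligible one
        panel.append(model)
        per_org[model.org] = per_org.get(model.org, 0) + 1
        bases.add(model.base)
    return panel


def sample_judges(panel, generator_name, k, rng=random):
    """Sample k judges, excluding the model whose ideas are being evaluated."""
    eligible = [m for m in panel if m.name != generator_name]
    return rng.sample(eligible, k)
```

Under this sketch, `k=3` corresponds to the originality, feasibility, and clarity assessments and `k=1` to the fluency comparison. Passing a seeded `random.Random` instance as `rng` makes the sampling repeatable.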
Response standardization
All models were prompted to generate ideas within a 100-word target length, with a maximum allowable threshold of 200 words (see Supplementary Note 1.1). Responses exceeding this limit were excluded from analysis to ensure comparative validity across models.
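A minimal sketch of the length filter is given below. Counting words by splitting on whitespace is an assumption, since the text does not specify the tokenization rule, and the function names are hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def within_limit(idea: str, max_words: int = 200) -> bool:
    """Keep ideas at or below the threshold; longer responses are excluded."""
    return word_count(idea) <= max_words
```

For example, `[idea for idea in ideas if within_limit(idea)]` would retain only the responses that enter the analysis.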
Special implementations
The reasoning-centric architecture of qwq-32b-preview necessitated a modified protocol, incorporating a “**Final Idea:**” delimiter for response parsing (see Supplementary Note 1.4). In cases where parsing failed, critic LLMs evaluated the complete reasoning output to maintain assessment comprehensiveness.
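The delimiter-based parsing with its fallback can be sketched as follows. The delimiter string comes from the text; splitting on its last occurrence and treating an empty tail as a parsing failure are assumptions made for this illustration, and the function name is hypothetical.

```python
FINAL_IDEA_MARKER = "**Final Idea:**"


def extract_final_idea(raw: str) -> tuple[str, bool]:
    """Return (text_to_judge, parsed_ok).

    If the marker is present and followed by text, only that text is returned.
    Otherwise the complete reasoning output is returned for evaluation.
    """
    _, marker, tail = raw.rpartition(FINAL_IDEA_MARKER)
    idea = tail.strip()
    if marker and idea:
        return idea, True
    return raw.strip(), False
```

Returning the flag alongside the text lets downstream analysis track how often the fallback path was taken.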
Handling refused responses
To fairly assess models, especially those with strong safety alignments that might refuse prompts for sensitive keywords, we implemented a two-step refusal handling protocol. If an initial idea generation request is refused (detected via specific keywords detailed in Supplementary Note 5), a fallback prompt reframing the task within an academic context is used for a second attempt. This ensures models are not unduly penalized for safety constraints when they might still be capable of generating relevant scientific ideas under appropriate framing. Further details and refusal rates are provided in Supplementary Note 5.
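The two-step protocol can be illustrated with the sketch below. The actual refusal keywords and prompt wording are given in Supplementary Note 5 and are not reproduced here, so `REFUSAL_MARKERS` and the prompt templates passed in by the caller are placeholders. The `ask` argument is a hypothetical callable that sends a prompt to a model and returns its text response.

```python
# Placeholder markers; the real list is in Supplementary Note 5.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable", "i am unable")


def is_refusal(response: str) -> bool:
    """Keyword-based refusal detection on the lowercased response."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def generate_with_fallback(ask, keyword, primary_prompt, fallback_prompt):
    """Try the primary prompt; on refusal, retry once with the academic reframing.

    Returns (response, used_fallback). The second response is returned as is,
    so the caller decides how to treat a repeated refusal.
    """
    first = ask(primary_prompt.format(keyword=keyword))
    if not is_refusal(first):
        return first, False
    second = ask(fallback_prompt.format(keyword=keyword))
    return second, True
```

Keeping the model call behind a plain callable keeps the sketch independent of any particular API client.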
Statistical analysis
Statistical correlations between evaluation metrics were assessed using either the Pearson correlation coefficient (r) or Spearman’s rank correlation coefficient (ρ), depending on the normality of the data distributions. Data normality was evaluated using the Shapiro-Wilk test. Pearson’s r was applied to normally distributed continuous variables (e.g., comparing LiveIdeaBench and LiveBench scores), while Spearman’s ρ was utilized for non-normally distributed data (e.g., analyzing the relationship between idea length and quality scores). Detailed mathematical definitions and formulations for these statistical metrics are provided in Supplementary Note 9.
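The choice between the two coefficients can be sketched in plain Python. The normality decision is passed in as a boolean because the standard library has no Shapiro-Wilk test; in practice it could come from a statistics package (for instance `scipy.stats.shapiro`). All function names are hypothetical, and ties are handled with average ranks, which is the usual convention but an assumption with respect to this study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))


def correlation(x, y, both_normal):
    """Pearson's r for normally distributed data, Spearman's rho otherwise."""
    if both_normal:
        return "pearson", pearson_r(x, y)
    return "spearman", spearman_rho(x, y)
```

Computing Spearman's ρ as the Pearson correlation of ranks keeps the two branches consistent and handles ties without a separate formula.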
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
The work is supported by the National Natural Science Foundation of China (No. 92270118, No. 62276269), the Beijing Natural Science Foundation (No. 1232009), and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB0620103). In addition, H.S. and Y.L. would like to acknowledge the support from the Fundamental Research Funds for the Central Universities (Nos. 202230265 and E2EG2202X2).
Author contributions
K.R., X.W., J.H., P.W., Y.L., and H.S. contributed to the ideation and design of the research; K.R., X.W. and J.H. performed the research; H.S. supervised the project; all authors contributed to the research discussions, writing, and editing of the paper.
Peer review
Peer review information
Nature Communications thanks Yoed Kenett, who co-reviewed with Tuval Raz; Giorgio Franceschelli; Janet Rafner and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Data availability
The data generated in this study have been deposited in Hugging Face at https://huggingface.co/datasets/6cf/liveideabench-v2 (also see the Zenodo repository at 10.5281/zenodo.17707879).
Code availability
All source code used to reproduce the results in this study is available on GitHub at https://github.com/x66ccff/liveideabench (also see the Zenodo repository at 10.5281/zenodo.17707647).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-026-70245-1.
References
- 1.Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature620, 47–60 (2023). [DOI] [PubMed] [Google Scholar]
- 2.Shi, F. & Evans, J. Surprising combinations of research contents and contexts are related to impact and emerge with scientific outsiders from distant disciplines. Nat. Commun.14, 1641 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.AI4Science, M. R. & Quantum, M. A. The impact of large language models on scientific discovery: a preliminary study using GPT-4. Preprint at 10.48550/arXiv.2311.07361.
- 4.Krenn, M. et al. Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network. Nat. Mach. Intell.5, 1326–1335 (2023). [Google Scholar]
- 5.Rafner, J., Beaty, R. E., Kaufman, J. C., Lubart, T. & Sherson, J. Creativity in the age of generative AI. Nat. Hum. Behav.7, 1836–1838 (2023). [DOI] [PubMed] [Google Scholar]
- 6.Getzels, J.W., Jackson, P.W. Creativity and Intelligence: Exploration with Gifted Students (John Wiley & Sons, 1962).
- 7.Guilford, J.P. The nature of human intelligence (McGraw-Hill, 1967).
- 8.Gralewski, J., Weremczuk, E. & Karwowski, M. Intelligence and creativity of Polish middle-school students: Looking for the threshold hypothesis. New Educ. Rev.29, 328–338 (2012). [Google Scholar]
- 9.Jauk, E., Benedek, M., Dunst, B. & Neubauer, A. C. The relationship between intelligence and creativity: New support for the threshold hypothesis by means of empirical breakpoint detection. Intelligence41, 212–221 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Kim, K. H. Can only intelligent people be creative? A meta-analysis. J. Second. Gift. Educ.16, 57–66 (2005). [Google Scholar]
- 11.Preckel, F., Holling, H. & Wiese, M. Relationship of intelligence and creativity in gifted and non-gifted students: an investigation of threshold theory. Personal. Individ. Differ.40, 159–170 (2006). [Google Scholar]
- 12.Weiss, S., Steger, D., Schroeders, U. & Wilhelm, O. A reappraisal of the threshold hypothesis of creativity and intelligence. J. Intell.8, 38 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Park, G., Lubinski, D. & Benbow, C. P. Ability differences among people who have commensurate degrees matter for scientific creativity. Psychol. Sci.19, 957–961 (2008). [DOI] [PubMed] [Google Scholar]
- 14.Runco, M. A. & Albert, R. S. The threshold theory regarding creativity and intelligence: an empirical test with gifted and nongifted children. Creat. Child Adult Q.11, 212–218 (1986). [Google Scholar]
- 15.Benedek, M. On the relationship between creative potential and creative achievement: challenges and future directions. Learn. Individ. Differ.110, 102424 (2024). [Google Scholar]
- 16.Guilford, J. P. Creativity. Am. Psychol.5, 444–454 (1950). [DOI] [PubMed] [Google Scholar]
- 17.Cortes, R. A., Weinberger, A. B., Daker, R. J. & Green, A. E. Re-examining prominent measures of divergent and convergent creativity. Curr. Opin. Behav. Sci.27, 90–93 (2019). [Google Scholar]
- 18.Boden, M. A. The Creative Mind: Myths And Mechanisms (Routledge, 2004).
- 19.Amabile, T. M. The social psychology of creativity: a componential conceptualization. J. Personal. Soc. Psychol.45, 357 (1983). [Google Scholar]
- 20.Rhodes, M. An analysis of creativity. Phi Delta Kappan42, 305–310 (1961). [Google Scholar]
- 21.Rafner, J. et al. Towards game-based assessment of creative thinking. Creat. Res. J.35, 763–782 (2023). [Google Scholar]
- 22.Meincke, L., Girotra, K., Nave, G., Terwiesch, C. & Ulrich, K.T. Using large language models for idea generation in innovation. The Wharton School Research Paper Forthcominghttps://ssrn.com/abstract=4526071 (2024).
- 23.Si, C., Yang, D. & Hashimoto, T. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. The 13th International Conference on Learning Representations (2025).
- 24.Lee, B. C. & Chung, J. An empirical investigation of the impact of ChatGPT on creativity. Nat. Hum. Behav.8, 1906–1914 (2024). [DOI] [PubMed] [Google Scholar]
- 25.Cropley, D. Is artificial intelligence more creative than humans? ChatGPT and the divergent association task. Learn. Lett.2, 13–13 (2023). [Google Scholar]
- 26.Marrone, R., Cropley, D. & Medeiros, K. How does narrow AI impact human creativity? Creat. Res. J.https://psycnet.apa.org/doi/10.1080/10400419.2024.2378264 (2024).
- 27.Wenger, E. & Kenett, Y. We’re different, we’re the same: creative homogeneity across LLMs. Preprint at 10.48550/arXiv.2501.19361 (2025).
- 28.Doshi, A. R. & Hauser, O. P. Generative AI enhances individual creativity but reduces the collective diversity of novel content. Sci. Adv.10, https://api.semanticscholar.org/CorpusID:271119565 (2023). [DOI] [PMC free article] [PubMed]
- 29.Lu, Y. et al. Benchmarking language model creativity: a case study on code generation. In Proc. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Chiruzzo, L., Ritter, A. & Wang, L.) 2776–2794 (ACL, 2025).
- 30.Zhao, Y., Zhang, R., Li, W. & Li, L. Assessing and understanding creativity in large language models. Mach. Intell. Res.22, 417–436 (2025).
- 31.Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst.35, 24824–24837 (2022). [Google Scholar]
- 32.Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst.35, 22199–22213 (2022).
- 33.Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst.36, 11809–11822 (2023).
- 34.Li, X.-Y., Xue, J.-T., Xie, Z. & Li, M. Think outside the code: Brainstorming boosts large language models in code generation. Preprint at 10.48550/arXiv.2305.10679 (2023).
- 35.Lu, C. et al. The AI Scientist: Towards fully automated open-ended scientific discovery. Preprint at 10.48550/arXiv.2408.06292 (2024).
- 36.Hu, X. et al. Nova: An iterative planning and search approach to enhance novelty and diversity of LLM-generated ideas. Preprint at 10.48550/arXiv.2410.14255 (2024).
- 37.Baek, J., Jauhar, S. K., Cucerzan, S. & Hwang, S. J. Researchagent: Iterative research idea generation over scientific literature with large language models. In Proc. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Chiruzzo, L., Ritter, A. & Wang, L.) 6709–6738 (ACL, 2025).
- 38.Radensky, M. et al. Scideator: Human-LLM scientific idea generation grounded in research-paper facet recombination. Preprint at 10.48550/arXiv.2409.14634 (2024).
- 39.Gottweis, J. et al. Towards an AI co-scientist. Preprint at 10.48550/arXiv.2502.18864 (2025).
- 40.Wang, Q., Downey, D., Ji, H., & Hope, T. Scimon: Scientific inspiration machines optimized for novelty. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Ku, L.-W., Martins, A. & Srikumar, V.) 279–299 (ACL, 2024).
- 41.Gu, X. & Krenn, M. Interesting scientific idea generation using knowledge graphs and LLMs: evaluations with 100 research group leaders. Preprint at 10.48550/arXiv.2405.17044 (2024).
- 42.Gu, X. & Krenn, M. Forecasting high-impact research topics via machine learning on evolving knowledge graphs. Mach. Learn.: Sci. Technol.6, 025041 (2025).
- 43.Wang, W. et al. SciPIP: An LLM-based scientific paper idea proposer. Preprint at 10.48550/arXiv.2410.23166 (2024).
- 44.Pu, K. et al. Ideasynth: Iterative research idea development through evolving and composing idea facets with literature-grounded feedback. In Proc. 2025 CHI Conference on Human Factors in Computing Systems (eds Yamashita, N. et al.) 1–31 (2025).
- 45.Dubois, Y., Galambosi, B., Liang, P. & Hashimoto, T. B. Length-controlled AlpacaEval: a simple way to debias automatic evaluators. Preprint at 10.48550/arXiv.2404.04475 (2024).
- 46.Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst.36, 46595–46623 (2023). [Google Scholar]
- 47.Li, T. et al. From crowdsourced data to high-quality benchmarks: Arena-Hard and Benchbuilder pipeline. In Proc. 42nd International Conference on Machine Learning (Des. Singh, A. et al.) Vol. 267, 34209–34231 (PMLR, 2025).
- 48.Chiang, C.-H. & Lee, H.-y. Can large language models be an alternative to human evaluations? In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)(eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 15607–15631 (ACL, 2023).
- 49.Chen, Y., Wang, R., Jiang, H., Shi, S. & Xu, R. Exploring the use of large language models for reference-free text quality evaluation: an empirical study. In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), 361–374 (ACL, 2023).
- 50.Huang, F., Kwak, H., Park, K. & An, J. ChatGPT rates natural language explanation quality like humans: but on which scales? In Proc. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (eds Calzolari, N. et al.) 3111–3132 https://aclanthology.org/2024.lrec-main.277/ (ELRA and ICCL, 2024).
- 51.Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl. Acad. Sci. USA 120, e2305016120 (2023).
- 52.Verga, P. et al. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. Preprint at 10.48550/arXiv.2404.18796 (2024).
- 53.Badshah, S. & Sajjad, H. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA. In Proc. 9th Widening NLP Workshop (eds Zhang, C. et al.) 251–267 (ACL, 2025).
- 54.Kim, S. et al. Prometheus 2: An open source language model specialized in evaluating other language models. In Proc. 2024 Conference on Empirical Methods in Natural Language Processing (eds Al-Onaizan, Y., Bansal, M. & Chen, Y.-N.) 4334–4353 (ACL, 2024).
- 55.Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).
- 56.Rafner, J. et al. Mapping citizen science through the lens of human-centered AI. Hum. Comput. 9, 66–95 (2022).
- 57.Akata, Z. et al. A research agenda for hybrid intelligence: augmenting human intellect with collaborative, adaptive, responsible, and explainable artificial intelligence. Computer 53, 18–28 (2020).
- 58.Dellermann, D., Ebel, P., Söllner, M. & Leimeister, J. M. Hybrid intelligence. Bus. Inf. Syst. Eng. 61, 637–643 (2019).
- 59.White, C. et al. LiveBench: A challenging, contamination-limited LLM benchmark. The 13th International Conference on Learning Representations (2025).
- 60.Jain, N. et al. LiveCodeBench: holistic and contamination free evaluation of large language models for code. The 13th International Conference on Learning Representations (2025).
- 61.Wei, J., Huang, D., Lu, Y., Zhou, D., & Le, Q. V. Simple synthetic data reduces sycophancy in large language models. Preprint at https://arxiv.org/abs/2308.03958 (2024).
- 62.Gu, J. et al. A survey on LLM-as-a-Judge. The Innovation 10.1016/j.xinn.2025.101253 (2024).
- 63.Fanous, A. et al. Syceval: Evaluating llm sycophancy. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 8, 893–900 (2025).
- 64.Huang, L. et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 1–55 (2025).
- 65.Falk, J. et al. How do hackathons foster creativity? towards ai collaborative evaluation of creativity at scale. In Proc. 2025 CHI Conference on Human Factors in Computing Systems (eds Yamashita, N. et al.) 198 (ACM, 2025).
- 66.DiStefano, P. V. et al. Evaluating AI’s ideas: the role of individual creativity and expertise in human-AI co-creativity. Preprint at https://osf.io/preprints/psyarxiv/k2u87 (2025).
- 67.Sourati, J. & Evans, J. A. Accelerating science with human-aware artificial intelligence. Nat. Hum. Behav. 7, 1682–1696 (2023).
- 68.Lacoste, A., Luccioni, A., Schmidt, V. & Dandres, T. Quantifying the carbon emissions of machine learning. Preprint at 10.48550/arXiv.1910.09700 (2019).
- 69.Rince, S. & Banse, A. EcoLogits: evaluate the environmental impacts of generative AI. J. Open Source Softw. 10, 7471 (2025).
- 70.Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. Preprint at https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf (2024).
- 71.Achiam, J. et al. GPT-4 technical report. Preprint at 10.48550/arXiv.2303.08774 (2023).
- 72.Team, G. et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Preprint at 10.48550/arXiv.2403.05530 (2024).
- 73.Bai, J. et al. Qwen technical report. Preprint at 10.48550/arXiv.2309.16609 (2023).
- 74.Qwen et al. Qwen2.5 technical report. Preprint at https://arxiv.org/abs/2412.15115 (2025).
- 75.Liu, A. et al. Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. Preprint at 10.48550/arXiv.2405.04434 (2024).
- 76.DeepSeek-AI. DeepSeek-V3 technical report. Preprint at 10.48550/arXiv.2412.19437 (2025).
- 77.Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
- 78.Dubey, A. et al. The Llama 3 herd of models. Preprint at 10.48550/arXiv.2407.21783 (2024).
- 79.Jiang, A. Q. et al. Mistral 7B. Preprint at 10.48550/arXiv.2310.06825 (2023).
- 80.Amazon Artificial General Intelligence. The Amazon Nova Family of Models: Technical Report and Model Card. Preprint at 10.48550/arXiv.2506.12103 (2024).
- 81.Abdin, M. et al. Phi-4 technical report. Preprint at 10.48550/arXiv.2412.08905 (2024).
- 82.Beltagy, I., Lo, K., & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K., Jiang, J., Ng, V. & Wan, X.) 3615–3620 (ACL, 2019).
- 83.Cohen, E. The boundary lens: Theorising academic activity. In: The University and its Boundaries: Thriving or Surviving in the 21st Century, 14–41 (Routledge, 2021).
Data Availability Statement
The data generated in this study have been deposited on Hugging Face at https://huggingface.co/datasets/6cf/liveideabench-v2 (also see the Zenodo repository at 10.5281/zenodo.17707879).
All source code used to reproduce the results in this study is available on GitHub at https://github.com/x66ccff/liveideabench (also see the Zenodo repository at 10.5281/zenodo.17707647).






