National Science Review. 2025 Jul 14;12(8):nwaf271. doi: 10.1093/nsr/nwaf271

Synergizing a knowledge graph and large language model for relay catalysis pathway recommendation

Fei Fu 1,d, Qing-Qing Li 2,d, Fangrong Wang 3, Jie Hu 4, Tian-Tian Wang 5, Yun-Pei Liu 6, Weihong Xu 7, Zhili Lin 8, Fu-Qiang Gong 9, Qi-Yuan Fan 10,11, Jeff Z Pan 12, Ye Wang 13, Jun Cheng 14,15,16
PMCID: PMC12374726  PMID: 40860016

ABSTRACT

Relay catalysis integrates multiple catalytic reactions to efficiently transform intermediates and enhance conversion and selectivity. However, designing these pathways and multifunctional catalysts is often lengthy and costly, heavily relying on in-depth literature analysis by experienced researchers. To address this, we developed an approach that combines a knowledge graph (KG) and large language models (LLMs) to automatically recommend multistep catalytic reaction pathways. Our method involves using an LLM-assisted workflow for data acquisition and organization, followed by the construction of a detailed catalysis knowledge graph (Cat-KG). After querying the Cat-KG, promising relay catalysis pathways are identified by applying scoring rules informed by expertise in relay catalysis. The LLM then transforms the structured pathways and reaction condition data into readable chemical equations and descriptions for chemists. This step integrates catalysis knowledge from the Cat-KG and helps avoid LLM-induced hallucinations by using reliable information. The method efficiently recommended relay catalysis pathways for ethylene, ethanol, 2,5-furandicarboxylate and other targets within minutes, identifying pathways consistent with reported ones while using different reaction conditions, validating its effectiveness. Thus, this strategy can extrapolate known and novel relay catalysis pathways, showcasing its potential for application in pathway selection.

Keywords: relay catalysis, knowledge graph, large language model, generative pre-trained transformer

INTRODUCTION

Relay catalysis, a strategy that integrates multiple catalytic reactions into a single, multifunctional system, has gained attention in the field of catalysis [1–5]. This approach boosts thermodynamic and kinetic efficiencies in multistep processes through simultaneous co-catalysis, outperforming traditional single-function catalysts in terms of synthesis efficiency, selectivity and atom economy, while reducing costs, optimizing energy and enhancing environmental sustainability [2]. Relay catalysis has shown promising results in fields like Fischer-Tropsch synthesis [6,7], where the design concept of oxide-zeolite composite catalysts has overcome selectivity limitations imposed by the Anderson–Schulz–Flory distribution [5], achieving remarkable selectivity in the direct conversion of syngas to mixed light olefins [8], ethylene [1] and ethanol [9,10]. However, the process of identifying individual catalytic steps and combining them in a coordinated manner, along with determining the optimal reaction conditions, is still a lengthy and challenging endeavor [10]. This complexity arises from three main challenges: the dispersion of relevant knowledge and data across various sources, which complicates data collection; the inefficiency of human analysis in matching catalysts and reaction conditions, which makes the process time-consuming and subjective; the complexity of validating proposed pathways, which requires costly and comprehensive experimental efforts. These challenges make relay catalysis research both demanding and resource-intensive.

Knowledge graphs (KGs) emerge as a promising tool to address these challenges [11]. In KGs, information is represented as a network of interconnected nodes and edges, where nodes represent entities (such as reactants, catalysts, solvents and products), and edges depict the relationships between these entities. This structure has proven successful in various fields [12,13]. Grzybowski and co-workers have utilized networks to encapsulate knowledge in organic reactions and molecular structures, mapping expansive networks of chemical reactions [14–16]. These networks, which can be considered as precursors to modern KGs, facilitated the identification of key molecules and the discovery of synthetic pathways, including pathways optimized for one-pot synthesis [17]. KGs have also been applied to catalyst prediction and reaction condition optimization in organic reactions, demonstrating their unique potential in supporting chemical reaction design [18]. KGs can be tailored to address specific challenges in relay catalysis by offering advantages in three areas: (i) efficiently consolidating dispersed data from various resources, e.g. literature, books and chemists’ experiential knowledge [19]; (ii) simplifying complex aspects of relay catalysis into a coherent knowledge base, addressing inefficiencies in manual analysis [16,19–21]; (iii) supporting reasoning [11,19,22] and predictive tasks, aligning with the complexities in pathway validation, thus reducing time and resources required in experimental validation. Compared to static relational databases and flat data storage systems, KGs offer dynamic adaptability and semantic richness, enabling them to evolve with new information [11,21,23]. Therefore, we propose constructing a KG tailored specifically for the intricate aspects of relay catalysis to provide a strategic tool for exploring and identifying effective catalytic reaction pathways and conditions.

Constructing a KG involves several steps, including schema design, data collection, data cleaning and knowledge fusion [24,25], and knowledge updating [11,21,23]. Selecting appropriate data sources is crucial for the quality of the KG. Existing databases relevant to chemistry, catalysis and materials science, such as the Open Reaction Database [26,27], Reaxys [28] and SciFinder [29], were assessed. While these resources offer high-quality data and extensive information on substance transformations, selectivity yields and reaction conditions, they lack detailed information on catalyst synthesis methods and the physical and chemical properties of catalysts. This gap is particularly significant in relay catalysis, where detailed catalyst information is essential for designing multifunctional catalysts and generating effective pathways. To bridge these information gaps, we propose collecting in-depth, reliable data on catalytic reactions from high-quality scientific publications to construct a KG. This method aims to enrich the knowledge base with precise experimental conditions and catalyst specifics, enhancing data accuracy and integrity. The KG approach offers several advantages, such as integrating real-time updates [11], storing reaction information in a structured manner [21] and enabling advanced reasoning capabilities [19,30]. Implementing this approach facilitates the construction of a high-quality catalysis KG, thereby bolstering relay catalysis research.

Extracting high-quality catalytic reaction data from scientific literature presents multiple challenges. These challenges include the lack of standardization and organization due to various formats and methods employed by different research groups and publications. Additionally, the wide range of reaction types and conditions complicates data extraction, and a deep understanding of chemistry is needed to accurately distinguish between target reactions and similar reactions [31,32]. Traditional data acquisition relies on pre-trained language models (PLMs) [31,33] for tasks such as named entity recognition, relationship extraction and text classification. However, this process depends on manual annotation, pre-training and fine-tuning of models, leading to low efficiency, strong subjectivity and challenges in ensuring quality knowledge extraction, while also requiring extensive programming and data science knowledge [31]. In contrast, the emergence of advanced large language models (LLMs) such as generative pre-trained transformers (GPTs) [34], Claude 3.5, Gemini [35] and GPT-4 [36] marks a significant shift in the era of AI-generated content. These models possess robust general knowledge capabilities and an understanding of domain-specific content [21,37], demonstrating potential in simplifying the extraction of chemical data [38,39]. Unlike PLMs, LLMs can utilize semantic understanding and generation to extract knowledge, improving the accuracy and coverage of knowledge extraction, while significantly reducing the time and effort required for data annotation, making it possible to construct a high-quality, large-scale catalysis KG at low cost [21,30].

In this research, we present a workflow that combines the strengths of a KG and an LLM to efficiently recommend relay catalysis pathways. Our approach consists of two main phases: constructing the catalysis knowledge graph (Cat-KG) and using it to recommend reaction pathways (Fig. 1). For the Cat-KG construction, we use Gemini’s data extraction abilities to gather comprehensive catalytic reaction data from literature. The resulting Cat-KG covers 15 881 publications and 27 760 thermocatalytic reactions, of which 18 174 are heterogeneous catalysis reactions. We then use the Cat-KG to identify promising relay catalysis pathways by searching through it and applying scoring rules to filter out high-quality pathways and their key reaction information from numerous candidates obtained. Then, GPT-4 is used to convert the structured text into natural language descriptions and easily readable chemical equations. This KG-enhanced LLM approach reduces hallucinations, ensures recommendation effectiveness and maintains traceability to the original literature. Our method recommended reaction pathways for ethylene, ethanol and 2,5-furandicarboxylate, including pathways consistent with reported literature, using different reaction conditions, and achieving results significantly faster than manual efforts. This strategy enhances the exploration of reaction pathways by combining expert-designed scoring rules, KG and LLM, enabling chemists to efficiently identify known effective pathways and discover new relay catalysis pathways.

Figure 1.


Workflow diagram of the construction of the Cat-KG and application for relay catalysis. The initial phase involves the selection of a catalytic schema and the acquisition of data from catalysis-centric literature to establish the Cat-KG (blue), followed by the recommendation of a reaction pathway (purple). Shared steps are marked with the corresponding color.

RESULTS AND DISCUSSION

Framework construction

Our workflow harnesses the combined strengths of KGs and LLMs to efficiently generate and evaluate relay catalysis pathways, as shown in Fig. 1. We begin by constructing the Cat-KG using an LLM-assisted methodology to accurately gather and structure catalytic reaction data from the literature, transforming scattered information into a coherent catalysis database. The Cat-KG provides detailed, traceable and structured catalytic reaction information. Next, the Cat-KG is used to search for potential pathways. Manually designed scoring rules are applied to filter and recommend high-quality paths, which are then output as structured and organized information. LLMs are then employed to convert the structured recommended relay catalysis pathway information into a more accessible natural language format, ensuring both accuracy and readability. This workflow highlights the unique capabilities of both KGs and LLMs, demonstrating how their integration can advance catalysis research by streamlining the data collection and analysis process.

Knowledge graph construction

KGs can be constructed using two main methodologies: top down and bottom up [40]. The top-down method starts by defining a schema with high-level concepts, which are then refined into a detailed taxonomy and linked with specific entities like catalytic reactions and catalysts. This method is particularly useful for building domain-specific KGs. Given the interconnected nature of reactants, catalysts, solvents and products, the top-down approach was selected for the construction of the Cat-KG. As illustrated in Fig. 1, the construction process involved schema design, data collection, cleaning, entity disambiguation, data storage and knowledge updating. These steps resulted in a comprehensive and accurate representation of catalytic reactions and catalysts.

To construct a detailed and professional KG in the catalysis domain, a comprehensive schema was developed to represent the highest-level concepts in this field. The schema was organized into a structured hierarchy consisting of five essential classes: reaction, reactant, catalyst, solvent and product. These classes are interconnected through four types of relationships, as illustrated in Fig. 2, and the schema encompasses 29 attributes. Each class and its relationships are closely linked, ensuring that all critical catalysis-related data are accurately captured. This well-defined schema serves as the foundation for the construction of the Cat-KG, allowing for the detailed representation and analysis of catalytic reactions within this structured framework.
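The schema described above can be pictured as a small set of typed records. The following is a minimal sketch only: the five class names come from the text, whereas every attribute name, every relationship name and the assumption that each of the four relationship types links a reaction to one of the other four classes are hypothetical placeholders, not the published schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Class names follow the text; attribute and relationship names are hypothetical.

@dataclass(frozen=True)
class Substance:
    name: str
    role: str  # one of: "reactant", "catalyst", "solvent", "product"

@dataclass
class Reaction:
    reaction_id: str
    doi: str                                # keeps the record traceable to its source
    temperature_k: Optional[float] = None   # one illustrative attribute out of 29
    participants: list[Substance] = field(default_factory=list)

# Assumed: one relationship type per non-reaction class (four in total).
RELATIONSHIP_BY_ROLE = {
    "reactant": "HAS_REACTANT",
    "catalyst": "HAS_CATALYST",
    "solvent": "HAS_SOLVENT",
    "product": "HAS_PRODUCT",
}

def edges(reaction: Reaction) -> list[tuple[str, str, str]]:
    """Return (reaction_id, relationship, substance_name) triples."""
    return [
        (reaction.reaction_id, RELATIONSHIP_BY_ROLE[s.role], s.name)
        for s in reaction.participants
    ]
```

Keeping the source DOI on every reaction record is what would allow a recommended step to be traced back to its publication.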

Figure 2.


Schema of the Cat-KG. The catalytic reaction data structure utilized in this study encompasses five core schema classes and 29 unique attributes to thoroughly represent the complexity of catalytic processes.

To obtain key catalytic reaction information, a workflow was designed to efficiently gather high-quality data, as shown in Fig. 3. Additionally, the Chem-Brain platform (see Fig. S48) was developed and implemented to manage the catalysis literature and their reaction data, facilitate data sharing among researchers and support interactive applications with the Cat-KG.

Figure 3.


Workflow of catalysis data acquisition and automated extraction. Panel (a) outlines the process of extracting and validating structured data from the literature, employing a Python-based script for parsing and an LLM for ensuring the reliability of the data. Panel (b) delineates the automated sequence for extracting catalytic reaction data from literature information in the Chem-Brain system, which includes generating a reaction overview and reaction details via an LLM, followed by the data storage phase.

With the support of the Chem-Brain data management system, we collected, processed and stored data from peer-reviewed publications that contain comprehensive information on catalytic reactions. We gathered 15 881 publications related to thermal catalysis from reputable publishers (see the Methods section within the online supplementary material). We then designed a literature parsing process to automatically extract relevant information on catalysis quality and details from these publications. This included information such as publication date, journal name and citation count, which reflect the reliability and relevance of the reactions, as well as text from abstracts, introductions, main texts, figure captions and table notes, which provide specific reaction parameters.

To extract complete and high-quality catalytic reaction data from the literature, we developed a workflow as shown in Fig. 3. LLMs have demonstrated outstanding performance in general chemical knowledge understanding and reasoning-based question answering [41,42], chemical experiment planning and design [43–45], as well as in literature information integration and data extraction from chemical texts [21,38,46–48]. LLMs can gather a large amount of data more cost-effectively and efficiently than traditional manual annotation and named entity recognition techniques. However, previous reports in the chemical field usually extract data from single paragraphs [38,49]. This approach is not suitable for extracting detailed catalytic reaction data because it does not capture comprehensive information. For example, product yields are typically reported in the results section, while key information about the catalyst preparation method and catalyst support is found in the methods section. Therefore, single-paragraph extraction methods are insufficient for comprehensive data collection, prompting the need to explore full-text extraction.

Full-text extraction introduces additional challenges, primarily due to the decreased performance of LLMs with longer prompts. To overcome these issues, three strategies were implemented. First, a sequential extraction process was employed, dividing the task into two steps to reduce complexity and improve the results, as shown in Fig. 3b. Second, the Gemini 1.5 Flash model, which is well suited for this task, was utilized to improve the understanding of long texts and enhance extraction accuracy. Lastly, prompt engineering was applied to design reliable prompts that ensure consistency and completeness in the extracted data (details provided in the Methods section within the online supplementary material).
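The two-step sequential extraction can be sketched as follows. This is an illustration under stated assumptions, not the prompts or code used in this work: the prompt wording, the key list `DETAIL_KEYS` and the `call_llm` callable (any function that sends a prompt to a model and returns its text reply) are all hypothetical.

```python
import json
from typing import Callable

# Hypothetical prompts; the actual prompts are given in the supplementary Methods.
OVERVIEW_PROMPT = (
    "List every catalytic reaction in the article below as a JSON array of "
    'objects with keys "reaction_id" and "summary".\n\nARTICLE:\n{text}'
)
DETAIL_PROMPT = (
    "For the reaction described as: {summary}\n"
    "return a JSON object with the keys {keys} using the article below.\n\n"
    "ARTICLE:\n{text}"
)
DETAIL_KEYS = ["reactants", "catalyst", "products", "temperature"]  # hypothetical subset

def extract_reactions(text: str, call_llm: Callable[[str], str]) -> list[dict]:
    """Step 1: reaction overview; step 2: one detail query per reaction."""
    overview = json.loads(call_llm(OVERVIEW_PROMPT.format(text=text)))
    records = []
    for item in overview:
        prompt = DETAIL_PROMPT.format(
            summary=item["summary"], keys=DETAIL_KEYS, text=text
        )
        details = json.loads(call_llm(prompt))
        # Keep only the requested keys so downstream parsing sees a fixed shape.
        record = {key: details.get(key) for key in DETAIL_KEYS}
        record["reaction_id"] = item["reaction_id"]
        records.append(record)
    return records
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM interface, in line with the modular design described in the text.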

The automated catalysis data acquisition and extraction workflow shown in Fig. 3 was used to process 15 881 catalysis articles, resulting in the extraction of 27 760 thermocatalytic reactions, including 18 174 heterogeneous catalysis reactions. Thanks to the implementation of the full-text extraction strategy and the methods that improved its performance, the workflow was able to capture reactions with rich key information. The extracted literature information, reaction overviews and reaction details were first structured in JSON format (Figs S50 and S52) using automated scripts and then systematically stored and organized in the Chem-Brain platform.

To evaluate the effectiveness and quality of data extraction, a representative set of reactions was selected from the reaction pathways of 10 different compounds, resulting in 44 reactions, as detailed in Section S2 within the online supplementary material. By calculating and analyzing the cumulative precision (P) and recall (R) curves, we observed minimal fluctuations, indicating high stability. The calculated values of P, R and F1, along with their standard errors, were as follows: mean precision, 91.49% (SE 1.07%); mean recall, 91.18% (SE 1.04%); mean F1 score, 0.9113 (SE 0.0086). These results demonstrate that the data extraction strategy is effective, providing high-value data. The detailed data and the methods for calculating P, R and F1 can be found in Section S2 within the online supplementary material.
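For readers who wish to reproduce this kind of evaluation, the metrics can be computed per reaction and then averaged. The sketch below assumes that each reaction is scored by comparing the set of extracted (field, value) pairs with a manually curated reference set, and that the standard error is the sample standard deviation divided by the square root of the number of reactions; the procedure actually used is the one described in Section S2.

```python
from math import sqrt
from statistics import mean, stdev

def prf(extracted: set, reference: set) -> tuple[float, float, float]:
    """Precision, recall and F1 for one reaction's extracted fields."""
    true_pos = len(extracted & reference)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def mean_and_se(values: list[float]) -> tuple[float, float]:
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a standard error")
    return mean(values), stdev(values) / sqrt(len(values))
```

Because F1 is averaged over reactions in this sketch, the mean F1 need not equal the harmonic mean of the mean precision and mean recall.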

Despite the use of the advanced LLM, Gemini, some issues such as errors, omissions, duplicates and inconsistent formats were observed, which are common when dealing with large datasets [21,42]. To address these, a data cleaning and entity disambiguation process was implemented, incorporating a rule-based automated system to improve efficiency. This process significantly improved data quality, and the cleaned and disambiguated data were stored in structured JSON format, providing a foundation for the construction of the KG and subsequent search applications. In addition, the entire automation process is modular, allowing each component to be updated or replaced as needed. For example, as LLMs evolve rapidly, the system can switch to different LLM application programming interfaces to take advantage of newer models. Carefully designed prompts restrict the output format, data structure and key-value types, and robust parsing logic ensures consistency despite stylistic differences across models. These design choices support seamless LLM replacement without disrupting the Cat-KG construction pipeline.
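A rule-based cleaning and disambiguation step of this kind might look like the sketch below. The alias table, the normalization rules and the duplicate key (DOI, product, catalyst) are illustrative assumptions; the rules actually used are described in the supplementary material.

```python
import re

# Hypothetical alias table: canonical name -> known surface forms (lower case).
ALIASES = {
    "methanol": {"methanol", "meoh", "ch3oh", "methyl alcohol"},
    "dimethyl ether": {"dimethyl ether", "dme", "ch3och3"},
    "5-hydroxymethylfurfural": {"5-hydroxymethylfurfural", "hmf", "5-hmf"},
}
LOOKUP = {alias: canonical for canonical, forms in ALIASES.items() for alias in forms}
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

def normalize(name: str) -> str:
    """Fold Unicode subscript digits, collapse whitespace and lower-case."""
    return re.sub(r"\s+", " ", name.translate(SUBSCRIPTS)).strip().lower()

def canonical_name(name: str) -> str:
    """Map a surface form to its canonical name, or return the normalized form."""
    key = normalize(name)
    return LOOKUP.get(key, key)

def deduplicate(records: list[dict]) -> list[dict]:
    """Drop records that coincide after canonicalizing product and catalyst."""
    seen, unique = set(), []
    for record in records:
        product = canonical_name(record["product"])
        key = (record["doi"], product, normalize(record["catalyst"]))
        if key not in seen:
            seen.add(key)
            unique.append({**record, "product": product})
    return unique
```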

To effectively transform the extracted data into a visual KG, Neo4j [40,50], a high-performance graph database designed to handle complex network relationships, was selected. Using this approach, data from 15 881 catalysis publications were successfully converted into a Neo4j-based KG (see the Methods section within the online supplementary material for a detailed construction). This KG not only improved data accessibility and visualization, but also provided a solid foundation for complex queries and analyses.
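As an illustration of how a cleaned JSON record could be mapped onto a graph database, the sketch below generates parameterized Cypher `MERGE` statements without executing them. The node labels follow the schema classes named in the text; the relationship types, property names and record keys are hypothetical, and the construction actually used is described in the supplementary Methods.

```python
# field in the JSON record -> (node label, relationship type).
# Labels follow the schema classes; relationship types are hypothetical.
EDGE_TYPES = {
    "reactants": ("Reactant", "HAS_REACTANT"),
    "catalysts": ("Catalyst", "HAS_CATALYST"),
    "solvents": ("Solvent", "HAS_SOLVENT"),
    "products": ("Product", "HAS_PRODUCT"),
}

def merge_statements(record: dict) -> list[tuple[str, dict]]:
    """Return (Cypher, parameters) pairs that upsert one reaction record."""
    reaction_id = record["reaction_id"]
    statements = [(
        "MERGE (r:Reaction {id: $id}) SET r.doi = $doi",
        {"id": reaction_id, "doi": record["doi"]},
    )]
    for field, (label, rel_type) in EDGE_TYPES.items():
        for name in record.get(field, []):
            # Labels and relationship types come from the fixed table above;
            # all values from the data are passed as query parameters.
            query = (
                "MATCH (r:Reaction {id: $id}) "
                f"MERGE (n:{label} {{name: $name}}) "
                f"MERGE (r)-[:{rel_type}]->(n)"
            )
            statements.append((query, {"id": reaction_id, "name": name}))
    return statements
```

Using `MERGE` rather than `CREATE` means that loading the same record twice does not duplicate nodes or edges, which suits a KG that is updated incrementally. With the official Neo4j Python driver, each pair could be passed to `session.run(query, parameters)`.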

Currently, we have successfully established a comprehensive Cat-KG in the field of catalysis on Neo4j, adhering to the construction process outlined earlier (details in the Methods section within the online supplementary material). The Cat-KG is continuously updated and expanded with newly extracted catalytic reaction data through a dynamic update mechanism, which ensures that the KG remains up to date with the evolving body of literature. As part of this update process, a rule-based entity disambiguation procedure was implemented to resolve inconsistencies and ambiguities in chemical entities across different sources, thereby improving data quality, consistency and interoperability. The overall design, implementation strategies and advantages of the dynamic update mechanism, including the integrated entity disambiguation workflow, are described in detail in Section S3 within the online supplementary material.

At present, the Cat-KG encompasses 15 881 catalysis publications and 27 760 thermocatalytic reactions. The reaction information includes five schema classes, four types of relationships and 29 key catalytic attributes. Its extensive coverage of catalytic reactions provides detailed information for downstream applications, offering a wealth of catalysis knowledge and supporting logical reasoning. Moreover, the Cat-KG assists chemical researchers in querying and retrieving catalytic reaction information, with full traceability to the original literature sources. For instance, using the Chem-Brain platform (https://ai4ec.ikkem.com/apps/chembrain), users can search for catalytic chemical reactions based on catalysts, reactants and products. All data sources used to construct the Cat-KG were obtained through authorized and compliant means; for more details, see the LLM-Assisted Presentation of Relay Catalysis Pathway subsection under the Methods section within the online supplementary material. These features make the Cat-KG a valuable tool for exploring and analyzing catalysis.

Relay catalysis application

To efficiently identify and recommend potential relay catalysis pathways, an automated recommendation system was developed using the Cat-KG (Fig. 4). The system is designed to suggest possible pathways and reaction conditions based on existing knowledge, serving as a foundation for further experimental exploration. Chemists can then use these recommendations as a starting point to refine and optimize reaction parameters through subsequent experiments.

Figure 4.


Relay catalysis pathway query, filtering and recommendation process using the Cat-KG. Panel (a) represents the schematic of the Cat-KG, where candidate pathways are generated via specific KG queries. Panel (b) represents the filtering and prioritizing of these pathways using scoring rules. Panel (c) presents the recommended relay catalysis pathways, summarized and formatted by an LLM for clarity.

Specifically, the workflow integrates the traceability of reaction information from the Cat-KG, the interpretability of chemistry-based filtering rules and the language processing capabilities of LLMs [21,30]. It automatically generates and executes Neo4j queries through Python scripts, ranks the identified pathways and processes the top-ranked results into structured JSON format. These structured data are then converted into natural language descriptions by an LLM, making the information more accessible for researchers.

First, Cypher, the query language used for interacting with graph databases such as Neo4j [50], is used to search the Cat-KG for candidate relay catalysis pathways based on a specified target product and pathway length. Users can also specify reaction conditions, starting materials or intermediates to refine the search. For example, when querying crotonaldehyde as the target product with H₂ and CO as starting materials and a pathway length limit of three steps (Fig. 4), the system can identify over 93 000 candidate pathways within a minute. To enhance usability, we developed a user-friendly query interface that allows users to specify pathway constraints (Fig. S57). The recommendation system automatically generates and executes the corresponding Cypher queries based on user input, enabling researchers to interact with the system without needing to learn Cypher syntax. This approach facilitates more precise searching and recommendation of expert-preferred relay catalysis pathways, while also improving the objectivity and customizability of the recommendation system.
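The logic of such a query (target product, starting materials and a maximum pathway length) can be illustrated in plain Python on a handful of in-memory records. This sketch is not the Cypher query used in this work: it assumes linear pathways in which each step has at most one reactant that is not a starting material, and the record keys (`id`, `reactants`, `products`) are hypothetical.

```python
def find_pathways(reactions, target, starting_materials, max_steps):
    """Enumerate linear pathways (lists of reaction IDs) that end at `target`.

    A pathway is complete when every reactant of its first step is a starting
    material; otherwise the single missing reactant becomes the next goal.
    """
    by_product = {}
    for rxn in reactions:
        for product in rxn["products"]:
            by_product.setdefault(product, []).append(rxn)
    feed = set(starting_materials)
    found = []

    def walk_back(goal, later_steps, seen):
        if len(later_steps) == max_steps:
            return  # pathway length limit reached
        for rxn in by_product.get(goal, []):
            missing = [r for r in rxn["reactants"] if r not in feed]
            steps = [rxn["id"]] + later_steps
            if not missing:
                found.append(steps)
            elif len(missing) == 1 and missing[0] not in seen:
                # Avoid cycles by never revisiting an intermediate.
                walk_back(missing[0], steps, seen | {missing[0]})

    walk_back(target, [], {target})
    return found
```

Searching backwards from the target keeps the enumeration focused on reactions that can actually contribute to the requested product; the number of candidates still grows quickly with pathway length, which is why a subsequent ranking step is needed.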

However, many of the candidate pathways are impractical due to incompatibilities in reaction conditions. Issues such as large temperature differences or the presence of incompatible additives (for example, O₂ and H₂) in the same pathway are common. To address this, a set of scoring rules (Fig. 4b) was designed to evaluate the validity and feasibility of the pathways, based on our understanding and experience in relay catalysis design. These rules assess factors such as the reliability of the reaction sources, the compatibility of key reaction conditions and catalysts in adjacent steps, and the overall atom economy and reaction phase (details in the Methods section within the online supplementary material). A Python script computes scores for each pathway, retaining only the top-ranked candidates that merit further study (Fig. S56).
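To make the idea of rule-based scoring concrete, the sketch below combines three of the factors mentioned above: temperature compatibility of adjacent steps, incompatible additives and source reliability. The tolerance, the weights, the record keys and the linear form of the temperature term are hypothetical choices for illustration; the rules actually applied are given in the supplementary Methods.

```python
INCOMPATIBLE_ADDITIVES = [{"O2", "H2"}]   # example pair named in the text
MAX_TEMPERATURE_GAP_K = 100.0             # hypothetical tolerance

def temperature_score(steps):
    """1.0 when adjacent steps share a temperature, falling linearly to 0.0."""
    gaps = [abs(a["temperature_k"] - b["temperature_k"])
            for a, b in zip(steps, steps[1:])]
    if not gaps:
        return 1.0
    return max(0.0, 1.0 - max(gaps) / MAX_TEMPERATURE_GAP_K)

def additives_compatible(steps):
    """False if any incompatible pair occurs anywhere in the pathway."""
    used = set().union(*(step["additives"] for step in steps))
    return not any(pair <= used for pair in INCOMPATIBLE_ADDITIVES)

def score_pathway(steps):
    """Combine the rules into one score in [0, 1] (hypothetical weights)."""
    if not additives_compatible(steps):
        return 0.0
    reliability = sum(s["source_reliability"] for s in steps) / len(steps)
    return 0.5 * temperature_score(steps) + 0.5 * reliability

def top_pathways(pathways, k):
    """Keep the k highest-scoring pathways."""
    return sorted(pathways, key=score_pathway, reverse=True)[:k]
```

Treating incompatible additives as a hard veto and temperature mismatch as a graded penalty is one possible design; it reflects the distinction between pathways that cannot share a reactor and those that merely operate away from their individual optima.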

Once the top pathways are identified, an LLM is employed to generate readable descriptions of the pathways in natural language. Using the pathway IDs, detailed information for each reaction step is traced from Chem-Brain. Since the original JSON format is difficult to interpret (Fig. S52), the LLM’s natural language generation capabilities are used to convert the structured data into readable chemical equations and descriptions. From various available LLMs, GPTs were chosen for this task, though other LLMs could also perform it. A tool, the relay catalysis analyzer, was developed within GPTs (Figs S3 and S7), which automatically formats the information into a clear and understandable structure. For example, the recommended pathway for crotonaldehyde (Fig. 4c) presents key conditions, a compatibility analysis of reactions and highlights for each reaction step. This process makes it easier for researchers to quickly understand the essential details of each pathway.

The pathway recommendation system developed in this study combines the traceability of reaction data from the Cat-KG, the interpretability of chemistry-based scoring rules and the language processing power of LLMs. This integration enables the system to quickly recommend high-value pathways aligned with specific goals. Unlike traditional reaction databases such as Reaxys [28] or SciFinder [29], which focus mainly on single-step reactions and molecular structure retrieval, our Cat-KG-based approach is naturally flexible and can be used in different types of catalytic applications. In this study, we focus on relay catalysis by adding customized workflows and evaluation methods to support the recommendation of multistep pathways. These include a scoring and filtering system that checks whether the catalysts and reaction conditions are suitable across different steps. Furthermore, the Cat-KG captures detailed information on catalyst synthesis and characterization—data often missing from existing resources, but essential for the development of multifunctional catalytic systems. Combined with the generative capabilities of LLMs, this structured knowledge enables the system to produce readable and traceable pathway summaries, helping chemists quickly understand, verify and explore new relay catalytic strategies.

While LLMs are good at summarizing and explaining chemical information in natural language, they still have limitations when dealing with complex chemistry [21,30,41,42]. These include bias from training data and limited access to structured, trustworthy chemical information. As a result, LLMs may produce content that appears correct but is not fully reliable [21,42] (Section S4 within the online supplementary material). To reduce this problem, we provide the LLM with structured JSON data that includes detailed reaction information and IDs from the Cat-KG. This helps the model generate readable, traceable and more accurate descriptions. However, mistakes can still happen. Therefore, for any pathways that researchers find valuable, the outputs are manually checked by tracing the IDs back to the original Cat-KG entries and related publications. This process ensures that the information is correct and supports further study. For more details, see the LLM-Assisted Presentation of Relay Catalysis Pathway subsection of the Methods (Section S1).
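Part of the ID-based check can be automated before the manual review. The sketch below builds a prompt that embeds the structured pathway verbatim and then flags any reaction ID in the model output that was not supplied to it. The prompt wording, the `RXN-<number>` ID format and the record keys are hypothetical, and the check complements rather than replaces tracing each step back to its publication.

```python
import json
import re

ID_PATTERN = re.compile(r"\bRXN-\d+\b")   # hypothetical ID format

def build_prompt(pathway: dict) -> str:
    """Embed the structured pathway verbatim so the LLM only rephrases it."""
    return (
        "Describe the relay catalysis pathway below for a chemist. Use only the "
        "data provided and cite the reaction_id of every step you mention.\n\n"
        + json.dumps(pathway, indent=2, ensure_ascii=False)
    )

def untraceable_ids(pathway: dict, llm_text: str) -> set[str]:
    """IDs cited by the LLM that do not occur in the supplied pathway."""
    known = {step["reaction_id"] for step in pathway["steps"]}
    return set(ID_PATTERN.findall(llm_text)) - known

def uncited_ids(pathway: dict, llm_text: str) -> set[str]:
    """Supplied IDs that the LLM description never mentions."""
    known = {step["reaction_id"] for step in pathway["steps"]}
    return known - set(ID_PATTERN.findall(llm_text))
```

An empty result from both functions indicates only that the description cites exactly the supplied steps; whether the conditions are reported correctly still has to be verified against the Cat-KG entries.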

Based on the Cat-KG-based relay catalysis pathway recommendation system developed in this work, 20 new pathways were proposed for 10 valuable target compounds. These compounds include lower olefins, ethylene, ethylene glycol, oxalic acid, propylene glycol, crotonaldehyde, 1,3-butadiene, 1,4-butanediol, cis-2-butene and 2,5-furandicarboxylic acid (FDCA). These pathways have not yet been reported in the literature (see Section S5 within the online supplementary material). They provide new opportunities for future research on pathway optimization and experimental validation.

Moreover, we identified four relay catalysis pathways that have been reported and validated in the literature (Fig. 5). These include a three-step pathway for the synthesis of ethylene (Fig. 5a), which starts with methanol production from syngas [52], followed by the conversion of methanol and carbon monoxide into acetic acid [53], and ends with the hydrogenation of acetic acid to ethylene [54]. The reaction conditions for each step were obtained from high-quality catalysis journals, ensuring the pathway’s reliability. Wang and co-workers confirmed the effectiveness of this pathway, achieving 85% ethylene selectivity from the methanol intermediate [1]. We also identified two three-step pathways for the synthesis of ethanol (Fig. 5b and c). The first pathway involves the synthesis of ethanol from syngas via dimethyl ether (DME) [55], which is then converted to methyl acetate [56], and finally hydrogenated to ethanol [57]. The second pathway follows the sequence of syngas to methanol [58], methanol to acetic acid [59], and acetic acid to ethanol through hydrogenation [60]. Both have been experimentally validated as effective relay catalysis pathways [3,9]. Finally, we identified a two-step pathway for the synthesis of FDCA (Fig. 5d), which begins with the conversion of d-fructose to 5-hydroxymethylfurfural (HMF) [61], followed by the oxidation of HMF to FDCA [62]. This pathway has also been validated [51].

Figure 5.


Relay catalysis pathways recommended by the KG-based recommendation system. Panel (a) displays a three-step pathway recommended for the synthesis of ethylene that is consistent with reported [1] relay catalysis pathways. Panels (b) and (c) display three-step pathways recommended for the synthesis of ethanol, consistent with reported [3,9] relay catalysis pathways. Panel (d) displays a two-step pathway recommended for the synthesis of FDCA, consistent with reported [51] relay catalysis pathways.

Taking the syngas-to-methanol conversion as an example (see Fig. 5 and Fig. S58), we discuss the effectiveness of the Cat-KG-based recommendation for relay catalysis pathways. The reported relay catalysis pathway [1] is compared with the Cat-KG-recommended three-step relay pathway from syngas to a methanol intermediate and finally to the target product ethylene. Both pathways share the same starting feedstock and intermediate, but the catalysts used in each step differ. In the methanol synthesis step, the Cat-KG-based recommendation system selects the well-established Zn–Cr oxide catalyst, which achieves a methanol yield of 92.6% at 573 K [52]. For the methanol carbonylation step, the Cat-KG recommends H-MOR-DA@C, which provides 60% acetic acid selectivity at 573 K [53], while the reported pathway uses H-MOR-DA [1]. In the acetic acid hydrogenation and decarboxylation step, the Cat-KG system adopts Mo₂C, achieving 90% conversion at 623 K [54], whereas the reported pathway employs ZnO–TiO₂ as the catalyst. Benefiting from the introduction of temperature similarity weighting in the recommendation system, all three steps are confined within a narrow temperature window of 573–623 K. This design enables each step to operate close to its optimal reaction condition and matches well with the reported pathway’s optimal relay catalysis temperature of 583 K [1].

Despite these achievements, several challenges remain in the integration of Cat-KG with GPT-4 for reaction pathway research. In the current work, each recommended reaction is treated independently. However, when combined into a full pathway, interactions between reaction conditions can influence overall catalytic performance [10]. Although the recommended pathways match reported ones in terms of feedstock, intermediates and products, the algorithm primarily focuses on selecting the best catalyst and conditions for individual steps. It does not yet account for cross-step factors such as acidity–metal activity matching, by-product management or hydrothermal stability. In contrast, the reported relay catalysis pathway from syngas to a methanol intermediate and finally to the target product ethylene [1] accounts for thermodynamic and kinetic coupling across multiple steps, as well as spatial distribution and compatibility of active sites, as shown in Fig. S58. For example, the hydrogenation of acetic acid produces a large amount of H₂O, which can poison the acid sites of the upstream catalyst H-MOR-DA, especially when water accumulates in confined spaces. The ZnO–TiO₂ catalyst used in the reported pathway enables a simultaneous water–gas shift reaction, i.e. H₂O + CO → H₂ + CO₂, which converts water into gas-phase products in situ and mitigates deactivation, thereby maintaining the carbonylation activity in the first step. Such cross-step coupling is essential for ensuring long-term catalyst stability and high product selectivity [1]. At present, there is no efficient and universal method to intelligently recommend or determine optimal conditions such as metal valence states, catalyst ratios, proportions, particle sizes and mixing strategies. Even the most advanced LLMs cannot accurately predict these parameters, making manual experimentation necessary to explore different catalyst combinations and reaction conditions in order to identify the optimal configuration.

Looking ahead, we plan to develop an AI-driven strategy that recommends optimal conditions for individual reactions within a pathway while also accounting for how these conditions interact, including synergies and mutual influences between steps. To further improve the accuracy and practicality of pathway recommendations, we also plan to introduce reinforcement learning techniques such as interactive reinforcement learning from human feedback [63], in which expert feedback on pathway preferences is used to optimize the scoring weights so that the results align better with actual research needs. Additionally, we plan to expand the Cat-KG beyond thermal catalysis by continuously incorporating data on photocatalysis and electrocatalysis. This will help build a comprehensive KG for the catalysis field, better supporting researchers in tasks such as reaction and literature retrieval, question answering and even more complex knowledge reasoning and prediction. As our KG grows, our data-driven recommendation strategy will become increasingly effective.
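
To indicate how expert preferences might be turned into scoring weights, the following is a minimal sketch of one possible scheme, assuming a linear pathway score and pairwise expert judgements. It uses a simple logistic (Bradley–Terry style) preference update rather than a full reinforcement learning loop, and the feature names, numerical values and function names are all hypothetical; none of them are taken from the present work.

```python
import math


def score(weights, features):
    """Linear pathway score: weighted sum of pathway features."""
    return sum(w * x for w, x in zip(weights, features))


def update_from_preference(weights, preferred, rejected, lr=0.1):
    """One gradient step on the loss -log(sigmoid(s_preferred - s_rejected)).

    The step raises the score of the pathway the expert preferred
    relative to the one the expert rejected.
    """
    margin = score(weights, preferred) - score(weights, rejected)
    grad_scale = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
    return [
        w + lr * grad_scale * (p - r)
        for w, p, r in zip(weights, preferred, rejected)
    ]


# Hypothetical features: [temperature similarity, mean yield, source quality].
pathway_a = [0.9, 0.5, 0.8]  # tight temperature window, moderate yield
pathway_b = [0.4, 0.9, 0.8]  # high yield, mismatched temperatures

weights = [0.2, 0.6, 0.2]    # assumed initial weights that favour yield
print("before:", score(weights, pathway_a) > score(weights, pathway_b))

# Assume the expert repeatedly prefers pathway A over pathway B.
for _ in range(50):
    weights = update_from_preference(weights, pathway_a, pathway_b)

print("after: ", score(weights, pathway_a) > score(weights, pathway_b))
```

In this toy setting the weight on temperature similarity grows and the weight on yield shrinks until the ranking agrees with the expert, while the weight on the feature shared by both pathways is left unchanged.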

CONCLUSIONS

We successfully developed an approach that combines a KG with LLMs to recommend relay catalysis pathways effectively. Our method used a Gemini-assisted workflow for extracting full-text data, resulting in structured, high-quality catalytic reaction and literature data. This enabled us to create a detailed Cat-KG, built from 15 881 catalysis literature sources and including 27 760 thermocatalytic reactions, which formed the basis of our pathway recommendation system. Using these extensive data, along with chemistry-based filtering rules and GPT-4’s understanding and natural language answering capabilities, our system quickly identified valuable relay catalysis pathways. This included four pathways that matched those reported in the literature and 20 previously unreported valuable pathways. Each pathway search and recommendation can be completed within minutes. This demonstrates that our KG and LLM-based approach can improve the efficiency of relay catalysis research and aid in discovering new reaction pathways. As we continue to expand the Cat-KG and leverage advanced AI technologies, we expect to assist chemists not only in thermal catalysis research, but also in photocatalytic and electrocatalytic applications.

Supplementary Material

nwaf271_Supplemental_File

Contributor Information

Fei Fu, State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China.

Qing-Qing Li, State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China.

Fangrong Wang, School of Informatics, The University of Edinburgh, Edinburgh EH8 9AB, UK.

Jie Hu, Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China.

Tian-Tian Wang, Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China.

Yun-Pei Liu, State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China.

Weihong Xu, Laboratory of AI for Electrochemistry (AI4EC), Tan Kah Kee Innovation Laboratory (IKKEM), Xiamen 361100, China.

Zhili Lin, Laboratory of AI for Electrochemistry (AI4EC), Tan Kah Kee Innovation Laboratory (IKKEM), Xiamen 361100, China.

Fu-Qiang Gong, State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China.

Qi-Yuan Fan, State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China; School of Chemistry and Chemical Engineering, Shanxi University, Taiyuan 030006, China.

Jeff Z Pan, School of Informatics, The University of Edinburgh, Edinburgh EH8 9AB, UK.

Ye Wang, State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China.

Jun Cheng, State Key Laboratory of Physical Chemistry of Solid Surface, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China; Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China; Laboratory of AI for Electrochemistry (AI4EC), Tan Kah Kee Innovation Laboratory (IKKEM), Xiamen 361100, China.

DATA AVAILABILITY

The Cat-KG constructed in this work enables the retrieval of catalytic reaction information and data (https://ai4ec.ikkem.com/apps/chembrain). The data supporting the findings of this study are available within the article and its online supplementary material, or from the authors upon reasonable request.

FUNDING

This work was supported by the National Natural Science Foundation of China (22225302, 92161113, 21991151, 21991150 and 22021001), the Fundamental Research Funds for the Central Universities (20720220009 and 20720230090) and the Laboratory of AI for Electrochemistry (AI4EC), Tan Kah Kee Innovation Laboratory (IKKEM) (RD2023100101 and RD2022070501).

AUTHOR CONTRIBUTIONS

J.C. conceived, designed and supervised the project. F.F. and Q.-Q.L. designed and conducted data extraction, and developed the relay catalysis pathway recommendation system. F.F. and T.-T.W. completed data acquisition and processing. J.H. contributed to knowledge processing and the storage of the knowledge graph. Y.-P.L., F.-Q.G. and Q.-Y.F. designed the schema for catalytic reactions. W.X. and Z.L. completed the design and establishment of the Chem-Brain platform. F.W. and J.Z.P. provided experimental guidance and assistance in the knowledge graph construction and application. Y.W. provided help in the design and validation of the relay catalysis pathway recommendation system. All authors wrote the paper and have reviewed, discussed and approved the results and conclusions.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Chen K, Wang F, Wang Y et al. Relay catalysis for highly selective conversion of methanol to ethylene in syngas. JACS Au 2023; 3: 2894–904. doi: 10.1021/jacsau.3c00463
  • 2. Cheng K, Li Y, Kang J et al. Selectivity control by relay catalysis in CO and CO₂ hydrogenation to multicarbon compounds. Acc Chem Res 2024; 57: 714–25. doi: 10.1021/acs.accounts.3c00734
  • 3. Zhou W, Kang J, Cheng K et al. Direct conversion of syngas into methyl acetate, ethanol, and ethylene by relay catalysis via the intermediate dimethyl ether. Angew Chem Int Ed 2018; 57: 12012–6. doi: 10.1002/anie.201807113
  • 4. Jiao F, Bai B, Li Y et al. Disentangling the activity-selectivity trade-off in catalytic conversion of syngas to light olefins. Science 2023; 380: 727–30. doi: 10.1126/science.adg2491
  • 5. Pan X, Jiao F, Miao D et al. Oxide-zeolite-based composite catalyst concept that enables syngas chemistry beyond Fischer-Tropsch synthesis. Chem Rev 2021; 121: 6588–609. doi: 10.1021/acs.chemrev.0c01012
  • 6. Dry ME. High quality diesel via the Fischer–Tropsch process—a review. J Chem Technol Biot 2002; 77: 43–50. doi: 10.1002/jctb.527
  • 7. Dry ME. The Fischer–Tropsch process: 1950–2000. Catal Today 2002; 7: 227–41. doi: 10.1016/S0920-5861(01)00453-9
  • 8. Jiao F, Li J, Pan X et al. Selective conversion of syngas to light olefins. Science 2016; 351: 1065–8. doi: 10.1126/science.aaf1835
  • 9. Kang J, He S, Zhou W et al. Single-pass transformation of syngas into ethanol with high selectivity by triple tandem catalysis. Nat Commun 2020; 11: 827. doi: 10.1038/s41467-020-14672-8
  • 10. Han S, Fan D, Chen N et al. Efficient conversion of syngas into ethanol by tandem catalysis. ACS Catal 2023; 13: 10651–60. doi: 10.1021/acscatal.3c01577
  • 11. Ji S, Pan S, Cambria E et al. A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans Neural Netw Learn Sys 2022; 33: 494–514. doi: 10.1109/TNNLS.2021.3070843
  • 12. Zou X. A survey on application of knowledge graph. J Phys Conf Ser 2020; 1487: 012016. doi: 10.1088/1742-6596/1487/1/012016
  • 13. Venugopal V, Olivetti E. MatKG: an autonomously generated knowledge graph in material science. Sci Data 2024; 11: 217. doi: 10.1038/s41597-024-03039-z
  • 14. Mikulak-Klucznik B, Golebiowska P, Bayly AA et al. Computational planning of the synthesis of complex natural products. Nature 2020; 588: 83–8. doi: 10.1038/s41586-020-2855-y
  • 15. Szymkuc S, Gajewska EP, Klucznik T et al. Computer-assisted synthetic planning: the end of the beginning. Angew Chem Int Ed 2016; 55: 5904–37. doi: 10.1002/anie.201506101
  • 16. Wolos A, Koszelewski D, Roszak R et al. Computer-designed repurposing of chemical wastes into drugs. Nature 2022; 604: 668–76. doi: 10.1038/s41586-022-04503-9
  • 17. Gothard CM, Soh S, Gothard NA et al. Rewiring chemistry: algorithmic discovery and experimental validation of one-pot reactions in the network of organic chemistry. Angew Chem Int Ed 2012; 51: 7922–7. doi: 10.1002/anie.201202155
  • 18. Zhang Z, Ma S, Zheng S et al. Semantic knowledge graph as a companion for catalyst recommendation. Natl Sci Open 2024; 3: 20230040. doi: 10.1360/nso/20230040
  • 19. Fang Y, Zhang Q, Zhang N et al. Knowledge graph-enhanced molecular contrastive learning with functional prompt. Nat Mach Intell 2023; 5: 542–53. doi: 10.1038/s42256-023-00654-0
  • 20. Menon A, Krdzavac NB, Kraft M. From database to knowledge graph–using data in chemistry. Curr Opin Chem Eng 2019; 26: 33–7. doi: 10.1016/j.coche.2019.08.004
  • 21. Yang Z, Yuan S, Shao Z et al. A review on synergizing knowledge graphs and large language models. Computing 2025; 107: 143. doi: 10.1007/s00607-025-01499-8
  • 22. Chen X, Jia S, Xiang Y. A review: knowledge reasoning over knowledge graph. Expert Syst Appl 2020; 141: 112948. doi: 10.1016/j.eswa.2019.112948
  • 23. Alam M, Asefa Gesese G, Paris PH. Neurosymbolic methods for dynamic knowledge graphs. arXiv: 2409.04572.
  • 24. Hogan A, Blomqvist E, Cochez M et al. Knowledge graphs. ACM Comput Surv 2021; 54: 71.
  • 25. Ji S, Pan S, Cambria E et al. A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans Neural Netw Learn Sys 2022; 33: 494–514. doi: 10.1109/TNNLS.2021.3070843
  • 26. Kearnes SM, Maser MR, Wleklinski M et al. The open reaction database. J Am Chem Soc 2021; 143: 18820–6. doi: 10.1021/jacs.1c09820
  • 27. Weber JM, Guo Z, Zhang C et al. Chemical data intelligence for sustainable chemistry. Chem Soc Rev 2021; 50: 12013–36. doi: 10.1039/D1CS00477H
  • 28. Goodman J. Computer software review: Reaxys. J Chem Inf Model 2009; 49: 2897–8. doi: 10.1021/ci900437n
  • 29. Wagner AB. Scifinder scholar 2006: an empirical analysis of research topic query processing. J Chem Inf Model 2006; 46: 767–74. doi: 10.1021/ci050481b
  • 30. Jiang X, Xu C, Shen Y et al. On the evolution of knowledge graphs: a survey and perspective. arXiv: 2310.04835.
  • 31. Guo J, Ibanez-Lopez AS, Gao H et al. Automated chemical reaction extraction from scientific literature. J Chem Inf Model 2022; 62: 2035–45. doi: 10.1021/acs.jcim.1c00284
  • 32. Swain MC, Cole JM. Chemdataextractor: a toolkit for automated extraction of chemical information from the scientific literature. J Chem Inf Model 2016; 56: 1894–904. doi: 10.1021/acs.jcim.6b00207
  • 33. Wang W, Jiang X, Tian S et al. Automated pipeline for superalloy data by text mining. npj Comput Mater 2022; 8: 9. doi: 10.1038/s41524-021-00687-2
  • 34. Brown T, Mann B, Ryder N et al. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems 2020; 33: 1877–901.
  • 35. Gemini Team Google. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv: 2403.05530.
  • 36. OpenAI, Achiam J, Adler S et al. GPT-4 technical report. arXiv: 2303.08774.
  • 37. Pan JZ, Razniewski S, Kalo JC et al. Large language models and knowledge graphs: opportunities and challenges. arXiv: 2308.06374.
  • 38. Zheng Z, Zhang O, Borgs C et al. ChatGPT chemistry assistant for text mining and the prediction of MOF synthesis. J Am Chem Soc 2023; 145: 18048–62. doi: 10.1021/jacs.3c05819
  • 39. Zheng Z, Zhang O, Nguyen HL et al. ChatGPT research group for optimizing the crystallinity of MOFs and COFs. ACS Central Sci 2023; 9: 2161–70. doi: 10.1021/acscentsci.3c01087
  • 40. Hao X, Ji Z, Li X et al. Construction and application of a knowledge graph. Remote Sens 2021; 13: 2511. doi: 10.3390/rs13132511
  • 41. Li H, Cao H, Feng B et al. Beyond chemical QA: evaluating LLM’s chemical reasoning with modular chemical operations. arXiv: 2505.21318.
  • 42. Mirza A, Alampara N, Kunchapu S et al. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat Chem 2025; 17: 1027–34. doi: 10.1038/s41557-025-01815-x
  • 43. Su Y, Wang X, Ye Y et al. Automation and machine learning augmented by large language models in a catalysis study. Chem Sci 2024; 15: 12200–33. doi: 10.1039/D3SC07012C
  • 44. Song T, Luo M, Zhang X et al. A multiagent-driven robotic AI chemist enabling autonomous chemical research on demand. J Am Chem Soc 2025; 147: 12534–45. doi: 10.1021/jacs.4c17738
  • 45. Ruan Y, Lu C, Xu N et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nat Commun 2024; 15: 10160. doi: 10.1038/s41467-024-54457-x
  • 46. Lee S, Heinen S, Khan D et al. Autonomous data extraction from peer reviewed literature for training machine learning models of oxidation potentials. Mach Learn Sci Technol 2024; 5: 015052. doi: 10.1088/2632-2153/ad2f52
  • 47. Thik J, Wang S, Wang C et al. Realizing the cooking recipe of materials synthesis through large language models. J Mater Chem A 2023; 11: 25849–53. doi: 10.1039/D3TA05457H
  • 48. Suvarna M, Vaucher AC, Mitchell S et al. Language models and protocol standardization guidelines for accelerating synthesis planning in heterogeneous catalysis. Nat Commun 2023; 14: 7964. doi: 10.1038/s41467-023-43836-5
  • 49. Zhang W, Wang Q, Kong X et al. Fine-tuning large language models for chemical text mining. Chem Sci 2024; 15: 10600–11. doi: 10.1039/D4SC00924J
  • 50. Francis N, Green A, Guagliardo P et al. Cypher: an evolving query language for property graphs. In: Proceedings of the 2018 International Conference on Management of Data. New York: Association for Computing Machinery, 2018, 1433–45. doi: 10.1145/3183713.3190657
  • 51. Rathod PV, Jadhav VH. Efficient method for synthesis of 2,5-furandicarboxylic acid from 5-hydroxymethylfurfural and fructose using Pd/CC catalyst under aqueous conditions. ACS Sustain Chem Eng 2018; 6: 5766–71. doi: 10.1021/acssuschemeng.7b03124
  • 52. Tan L, Wang F, Zhang P et al. Design of a core–shell catalyst: an effective strategy for suppressing side reactions in syngas for direct selective conversion to light olefins. Chem Sci 2020; 11: 4097–105. doi: 10.1039/C9SC05544D
  • 53. Zhang F, Chen K, Jiang Q et al. Selective transformation of methanol to ethanol in the presence of syngas over composite catalysts. ACS Catal 2022; 12: 8451–61. doi: 10.1021/acscatal.2c01725
  • 54. Schaidle JA, Blackburn J, Farberow CA et al. Experimental and computational investigation of acetic acid deoxygenation over oxophilic molybdenum carbide: surface chemistry and active site identity. ACS Catal 2016; 6: 1181–97. doi: 10.1021/acscatal.5b01930
  • 55. Gentzen M, Doronkin DE, Sheppard TL et al. Supported intermetallic PdZn nanoparticles as bifunctional catalysts for the direct synthesis of dimethyl ether from CO-rich synthesis gas. Angew Chem Int Ed 2019; 58: 15655–9. doi: 10.1002/anie.201906256
  • 56. Zhou Z, Liu H, Ni Y et al. Direct conversion of dimethyl ether and CO to acetone via coupling carbonylation and ketonization. J Catal 2021; 396: 360–73. doi: 10.1016/j.jcat.2021.03.006
  • 57. Ren Z, Younis MN, Li C et al. Highly active Ce, Y, La-modified Cu/SiO₂ catalysts for hydrogenation of methyl acetate to ethanol. RSC Adv 2020; 10: 5590–603. doi: 10.1039/C9RA08780J
  • 58. Hartadi Y, Widmann D, Behm RJ. Methanol formation by CO₂ hydrogenation on Au/ZnO catalysts–effect of total pressure and influence of CO on the reaction characteristics. J Catal 2016; 333: 238–50. doi: 10.1016/j.jcat.2015.11.002
  • 59. Feng S, Lin X, Song X et al. Insight into the stability of binuclear Ir–La catalysts for efficient heterogeneous methanol carbonylation. J Catal 2019; 377: 400–8. doi: 10.1016/j.jcat.2019.06.050
  • 60. Li W, Ye L, Long P et al. Efficient Ru–Fe catalyzed selective hydrogenolysis of carboxylic acids to alcoholic chemicals. RSC Adv 2014; 4: 29072–82. doi: 10.1039/C4RA03201B
  • 61. Crisci AJ, Tucker MH, Lee MY et al. Acid-functionalized SBA-15-type silica catalysts for carbohydrate dehydration. ACS Catal 2011; 1: 719–28. doi: 10.1021/cs2001237
  • 62. Casanova O, Iborra S, Corma A. Biomass into chemicals: Aerobic oxidation of 5-hydroxymethyl-2-furfural into 2,5-furandicarboxylic acid with gold nanoparticle catalysts. ChemSusChem 2009; 2: 1138–44. doi: 10.1002/cssc.200900137
  • 63. Lin J, Ma Z, Gomez R et al. A review on interactive reinforcement learning from human social feedback. IEEE Access 2020; 8: 120757–65. doi: 10.1109/ACCESS.2020.3006254



Articles from National Science Review are provided here courtesy of Oxford University Press
