Abstract
Inorganic synthesis planning currently relies primarily on heuristic approaches or machine learning models trained on limited data sets, which constrains its generality. We demonstrate that language models (LMs) without task-specific fine-tuning can recall synthesis conditions reported in the scientific literature. Off-the-shelf models, such as GPT-4.1, Gemini 2.0 Flash, and Llama 4 Maverick, achieve a Top-1 precursor prediction accuracy of up to 53.8% and a Top-5 accuracy of 66.8% on a held-out set of 1000 reactions. They also predict calcination and sintering temperatures with mean absolute errors of <126 °C, matching or surpassing specialized regression models. Ensembling these LMs further enhances predictive accuracy and reduces inference cost per prediction by up to 70%. Given the broad, cross-domain knowledge of LMs, we evaluate whether they enable knowledge transfer by training a transformer, SyntMTE, on 28,548 LM-generated reaction recipes. Compared to a model trained on literature-reported data, we find that a model trained solely on LM-generated data exhibits competitive performance (only 6% worse). Moreover, a model trained on both the LM-generated and literature-reported data improves performance by up to 4%. In a case study on Li7La3Zr2O12 solid-state electrolytes, we demonstrate that SyntMTE reproduces the experimentally observed dopant-dependent sintering trends. Our hybrid workflow enables scalable and data-efficient inorganic synthesis planning.
Keywords: large language models, solid-state synthesis, precursor recommendation, synthesis condition prediction, synthetic data augmentation


1. Introduction
The discovery and design of advanced materials underpin progress in energy conversion and storage, information technology, and medicine. Recent advances in machine learning (ML)-accelerated simulations have driven a rapid increase in computationally predicted candidate materials, now numbering in the millions. As a result, the synthesis of these candidates has become the principal bottleneck in the materials discovery pipeline. Although density functional theory (DFT) provides valuable thermodynamic insight, it remains challenging to accurately predict kinetics, diffusion, or phase-transformation pathways, leaving synthesis largely a trial-and-error process.
Accordingly, researchers have adopted ML methods to extract synthesis protocols from the scientific literature and predict feasible reaction pathways for novel structures. Foundational studies by Kononova et al., Kim et al., and Huo et al. curated comprehensive synthesis databases, thereby enabling ML-based approaches to inorganic synthesis planning. Recent efforts have focused on ML methods for two tasks: (i) precursor recommendation, i.e., identifying suitable reagent combinations; and (ii) synthesis-condition prediction, i.e., determining optimal reaction parameters. These tasks are then applied sequentially to propose a viable synthesis protocol for a target material.
1.1. Precursor Recommendation
Most precursor recommendation methods focus on solid-state synthesis. An example for LiCoO2 is illustrated in Table 1, where a model ranks suitable precursor combinations by their likelihood. Kim et al. used an RNN with ELMo embeddings to extract over 50,000 synthesis actions and 116,000 precursor mentions and subsequently trained a paired conditional VAE to jointly model action sequences and precursor formulas, enabling plausible precursor suggestions for novel targets. Kim introduced an element-wise retrosynthesis with 39 template classes, while He et al. developed a retrieval-based model using attention to compare planned synthesis to historical routes, extended by Noh et al. via an enthalpy-aware ranker. Prein et al. improved generalization by embedding materials with a pretrained transformer using a pairwise ranker to assess precursor suitability for unseen compounds.
Table 1. Top-10 LiCoO2 Precursor Sets

Predictions from GPT-4.1 assuming a standard solid-state synthesis protocol.
1.2. Synthesis Condition Prediction
After precursor selection, a second model predicts the isothermal-hold temperatures and dwell times for both the calcination and the sintering steps in solid-state synthesis routes. Huo et al. applied linear and tree-based regression on text-mined features (e.g., melting points and formation energies), achieving a mean absolute error (MAE) of approximately 140 °C. Prein et al. developed a Reaction Graph Network using MTEncoder embeddings and graph-attention layers. Pan et al. framed condition prediction as a diffusion-based generative task conditioned on the target material’s structure, capturing the one-to-many nature of structure-synthesis relationships.
Despite the preponderance of ML methods, models are still bottlenecked by data-centric concerns. Synthesis databases remain relatively small, only rarely exceeding a few × 10³ unique entries, leaving the majority of chemistries unrepresented (Figure 4). The limited size of existing data sets inhibits recovery of the true distribution of processing parameters for most materials, thereby precluding the robust mapping of a “synthesis window”, the region of temperature and isothermal dwell-time combinations that yield the desired phase. In spirit, this mapping parallels long-standing tools in experimental synthesis, most notably, phase-stability regimes in classical time-temperature transformation (TTT) diagrams. Automated text-mining pipelines further degrade data quality by introducing extraction errors, such as misassigned stoichiometries, omitted precursor references, and conflation of precursor and target species, particularly in complex multistep protocols. Consequently, ML models trained on these sparse, noisy data sets cannot confidently resolve the underlying “synthesis window”, leading to diminished predictive accuracy and poor generalization to novel materials.
Figure 4. t-SNE projection of inorganic precursor compositions in embedding space. (a) Compositions from the Kononova data set are shown in purple, and LM-generated compositions in orange. Compositions are represented by standardized elemental-fraction vectors and projected via t-SNE. The expanded spread of orange points indicates that our generated data set spans a larger chemical composition space than the baseline data set. (b,c) Distributions of generated processing parameters.
1.3. Language Models in Materials Synthesis
In contrast, pretrained LMs are trained on orders of magnitude more unstructured chemical knowledge: implicit heuristics, phase-diagram insights, and procedural narratives from their extensive pretraining corpora. They have demonstrated considerable utility across scientific disciplines. Generative LMs have excelled in crystal-structure generation: CrystaLLM and Crystal-Text-LLM produce DFT-validated geometries, while FlowLLM refines LLM-generated structures using flow matching. In synthesis planning, GPT variants fine-tuned for synthesizability and precursor prediction may match specialized models. ALDbench evaluates LMs on open-ended atomic layer deposition questions and finds that GPT-4o scores a reasonable 3.7 out of 5, resembling a passing grade. However, current state-of-the-art LMs have never been systematically evaluated on generating precursors and processing conditions for solid-state inorganic synthesis, the crucial domain for materials discovery. Benchmarking LMs in this domain allows researchers to make informed decisions on LM performance, as well as guide their choice toward certain LMs. Beyond establishing a benchmark featuring current state-of-the-art models, we investigate directions of immediate relevance to materials synthesis using ML. We compare LM ensembles with single-model workflows and evaluate generated synthesis condition distributions. We probe whether LM-generated data can augment literature-mined data, i.e., whether synthetic data can serve as a prior for domain-specific models. To the best of our knowledge, these research directions have not been systematically explored in prior work. This work aims to answer the following questions:
1. How well do state-of-the-art LMs perform on inorganic solid-state synthesis planning tasks?
2. Do LM ensembles outperform single models, and how do they alter the distribution of proposed recipes?
3. Can LM-generated synthesis recipes enrich current literature-mined sparse databases and act as an informative prior for domain-specific models?
We find that synthetic LM-generated data boosts the prediction accuracy. We leverage the models trained on the augmented data set in a case study on doped Li7La3Zr2O12 (LLZO) compounds, a functional ceramic whose cubic phase is challenging to stabilize and therefore requires careful selection of dopants and sintering steps. Overall, our contributions are as follows:
We demonstrate strong performance of LMs in recalling previously reported synthesis trends. Moreover, we demonstrate that an ensemble of LMs surpasses individual models in precursor suggestion and processing-condition prediction, and can even accurately reconstruct synthesis condition distributions from the literature.
We show how LM-generated data can be beneficial to domain-specific expert models by curating a synthetic data set of 28,548 complete LM-generated solid-state synthesis recipes, which signifies a 6-fold increase in complete entries over existing solid-state synthesis data sets.
Leveraging both literature-mined and synthetic data, we develop a transformer-based model (SyntMTE) to regress synthesis conditions for solid-state synthesis. This model outperforms previous methods, such as CrabNet, composition-based neural networks, and decision-tree regressors, and models exhibit improvements of up to 4% when initially trained on the synthetic data set.
2. Results and Discussion
2.1. Benchmarking LMs in Inorganic Materials Synthesis
To assess LM capabilities in inorganic synthesis planning, we deploy state-of-the-art LMs in two solid-state synthesis tasks. We use a data set derived from Kononova et al., which contains roughly 10,000 unique precursor–target material combinations. For precursor recommendation, we randomly sample a 1000 data point test data set in accordance with previous work. We submitted prompts to an LM provider (via OpenRouter) without specifying the number of precursors, thereby requiring each model to infer the appropriate precursor count for each reaction (Listing A1). For synthesis condition prediction, a 1000 entry data set is sampled by filtering for recipes containing complete sintering and calcination temperature information. For both tasks, we provide the LMs with 40 in-context examples from the held-out validation fraction of the data set (Figure S1). For precursor prediction, we evaluate the exact match accuracy. We compare 7 state-of-the-art LMs to cover a diverse set of models. Further details on the different LMs can be found in Supporting Information.
2.1.1. Precursor Recommendation
We evaluate the LMs on the precursor prediction task and report Top-k exact match accuracies. The exact match accuracy can be considered a lower bound on performance, as the model is required to reproduce precisely the precursor set reported in the literature, whereas alternative unreported synthesis routes may also exist. Because practical experimental synthesis requires a diverse set of multiple candidate precursor sets, Top-5 and Top-10 metrics are particularly informative, indicating whether a correct precursor set appears within the model’s top 5 or 10 suggestions. Table 2 shows that all LMs deliver competitive performance and cluster within a narrow performance band. Only Qwen 2.5 VL scores noticeably lower. OpenAI GPT-4.1 leads the Top-1 ranking at 53.8% and retains good performance for high k. The model is followed by Grok 3 mini Beta and Llama 4 Maverick as well as DeepSeek Chat v3. We compare the LM results with previously published domain-specific ML models from the literature (Table S4). Notably, the top-ranked LMs outperform the domain-specific expert models by a distinct margin. However, the comparison is only partly valid. Baseline models were trained on limited data sets, while LMs may have benefited from data leakage during pretraining on test-set synthesis protocols. The best baseline reported in the literature, Synthesis Similarity, achieves Top-5 and Top-10 accuracies of 58% and 61%, respectively, while individual LMs attain scores up to 67% and 70%. This example provides a compelling demonstration that state-of-the-art LMs, without any chemistry-specific training objectives, are able to recall high-quality chemistry knowledge through in-context learning only.
Table 2. Precursor Recommendation Performance
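The Top-k exact-match metric described above can be illustrated with a minimal sketch. It assumes that a prediction counts as a match when its precursor set equals the reference set regardless of order; the function names and the toy data are hypothetical and are not taken from the benchmark code.

```python
from typing import FrozenSet, Iterable, Sequence


def normalize(precursors: Iterable[str]) -> FrozenSet[str]:
    """Treat a precursor set as order-independent (assumed matching rule)."""
    return frozenset(p.strip() for p in precursors)


def top_k_exact_match(
    ranked_predictions: Sequence[Sequence[Iterable[str]]],
    references: Sequence[Iterable[str]],
    k: int,
) -> float:
    """Fraction of targets whose reference set appears in the top-k suggestions."""
    hits = 0
    for candidates, reference in zip(ranked_predictions, references):
        ref = normalize(reference)
        if any(normalize(c) == ref for c in candidates[:k]):
            hits += 1
    return hits / len(references)


# Toy data: two targets, each with a ranked list of candidate precursor sets.
predictions = [
    [["Li2CO3", "Co3O4"], ["LiOH", "Co3O4"]],  # reference ranked first
    [["BaO", "TiO2"], ["BaCO3", "TiO2"]],      # reference ranked second
]
references = [["Co3O4", "Li2CO3"], ["BaCO3", "TiO2"]]

print(top_k_exact_match(predictions, references, k=1))  # 0.5
print(top_k_exact_match(predictions, references, k=2))  # 1.0
```

Because only the literature-reported set counts as correct, a chemically valid but unreported route is scored as a miss, which is why the metric acts as a lower bound.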
| model | Top-1 ↑ | Top-3 ↑ | Top-5 ↑ | Top-10 ↑ |
|---|---|---|---|---|
| ensemble min-rank 1 | 52.3 | 65.8 | 70.7 | 74.3 |
| ensemble min-rank 2 | 51.8 | 63.1 | 67.4 | 71.9 |
| OpenAI GPT-4.1 | 53.8 | 64.1 | 66.1 | 68.7 |
| Grok 3 mini Beta | 52.2 | 63.2 | 66.8 | 69.5 |
| Llama 4 Maverick | 53.1 | 61.1 | 64.2 | 69.3 |
| DeepSeek Chat v3 | 53.5 | 60.7 | 63.7 | 66.2 |
| Mistral Small 3.1 | 52.0 | 59.7 | 61.7 | 63.9 |
| Gemini 2.0 Flash | 51.4 | 59.2 | 62.0 | 66.2 |
| Qwen 2.5 VL | 50.7 | 55.5 | 58.0 | 59.3 |
Top-k exact-match accuracies for individual LMs and ensemble strategies on retrosynthesis precursor prediction. GPT-4.1 achieves the highest Top-1 accuracy, while min-rank ensembles boost performance at higher Top-k thresholds. Notably, the ensemble of Llama 4 Maverick, DeepSeek Chat v3, and Gemini 2.0 Flash surpasses GPT-4.1 for relevant Top-5 and Top-10 settings with a 70% reduction in inference cost.
Ensemble choices: min-rank 1 combines OpenAI GPT-4.1, Llama 4 Maverick, and Grok 3 mini Beta; min-rank 2 combines Llama 4 Maverick, DeepSeek Chat v3, and Gemini 2.0 Flash.
2.1.2. LM Ensemble
We evaluate the suitability of an ensemble approach. Using performance on the validation set, we construct an ensemble of LMs comprising Grok 3 mini Beta, OpenAI GPT-4.1, and Llama 4 Maverick, and compare three aggregation strategies:
Min-rank: assign each item the best rank it received across all models, promoting any item that at least one model ranks highly.
Average-rank: compute the average rank across models, balancing contributions from all models and reducing the impact of any single outlier.
Max-rank: assign each item the worst rank it received, ensuring only items consistently favored by every model appear at the top.
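The three aggregation strategies above can be sketched as a single rank-fusion routine that differs only in how per-model ranks are combined. This is an illustrative sketch, not the authors' implementation: the handling of items that a model did not propose (assigned rank = list length + 1) and the tie-breaking by first appearance are assumptions.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def fuse_rankings(
    rankings: Sequence[Sequence[Hashable]],
    combine: Callable[[List[float]], float],
) -> List[Hashable]:
    """Fuse per-model ranked lists into one list ordered by a combined rank."""
    # Collect all distinct items in first-seen order.
    items: List[Hashable] = []
    for ranking in rankings:
        for item in ranking:
            if item not in items:
                items.append(item)

    fused: Dict[Hashable, float] = {}
    for item in items:
        # Assumption: an item a model did not propose gets rank len(list) + 1.
        ranks = [
            float(r.index(item) + 1) if item in r else float(len(r) + 1)
            for r in rankings
        ]
        fused[item] = combine(ranks)

    # sorted() is stable, so ties keep first-seen order.
    return sorted(items, key=lambda it: fused[it])


def mean_rank(ranks: List[float]) -> float:
    return sum(ranks) / len(ranks)


# Toy rankings; each label stands for one candidate precursor set.
model_a = ["A", "B", "C"]
model_b = ["A", "C", "B"]
model_c = ["D", "A", "B"]
rankings = [model_a, model_b, model_c]

print(fuse_rankings(rankings, min))        # ['A', 'D', 'B', 'C']
print(fuse_rankings(rankings, mean_rank))  # ['A', 'B', 'C', 'D']
print(fuse_rankings(rankings, max))        # ['A', 'B', 'C', 'D']
```

In the toy example, candidate D is proposed by only one model but ranked first there, so min-rank promotes it while average-rank and max-rank push it down. This is the mechanism by which min-rank can raise recall at larger k.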
We observe that the min-rank and average-rank schemes substantially improve performance at Top-3, Top-5, and Top-10 (Figure 1), at the expense of a small drop in Top-1 accuracy compared to that of the best individual model. The high recall achieved by the ensemble arises from its diversity. Our observation is supported by the information-retrieval literature, where rank-fusion methods have been found to consistently improve recall by leveraging the complementary strengths of diverse rankers across topics and queries, demonstrating that greater diversity correlates with enhanced recall in ranking tasks. Consequently, min-rank and average-rank aggregation schemes are particularly effective for precursor prediction: unlike max-rank, they exploit the complementary strengths of diverse rankers and thus improve recall.
Figure 1. Ensemble comparison. Top-k exact match accuracy for three individual LMs (Grok 3 mini Beta, GPT-4.1, and Llama 4 Maverick) and their joint ensemble with predictions combined using minimum-rank, average-rank, and maximum-rank voting. The minimum-rank ensemble achieves the best recall beyond Top-1.
2.1.3. Synthesis Condition Prediction
For the synthesis condition prediction, we evaluate LM performance in predicting parameters for a standard solid-state synthesis protocol involving two heating steps. First, precursor powders are mixed and homogenized to ensure a uniform distribution. Next, in the calcination stage, the precursor blend is calcined, activating thermal decomposition and diffusion that drive the formation of the target phase. Finally, during the sintering stage, elevated temperature promotes atomic grain-boundary and volume diffusion, which drives neck formation and growth, thereby consolidating and densifying the powder particles into a cohesive bulk body. In order to replicate experimental workflows, we prompt the LMs to predict calcination and sintering temperatures. The generated conditions are then compared against the curated 1000 entry subset of the data published by Kononova et al. We omit the associated dwell times here, as they are found to be strongly dependent on anthropogenic factors (e.g., an experimentalist’s preference to report 24 h over 19 h). Correspondingly, prior regression-based models tend to overfit to these factors rather than the underlying thermodynamics, resulting in poor predictive performance. In general, we note that LMs are fundamentally next-token predictors trained on a classification objective, supporting little inductive bias for regression. However, different works have focused on applying LMs to difficult numerical tasks with considerable success. In practice, synthesis temperatures are typically reported as integers (e.g., 800 °C), which are more accessible to the LMs.
Table 3 presents the results of our experiments, comparing the performance of LMs on the synthesis condition prediction task. For calcination temperature regression, OpenAI GPT-4.1 is the best-performing model, followed by Gemini 2.0 Flash and DeepSeek Chat v3. In the sintering temperature regression, Gemini 2.0 Flash achieves the best performance, with Llama 4 Maverick and OpenAI GPT-4.1 ranking next. Grok 3 mini Beta, previously second in precursor prediction, ranks among the lowest in both regression tasks.
Table 3. Synthesis Condition Prediction Performance
| model | sintering temperature MAE (↓) | sintering temperature RMSE (↓) | sintering temperature R² (↑) | calcination temperature MAE (↓) | calcination temperature RMSE (↓) | calcination temperature R² (↑) |
|---|---|---|---|---|---|---|
| ensemble avg. 1 | 96.31 | 134.48 | 0.667 | 125.72 | 168.86 | 0.410 |
| ensemble avg. 2 | 96.89 | 135.42 | 0.663 | 123.00 | 166.93 | 0.424 |
| Gemini 2.0 Flash | 100.66 | 142.22 | 0.628 | 127.04 | 176.53 | 0.356 |
| Llama 4 Maverick | 102.76 | 145.23 | 0.612 | 135.85 | 180.90 | 0.323 |
| OpenAI GPT-4.1 | 105.21 | 150.01 | 0.586 | 125.92 | 174.45 | 0.371 |
| DeepSeek Chat v3 | 106.40 | 145.73 | 0.610 | 132.48 | 182.78 | 0.309 |
| Mistral Small 3.1 | 113.93 | 156.36 | 0.550 | 137.05 | 185.20 | 0.291 |
| Qwen 2.5 VL | 131.93 | 174.06 | 0.443 | 142.68 | 192.72 | 0.232 |
| Grok 3 mini Beta | 131.00 | 175.56 | 0.433 | 152.09 | 205.97 | 0.123 |
Regression performance for calcination and sintering temperature prediction.
Ensemble composition: ensemble 1 comprises Gemini 2.0 Flash, Llama 4 Maverick, and DeepSeek Chat v3; ensemble 2 comprises OpenAI GPT-4.1, Gemini 2.0 Flash, and DeepSeek Chat v3. A parity plot can be found in Figure S4. Temperature values in °C.
MAEs differ considerably between the two tasks: 101 °C for sintering temperature versus 126 °C for calcination temperature. The gap is notable because sintering temperatures are generally higher in magnitude, so the larger absolute error would be expected there. We assess the source of the elevated error in calcination temperature predictions by benchmarking both the sintering and calcination tasks against a mean-predictor baseline (Table S3). We find that the normalized standard deviation of calcination temperatures is roughly 18% higher than that of sintering temperatures (Figure S2), suggesting that calcination is a more challenging target to learn. Moreover, calcination temperature may have a stronger dependence on factors not reported in the data set, such as variations in precursor particle size, a known factor in calcination conditions. For example, Pavlović et al. report that extending ball-milling duration for BaTiO3 by 1 h reduces the required calcination temperature by over 100 °C.
Similar to the precursor recommendation task, we explore the performance of an ensemble of LMs by taking the average of the predictions of three LMs. For the sintering temperature regression, we use Llama 4 Maverick, Gemini 2.0 Flash, and DeepSeek Chat v3. This yields a distinct improvement of about 4 percentage points in R² over the best single LM (0.667 vs 0.628). For the calcination temperature regression, we add a second ensemble by exchanging Llama with GPT-4.1. Notably, this setup raises the calcination temperature R² by about 5 percentage points over the best single LM, to 0.424. Again, in close agreement with the precursor recommendation task, ensemble model configurations may reduce the inference cost by around 70% while boosting the performance.
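The two diagnostics mentioned above can be sketched as follows. The sketch assumes that "normalized standard deviation" means the standard deviation divided by the mean (coefficient of variation) and that the mean-predictor baseline always predicts the training-set mean; both are assumptions, and the temperature lists are toy values rather than data from the benchmark.

```python
from statistics import mean, pstdev
from typing import Sequence


def mean_predictor_mae(train: Sequence[float], test: Sequence[float]) -> float:
    """MAE of a baseline that always predicts the training-set mean."""
    baseline = mean(train)
    return sum(abs(t - baseline) for t in test) / len(test)


def normalized_std(values: Sequence[float]) -> float:
    """Standard deviation divided by the mean (assumed definition)."""
    return pstdev(values) / mean(values)


# Toy temperature lists in °C (hypothetical, for illustration only).
calcination = [600.0, 800.0, 900.0, 1100.0]
sintering = [1100.0, 1200.0, 1300.0, 1400.0]

print(normalized_std(calcination) > normalized_std(sintering))  # True
print(mean_predictor_mae(train=[800.0, 1000.0], test=[900.0, 1100.0]))  # 100.0
```

A target with a larger spread relative to its mean leaves more variance for a model to explain, so a higher MAE at a lower absolute temperature scale is consistent with a harder regression problem.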
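The averaging ensemble and the reported regression metrics can be illustrated with a short sketch. The function names are hypothetical and the temperatures are toy values; only the element-wise averaging of three models' predictions is taken from the text.

```python
from math import sqrt
from typing import Dict, List, Sequence


def ensemble_average(predictions_per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average temperature predictions element-wise across models."""
    return [sum(vals) / len(vals) for vals in zip(*predictions_per_model)]


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """MAE, RMSE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy sintering temperatures in °C (hypothetical values).
y_true = [800.0, 1000.0, 1200.0]
model_1 = [900.0, 1000.0, 1100.0]
model_2 = [700.0, 1100.0, 1300.0]
model_3 = [860.0, 930.0, 1230.0]

ensemble = ensemble_average([model_1, model_2, model_3])
print(ensemble)  # [820.0, 1010.0, 1210.0]
print(regression_metrics(y_true, model_1)["r2"])  # 0.75
print(regression_metrics(y_true, ensemble)["mae"] < regression_metrics(y_true, model_1)["mae"])  # True
```

Averaging helps when the individual models' errors point in different directions, as in the toy data, which is one plausible reason for the gains reported for the ensembles.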
We rationalize the underlying reason that an LM ensemble outperforms individual LMs. In materials synthesis, the mapping from processing recipes to target materials is inherently one-to-many. A single composition, such as BaTiO3, can be produced through multiple annealing protocols that vary in calcination and sintering conditions, most notably in temperature and dwell time. We generate a distribution of synthesis conditions and compare them to the literature-reported synthesis of 24 BaTiO3 samples. As shown in Figure 2, prompting individual LMs yields narrow distributions, with a single dominating mode peaked sharply around the means (orange, upper row). An LM ensemble substantially improves overlap with the ground-truth (purple). For example, in the case of calcination temperature, it correctly predicts a secondary mode below the mean while accurately reproducing the primary mode (blue, lower row). Similarly, for sintering temperature, the ensemble distribution captures the mean and additional features of the target distribution, such as a regime near 1200 °C. Most notably, for synthesis duration, individual LMs predict around the mean with a narrow spread, while the LM ensemble more accurately captures the long-tail distribution at longer processing durations. As such, LM ensembles better capture one-to-many target-synthesis relationships, which offers a key insight into why they outperform individual LMs.
Figure 2. Synthesis condition distributions of literature-reported and LM-generated solid-state synthesis recipes for BaTiO3. Literature distributions are shaded purple. Dotted lines refer to the mean value. “Single” refers to LM distributions acquired by drawing 24 samples from Gemini 2.0 Flash (orange). “Ensemble” refers to LM distributions acquired by sampling 8 predictions each from Llama 4 Maverick, DeepSeek Chat v3, and Gemini 2.0 Flash (blue). The individual LMs yield narrower distributions that fail to capture the underlying literature distribution, whereas the ensemble more accurately reproduces the literature’s secondary modes.
2.1.4. Performance vs. Cost Trade-off
To compare the overall LM performances, we normalize each model’s performance to that of the best-performing model and compute the mean normalized score. We estimate costs using input and output token rates (Figure 3). GPT-4.1 and Gemini 2.0 Flash achieve the highest average performance. However, Gemini’s substantially lower price point makes it especially attractive for synthesis planning tasks. Notably, the ensemble of lower-priced models (Llama 4 Maverick, DeepSeek Chat v3, and Gemini 2.0 Flash) outperforms any single model while reducing cost by 70% relative to the top-performing GPT-4.1. Moreover, as shown by our analysis, ensembles yield output distributions more closely aligned with the scientific literature, underscoring the joint cost and performance benefits.
Figure 3. Comparison of model performance vs. cost. We compute each model’s relative performance on precursor prediction, calcination, and sintering temperature estimation tasks, and plot the average performance relative to cost. GPT-4.1 delivers the highest individual performance and comes at the highest cost. An ensemble of Llama 4 Maverick, DeepSeek Chat v3, and Gemini 2.0 Flash surpasses any single model in performance while reducing cost by 70% relative to GPT-4.1. The Elo rating score is represented by the color of each circle and serves as a quantitative indicator of model performance across common LM tasks. Cost estimates assume an equal proportion of input and output tokens; actual costs may vary because generated text length can differ across models.
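The normalization and cost estimate described in this section can be sketched as below. Several details are assumptions rather than the authors' procedure: accuracies are normalized as value/best, errors as best/value, the choice of Top-5 accuracy and MAE as inputs is illustrative, and the token prices in the usage example are placeholders. The metric values are copied from the two benchmark tables above.

```python
from typing import Dict, List, Tuple


def normalized_score(values: Dict[str, float], higher_is_better: bool) -> Dict[str, float]:
    """Scale each model's metric so that the best model scores 1.0."""
    if higher_is_better:
        best = max(values.values())
        return {m: v / best for m, v in values.items()}
    best = min(values.values())
    return {m: best / v for m, v in values.items()}


def mean_normalized_score(tables: List[Tuple[Dict[str, float], bool]]) -> Dict[str, float]:
    """Average the normalized scores over several (values, higher_is_better) tables."""
    scores = [normalized_score(values, hib) for values, hib in tables]
    models = tables[0][0].keys()
    return {m: sum(s[m] for s in scores) / len(scores) for m in models}


def cost_per_query(price_in: float, price_out: float, tokens_in: int, tokens_out: int) -> float:
    """Cost in USD, given prices per million input and output tokens."""
    return (price_in * tokens_in + price_out * tokens_out) / 1_000_000


top5 = {"GPT-4.1": 66.1, "Gemini 2.0 Flash": 62.0}               # accuracy, %
sinter_mae = {"GPT-4.1": 105.21, "Gemini 2.0 Flash": 100.66}     # °C
calc_mae = {"GPT-4.1": 125.92, "Gemini 2.0 Flash": 127.04}       # °C

overall = mean_normalized_score([(top5, True), (sinter_mae, False), (calc_mae, False)])
print(overall)

# Placeholder prices (USD per million tokens), equal input and output token counts.
print(cost_per_query(price_in=2.0, price_out=8.0, tokens_in=1000, tokens_out=1000))
```

With equal input and output token counts, the relative cost of two models reduces to the ratio of the sums of their input and output prices, which is the simplification the figure caption describes.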
2.2. Synthetic Data Augmentation Improves Model Performance
Our previous investigation demonstrated that LMs achieve strong performance in recalling already published synthesis parameters for solid-state synthesis. For LMs, true out-of-distribution evaluation on the two synthesis tasks cannot be assumed, as the training data have been publicly available for years or even decades and are therefore likely included in their training corpora. However, this result opens up the potential of leveraging LMs to generate synthetic data sets over synthesis parameter distributions to augment current size-limited experimental data sets. We evaluate the impact of LM-augmented data sets on state-of-the-art approaches in synthesis-condition prediction. This approach mirrors NLP strategies that augment scarce domain-specific data with LM-generated corpora (e.g., Xu et al.) or employ teacher–student pseudolabeling frameworks. The rationale behind our methodology is to exploit the prior of LM-learned estimates on synthesis conditions to warm-start our smaller expert models. We start by learning the LM estimates, which help our models learn the underlying broader trends, before we continue with training on experimental data.
Incorporating this workflow, we propose SyntMTE, a composition-based architecture derived from MTEncoder, a transformer model for representing inorganic materials, pretrained on the large Alexandria DFT database. Pretraining on large DFT corpora improves learned representations and downstream performance across materials-science tasks. As in NLP and computer vision, the pretraining objective need not be perfectly aligned with the final task; broad, physics-grounded supervision can still shape a model’s internal representation of chemical space in a way that transfers effectively. We therefore exploit the scale and coverage of publicly available DFT data sets, spanning millions of computed properties, to pretrain MTEncoder and then fine-tune it on the synthesis-related task, yielding consistent gains over models trained from scratch.
Our approach for modeling solid-state reactions extends previous work by encoding both the reaction products and all associated precursors into embedding vectors. After embedding each material involved, we mean-pool a reaction representation and predict process parameters via multitask regression (Figure 5, right). We benchmark our approach on the Kononova data set using a time-based split, with entries before 2015 used for training, entries from 2015–2016 for validation, and later entries for testing. We introduce three baselines: a compositional feedforward network, a CrabNet-based transformer, and an XGBoost regressor with mean-pooled reaction features. The models are compared on three training regimes: a two-stage fine-tuning on synthetic and subsequently literature data, direct fine-tuning on literature data only, and training exclusively on synthetic data.
Figure 5. Overview of our synthesis-condition modeling. (Left) We first adapt the MTEncoder on a large LM-generated data set to bias it toward solid-state reaction conditions, then fine-tune on experimental literature recipes. (Right) Each precursor and the target composition are encoded by the shared MTEncoder (θMTE) into embeddings, pooled and concatenated, and passed through an MLP head (θRegressor) to predict calcination and sintering temperatures.
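A time-based split of this kind can be sketched in a few lines. The record layout (a `year` field per recipe) and the exact boundary handling are assumptions: here training uses entries before the first validation year, validation covers 2015–2016 inclusive, and testing uses everything later.

```python
from typing import Dict, List, Tuple

Recipe = Dict[str, object]


def time_split(
    records: List[Recipe],
    val_start: int = 2015,
    val_end: int = 2016,
) -> Tuple[List[Recipe], List[Recipe], List[Recipe]]:
    """Split recipes by publication year into train, validation, and test sets."""
    train = [r for r in records if r["year"] < val_start]
    val = [r for r in records if val_start <= r["year"] <= val_end]
    test = [r for r in records if r["year"] > val_end]
    return train, val, test


# Hypothetical records; only the "year" field is used by the split.
records = [
    {"target": "BaTiO3", "year": 2010},
    {"target": "LiCoO2", "year": 2015},
    {"target": "Li7La3Zr2O12", "year": 2016},
    {"target": "BaTiO3", "year": 2018},
]

train, val, test = time_split(records)
print(len(train), len(val), len(test))  # 1 2 1
```

Splitting by year rather than at random keeps later recipes out of training, which gives a stricter estimate of how a model handles materials reported after its training data ends.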
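The pooling step in this architecture can be illustrated with plain Python lists standing in for MTEncoder embeddings. This is one possible reading of "pooled and concatenated" (mean-pool the precursor embeddings, then concatenate the target embedding); the linear head is a stand-in for the MLP regressor, and all weights and vectors are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean over a set of embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reaction_representation(precursors: Sequence[Vector], target: Vector) -> List[float]:
    """Mean-pooled precursor embedding concatenated with the target embedding."""
    return mean_pool(precursors) + list(target)


def linear_head(features: Vector, weights: Sequence[Vector], biases: Vector) -> List[float]:
    """Multitask output: one weight row per predicted temperature."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical 2-dimensional embeddings for two precursors and one target.
precursor_embeddings = [[1.0, 2.0], [3.0, 4.0]]
target_embedding = [5.0, 6.0]

features = reaction_representation(precursor_embeddings, target_embedding)
print(features)  # [2.0, 3.0, 5.0, 6.0]

# Two outputs, e.g. calcination and sintering temperature (toy weights).
outputs = linear_head(features, weights=[[1, 0, 0, 0], [0, 0, 0, 1]], biases=[10.0, 20.0])
print(outputs)  # [12.0, 26.0]
```

Mean pooling makes the representation independent of the number and order of precursors, and sharing one feature vector across both outputs is what lets the two temperature tasks inform each other.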
When trained exclusively on synthetic data, all models exhibit good performance despite having no exposure to literature-mined data and being trained only on recipes that do not overlap with the literature-based test set. Notably, the R² value, which measures the proportion of variance in the target variable explained by the model, remains high across all models except XGBoost. We hypothesize that XGBoost overfits the synthetic data, while also learning sintering and calcination separately, forfeiting the shared inductive bias of joint training, leading to lower performance, especially for calcination temperature prediction.
Finally, we evaluate the performance under both fine-tuning regimes. Across all models, except XGBoost, augmenting the training set with synthetic data consistently enhances performance, as evidenced by the relative MAE improvements in Table 4. This effect is most pronounced for SyntMTE, which achieves an MAE improvement exceeding 4% across both regression tasks. CrabNet attains a 3.8% improvement. The compositional feed-forward network likewise realizes improvements of 1%, surpassing the performance of the experimental data-trained SyntMTE model.
Table 4. Model Comparison
| model | synth. data | literature data | sintering temperature MAE ↓ | sintering temperature RMSE ↓ | sintering temperature R² ↑ | calcination temperature MAE ↓ | calcination temperature RMSE ↓ | calcination temperature R² ↑ | rel. MAE imp. (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SyntMTE | √ | √ | 135.00 (0.84) | 181.30 (0.71) | 0.545 (0.004) | 153.72 (0.54) | 199.71 (0.48) | 0.436 (0.003) | 4.08 |
| SyntMTE |  | √ | 141.00 (2.13) | 189.27 (2.46) | 0.504 (0.013) | 160.00 (2.97) | 206.56 (2.53) | 0.395 (0.015) | 0.00 |
| SyntMTE | √ |  | 149.43 (3.53) | 197.78 (1.47) | 0.428 (0.014) | 169.63 (2.46) | 214.64 (1.87) | 0.358 (0.019) | -6.00 |
| CrabNet | √ | √ | 148.03 (1.00) | 196.88 (1.60) | 0.464 (0.009) | 159.67 (0.89) | 206.50 (1.13) | 0.397 (0.007) | 3.77 |
| CrabNet |  | √ | 152.87 (6.59) | 205.37 (8.20) | 0.416 (0.047) | 166.88 (3.13) | 215.80 (0.46) | 0.340 (0.025) | 0.00 |
| CrabNet | √ |  | 160.41 (2.12) | 199.83 (0.69) | 0.402 (0.016) | 172.54 (4.07) | 216.66 (1.47) | 0.329 (0.036) | -4.13 |
| Composition + NN | √ | √ | 149.75 (0.87) | 191.68 (0.78) | 0.492 (0.004) | 162.82 (0.41) | 208.44 (0.50) | 0.385 (0.003) | 0.96 |
| Composition + NN |  | √ | 150.23 (2.43) | 193.82 (3.21) | 0.480 (0.017) | 165.38 (3.41) | 211.54 (3.74) | 0.366 (0.022) | 0.00 |
| Composition + NN | √ |  | 170.58 (2.74) | 203.32 (0.06) | 0.339 (0.016) | 176.87 (3.56) | 219.88 (0.00) | 0.315 (0.017) | -10.09 |
| Composition + XGBoost | √ | √ | 163.62 (0.59) | 210.47 (0.73) | 0.387 (0.004) | 179.54 (0.84) | 228.00 (1.07) | 0.263 (0.007) | -13.23 |
| Composition + XGBoost |  | √ | 141.12 (0.90) | 189.14 (1.45) | 0.505 (0.008) | 161.96 (0.87) | 206.65 (1.05) | 0.395 (0.006) | 0.00 |
| Composition + XGBoost | √ |  | 196.03 (1.90) | 242.25 (1.83) | 0.188 (0.012) | 225.74 (3.49) | 276.76 (3.48) | 0.086 (0.027) | -39.16 |
Comparison of embedding methods on different data regimes for sintering and calcination temperatures. We report the mean across five runs with standard deviation in parentheses. √ indicates training on the respective data source. Parity plots are presented in Figure S5. Temperature values in °C.
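The relative MAE improvement column is not defined explicitly in the text. One reading that reproduces the tabulated SyntMTE values is the relative change in the MAE averaged over both temperature tasks, measured against the literature-only model; the sketch below checks this reading and should be taken as an inference from the table, not as the authors' stated formula.

```python
from typing import Sequence


def relative_mae_improvement(baseline_maes: Sequence[float], model_maes: Sequence[float]) -> float:
    """Percent reduction of the task-averaged MAE relative to the baseline."""
    baseline = sum(baseline_maes) / len(baseline_maes)
    model = sum(model_maes) / len(model_maes)
    return 100.0 * (baseline - model) / baseline


# SyntMTE MAEs in °C (sintering, calcination), copied from the model comparison table.
literature_only = (141.00, 160.00)
augmented = (135.00, 153.72)
synthetic_only = (149.43, 169.63)

print(round(relative_mae_improvement(literature_only, augmented), 2))       # 4.08
print(round(relative_mae_improvement(literature_only, synthetic_only), 2))  # -6.0
```

Under this reading, a positive value means the model has a lower average MAE than the literature-only baseline of the same architecture, and each literature-only row is 0.00 by construction.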
Our experiments show that the representation-learning models SyntMTE and CrabNet benefit most from data set augmentation. This can be seen in Figure S5, where we show the parity plots of two SyntMTE models, the literature only, and the augmented model. Overall, the comparison of models trained on literature-mined versus synthetic data reveals a significant opportunity to leverage synthetic data sets in synthesis modeling. Even training exclusively on large, LM-generated synthetic data sets can achieve good performance while eliminating the need for laborious manual data extraction. We also observe that the DFT-based pretraining objective of SyntMTE substantially improves performance when only limited literature data is available.
When the scores are compared to those of the best ensembles in the LM benchmark (see Table 3), the expert models exhibit lower performance across all regression tasks. Since LM benchmark scores may be inflated by data leakage, we argue that these results cannot be directly compared to the year-split-based predictions of SyntMTE. Nevertheless, our results highlight the promising recent development of LM capabilities.
2.3. Application to Processing of LLZO Electrolyte Materials
Beyond assessing conventional performance-related material properties, virtual screening of compound-specific processing temperatures and durations offers a quantitative proxy for estimating manufacturing costs. As a case study, we consider the processing of solid-state electrolytes, which replace conventional liquid electrolytes in next-generation hybrid and future solid-state Li-ion batteries. Their key performance metrics are ionic conductivity and the electrochemical stability window. Although oxide-based electrolytes typically outperform alternatives in those metrics, they require densification through sintering at elevated temperatures when processed as free-standing electrolytes (tape or pellet). One of the most promising candidates among solid-state electrolytes is the garnet-type solid electrolyte Li7La3Zr2O12 (LLZO), which exhibits conductivities on the order of 1 × 10⁻³ S cm⁻¹ at room temperature. However, widespread integration into next-generation battery architectures is still hindered by high processing costs, which originate from precursor selection and the sintering protocols required to fabricate electrolyte tapes. Densification of cubic LLZO typically demands sintering at temperatures above 1050 °C for several hours, together with the incorporation of extrinsic phase-stabilizing dopants. Consequently, one of the most pressing challenges is to reduce the sintering temperature, a common requirement for high-value functional ceramics. Studies have demonstrated that aliovalent doping at the Li (A), La (B), and Zr (C) sites can lower the sintering temperature while stabilizing the desired cubic phase (Figure 6a).
Figure 6. (a) Probable doping sites in the cubic LLZO unit cell. Reproduced from Mahbub et al. Available under a CC BY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/). Copyright Mahbub et al. (b) True (blue circles) vs. predicted (orange squares) sintering temperatures with mean and standard deviation across different case reports per cation of the garnet electrolyte. Dopants are grouped by crystallographic substitution site.
This interesting case provides an opportunity to evaluate the resolution of our method for predicting solid-state sintering temperatures. We curate a data set comprising 40 reported solid-state synthesis routes of doped LLZO variants in order to test if SyntMTE recovers compounds synthesizable at lower sintering temperatures. To test the extrapolation capabilities of our model, we withheld any LLZO-based compounds from the training sets and predicted the sintering temperatures of various doped LLZO compositions. Because literature-reported distributions are difficult to reconstruct precisely and several viable sintering temperatures exist per dopant family, we focus our analysis on the qualitative ordering of the compounds rather than on their exact values.
Among the C-site dopants investigated, tantalum is a well-studied yet comparatively moderate densification aid (Figure 6a). Supervalent substitution of Ta5+ for Zr4+ is charge-balanced by lithium vacancies (V_Li), a defect chemistry that accelerates lattice diffusion and stabilizes the cubic garnet framework. Experimentally, Ta-doped LLZO pellets are sintered for a few hours at 1100–1150 °C, about 100 °C lower than the temperature needed for undoped LLZO and intermediate between the behaviors of W- and Bi-doped compositions. Our model reproduces the literature synthesis temperature window: the predicted mean closely matches reported values, and the standard-deviation band spans approximately 1060 to 1200 °C.
Substitution of Bi3+ on Zr4+ sites induces significant lattice expansion owing to the larger ionic radius of Bi3+ and yields higher lithium concentrations in the lattice. The resulting larger migration channels enhance ionic conductivity, and a pronounced reduction in sintering temperature has been reported for this single-site dopant. Our model predicts a narrow sintering window for Bi of 880–980 °C, slightly overestimating the minimum yet accurately capturing the sharp decline to approximately 900 °C observed experimentally. For A- and B-site dopants, we observe that Al spans a broad sintering window owing to its small ionic radius and mixed-site occupancy, which promote gradual lattice relaxation and progressive vacancy clustering, thus smoothing the onset temperature. In contrast, Ga3+, whose ionic radius is moderately larger than that of the native lattice cations, yields a narrower sintering profile. Our model’s prediction for Gd exhibits the largest deviation, being overly optimistic, whereas that for Fe aligns well with experimental values.
Overall, the results show that SyntMTE, despite receiving no prior training on LLZO, reproduces the key sintering temperature trends, underscoring the potential of synthesis planning models to guide future compound selection via the virtual screening of processing temperatures.
3. Discussion
Using LMs to propose synthesis parameters shows promise. However, our benchmarks center on solid-state synthesis, one of the most common methods in inorganic chemistry. This likely resulted in prior exposure of LMs to similar or the same synthesis routes we employ during benchmarking. Assessing the true extrapolative abilities remains challenging because genuinely novel routes are published infrequently, and the composition of LM training corpora is opaque. Ultimately, rather than using LMs as the final predictor, we argue that domain-specific models remain highly valuable. They can be fitted to a laboratory’s experimental outcomes, updated as new campaigns conclude, and deployed efficiently thanks to their smaller parameter counts. Our findings further indicate that LMs function most effectively as priors for smaller models. Via lightweight prompting, they can generate large data sets, which can be used to align domain-specific models while keeping computational costs low. Our generated data set should be interpreted in this light and applied with caution. Its validity is expected to be comparable to or below the benchmarked performance on precursor selection and synthesis-condition prediction. We constructed this data set solely to expose compact models to the LM-derived priors over synthesis parameters across the chemical space and did not intend it to represent true ground-truth protocols or strictly imply realizable synthesis routes. Furthermore, we did not evaluate LM fine-tuning in this study. Nevertheless, state-of-the-art LMs already generate useful precursor candidates and conditions when conditioned via several in-context examples, without task-specific parameter adaptation, offering a practical entry point. 
The outlined caveats, especially the potential data leakage, warrant caution when interpreting accuracy. They motivate future work on targeted adaptation, systematic evaluation of domain coverage and uncertainty, and the comparison of large prompt-tuned versus smaller fine-tuned models.
4. Conclusions
Existing methods in ML-based material synthesis planning remain limited by the available training data. We demonstrate that current LMs can help overcome this shortcoming. We benchmark seven state-of-the-art models on two standard tasks: precursor recommendation and synthesis condition prediction. Models such as GPT-4.1, Gemini 2.0 Flash, and Llama 4 Maverick achieve Top-1 exact-match accuracies for precursor prediction above 50%, rising to approximately 66% for their top five suggestions. We find that ensembles of LMs enhance performance further by more accurately capturing the synthesis window, that is, the multitude of processing conditions that enable the synthesis of the same target compound. In contrast, individual LMs typically yield narrower, unimodal distributions, which reflect the synthesis window less accurately. Additionally, ensembles can reduce inference costs by as much as 70%. We then employ LMs to distill materials-related knowledge into a synthetic solid-state synthesis data set containing 28,548 complete recipes. To quantify the utility of this synthetic augmentation, we develop SyntMTE, a transformer fine-tuned in two steps, first on the synthetic LM data and second on literature-extracted data. SyntMTE outperforms existing baselines, including the state-of-the-art CrabNet. Compared with training on experimental data alone, our two-step training reduces the MAE for both sintering and calcination temperatures by approximately 6 °C. In a case study, we apply our framework to doped variants of LLZO, a solid-state electrolyte whose scalability is limited by an energy-intensive sintering process. Without LLZO-specific training, our model reproduces the broad sintering windows and captures the processing effects of doping, including the substantial reduction in sintering temperature achieved by Bi substitution. These results demonstrate the model’s potential to identify low-temperature processing routes, for example, during the screening of novel compounds.
Collectively, our study confirms that language-model-based methods can generate high-quality, cost- and time-efficient auxiliary data on readily reported parameters and phenomena throughout the inorganic materials synthesis literature. This capability is critical as data remain scarce across the domain. Ultimately, such models may be used to inform Bayesian optimization and guide autonomous experimentation, thereby accelerating the discovery and scalable production of advanced materials.
5. Methods
5.1. Data Set Preparation
We use the inorganic synthesis data set curated by Kononova et al., which contains 33,343 text-mined solid-state synthesis recipes extracted from the scientific literature. After filtering for elemental consistency and removing ambiguous entries, 18,804 reactions remain, of which 9,255 are unique target–precursor pairs. For the precursor recommendation task, we randomly sample 1,000 entries for our LM evaluation. For synthesis condition prediction, we filter the data set for entries containing complete calcination and sintering temperature information, yielding another 1,000-entry subset. Both test sets are held out from the in-context examples. For the experiments on SyntMTE, we split the data chronologically: the training data include reactions reported up to 2014, the validation data span 2015–2016, and the test data cover later entries.
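The chronological split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout and the `year` field name are assumptions, and the example records are placeholders rather than entries from the data set.

```python
from typing import Dict, List


def chronological_split(records: List[dict],
                        train_until: int = 2014,
                        val_until: int = 2016) -> Dict[str, List[dict]]:
    """Assign each record to train (<= 2014), val (2015-2016), or test (later)."""
    splits: Dict[str, List[dict]] = {"train": [], "val": [], "test": []}
    for rec in records:
        year = rec["year"]  # hypothetical field holding the publication year
        if year <= train_until:
            splits["train"].append(rec)
        elif year <= val_until:
            splits["val"].append(rec)
        else:
            splits["test"].append(rec)
    return splits


# Placeholder records for illustration only.
records = [
    {"target": "BaTiO3", "year": 2008},
    {"target": "LiFePO4", "year": 2015},
    {"target": "Li7La3Zr2O12", "year": 2018},
]
splits = chronological_split(records)
```

Splitting by year rather than at random keeps later-published reactions out of training, which is what makes the test set a forward-in-time evaluation.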
5.2. Language Model Evaluation
We evaluate seven state-of-the-art LMs via the OpenRouter API: OpenAI GPT-4.1, Google Gemini 2.0 Flash, Meta Llama 4 Maverick, xAI Grok 3 mini Beta, DeepSeek Chat v3, Alibaba Qwen 2.5 VL, and Mistral Small 3.1. Given logits $z_i$ over the vocabulary at step $t$, the next-token distribution is the temperature-scaled softmax
$$p_\tau(i) = \frac{\exp(z_i/\tau)}{\sum_{j} \exp(z_j/\tau)} \tag{1}$$
where τ → 0 approaches greedy decoding and τ = 1 recovers the model’s native distribution. For our experiments, we use τ = 0.1 to produce near-deterministic outputs while retaining minimal stochasticity for parser retries. Each task uses 40 in-context examples sampled from the validation set. Prompts, which are available in the repository and Supporting Information, explicitly specify the required output format and include comprehensive chemical validity constraints. We employ a parser-aware retry policy, where up to three attempts per query are made, and queries still failing are marked as parsing failures and counted as false predictions.
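As a small worked illustration of the temperature-scaled softmax described above, the following sketch compares τ = 1 with the τ = 0.1 setting used in the experiments. The logit values are arbitrary placeholders, and the function name is hypothetical.

```python
import math
from typing import List


def temperature_softmax(logits: List[float], tau: float) -> List[float]:
    """Softmax over logits divided by the temperature tau (tau > 0)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # placeholder logits, not model outputs
p_native = temperature_softmax(logits, tau=1.0)  # native distribution
p_sharp = temperature_softmax(logits, tau=0.1)   # near-greedy distribution
```

With these placeholder logits, the top token holds roughly 63% of the probability mass at τ = 1 and more than 99.9% at τ = 0.1, which is why a low temperature gives near-deterministic outputs while still allowing a retry to produce a different completion.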
5.2.1. Precursor Recommendation Task
To ensure a consistent comparison, all generated precursor sets are normalized to canonical chemical formulas using pymatgen. Precursor sets are treated as order-invariant, and duplicates are removed across the 20 generated suggestions per target. Generated precursor sets are then compared against ground-truth precursors. We evaluate performance using a Top-k exact-match metric for k ∈ {1, 3, 5, 10}. Let $N$ be the number of targets. For target $i$, let the ground-truth precursor set be $G_i$ and the model’s ranked suggestions be $(S_{i,1}, \ldots, S_{i,K})$. All sets are canonicalized as described above. The Top-k exact-match accuracy measures the fraction of targets for which the ground-truth precursor set appears among the top $k \le K$ predictions
$$\mathrm{Top}\text{-}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ G_i \in \{ S_{i,1}, \ldots, S_{i,k} \} \right] \tag{2}$$
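The Top-k exact-match metric described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: formula canonicalization is reduced to whitespace stripping (the paper uses pymatgen for this step), and the function names and example precursor sets are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def canonical(precursors: Sequence[str]) -> FrozenSet[str]:
    """Order-invariant precursor set; stripping stands in for formula canonicalization."""
    return frozenset(p.strip() for p in precursors)


def top_k_accuracy(ground_truth: List[Sequence[str]],
                   suggestions: List[List[Sequence[str]]],
                   k: int) -> float:
    """Fraction of targets whose true precursor set is among the top-k unique suggestions."""
    if not ground_truth:
        raise ValueError("at least one target is required")
    hits = 0
    for truth, ranked in zip(ground_truth, suggestions):
        seen, unique = set(), []
        for s in ranked:  # remove duplicates while preserving rank order
            c = canonical(s)
            if c not in seen:
                seen.add(c)
                unique.append(c)
        if canonical(truth) in unique[:k]:
            hits += 1
    return hits / len(ground_truth)


# Illustrative inputs only.
ground_truth = [["BaCO3", "TiO2"], ["Li2CO3", "Fe2O3"]]
suggestions = [
    [["TiO2", "BaCO3"], ["BaO", "TiO2"]],
    [["Li2O", "Fe2O3"], ["Fe2O3", "Li2O"], ["Fe2O3", "Li2CO3"]],
]
top1 = top_k_accuracy(ground_truth, suggestions, k=1)
top2 = top_k_accuracy(ground_truth, suggestions, k=2)
```

In this example the second target's correct set moves from position three to rank two once the duplicate suggestion is removed, so deduplication before truncation affects the score.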
5.2.2. Synthesis Condition Regression Task
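For temperature prediction, we prompt models to output calcination and sintering temperatures in °C with 40 in-context examples. Outputs are returned as structured JSON and parsed via regex; parse failures are retried up to three times, and residual failures are logged. We report MAE, RMSE, and R². Formally, let $y_i$ denote the ground-truth temperature and $\hat{y}_i$ the model prediction for example $i$, and let $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$, with $N$ evaluated examples.
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{3}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} \tag{4}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} \tag{5}$$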
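For concreteness, the three reported metrics can be computed as in the sketch below, using the standard definitions of MAE, RMSE, and R². The temperatures are placeholder values chosen for easy arithmetic, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Tuple[float, float, float]:
    """Return (MAE, RMSE, R^2) for paired ground-truth and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Placeholder temperatures in degrees Celsius.
y_true = [900.0, 1000.0, 1100.0, 1200.0]
y_pred = [950.0, 1000.0, 1050.0, 1250.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

Here the absolute errors are 50, 0, 50, and 50 °C, giving an MAE of 37.5 °C, an RMSE of about 43.3 °C, and an R² of 0.85.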
5.2.3. Ensemble Methods
We construct ensembles using the models’ predictions across the same 1000-sample data sets introduced for the individual models. The ensemble approach leverages the diversity of different model architectures and training corpora to improve the overall performance and calibration. For the precursor recommendation, we aggregate ranked lists using rank fusion. Let $r_m(i)$ denote the rank of candidate $i$ assigned by model $m$, and let $M$ be the number of models in the ensemble. We evaluate three fusion methods: (i) min-rank $S(i) = \min_m r_m(i)$, which promotes items that any model ranks highly; (ii) average-rank $S(i) = \frac{1}{M}\sum_{m=1}^{M} r_m(i)$, which balances contributions from all models; and (iii) max-rank $S(i) = \max_m r_m(i)$, which ensures only items consistently favored by every model appear at the top. For regression tasks, we aggregate temperature predictions by taking the mean across ensemble members.
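The three rank-fusion strategies can be illustrated with a short sketch. Two choices below are assumptions not stated in the text: a candidate missing from a model's list is assigned rank (list length + 1), and ties in the fused score are broken alphabetically. Candidate labels are placeholders.

```python
from typing import Dict, List


def fuse_ranks(rankings: List[List[str]], how: str = "mean") -> List[str]:
    """Fuse ranked candidate lists from several models; lower fused score ranks first."""
    aggregators = {
        "min": min,                          # promoted if any model ranks it highly
        "mean": lambda r: sum(r) / len(r),   # balances all models
        "max": max,                          # promoted only if every model ranks it highly
    }
    aggregate = aggregators[how]
    candidates = sorted({c for ranking in rankings for c in ranking})

    def rank_of(ranking: List[str], candidate: str) -> int:
        # Assumption: unlisted candidates receive the rank just past the list end.
        return ranking.index(candidate) + 1 if candidate in ranking else len(ranking) + 1

    scores: Dict[str, float] = {
        c: aggregate([rank_of(r, c) for r in rankings]) for c in candidates
    }
    return sorted(candidates, key=lambda c: (scores[c], c))  # alphabetical tie-break


# Three hypothetical models ranking three placeholder candidates.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
fused_min = fuse_ranks(rankings, "min")
fused_mean = fuse_ranks(rankings, "mean")
fused_max = fuse_ranks(rankings, "max")
```

In this example candidate A is ranked first by one model and last by another, so min-rank keeps it at the top while average-rank and max-rank place B first.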
5.3. Synthetic Data Generation
To assemble a diverse data set, we query the Materials Project, yielding 48,927 lab-synthesized compounds. We apply maximum-entropy sampling to select 10,000 target compositions, maximizing the estimated entropy of the selected set under featurization via MTEncoder representations. First, we prompt GPT-4.1 to flag and remove materials not synthesized via solid-state methods. Next, we generate precursor sets for each remaining target composition. In line with previous findings, we preserve the top three predictions per material, given the model’s robust Top-3 accuracy of 64.1% (Table ). We then predict synthesis parameters and exclude generated routes found to be incomplete, producing 29,473 entries. Applying minimum temperature thresholds of 300 °C for calcination and 500 °C for sintering yields 28,548 plausible solid-state recipes. Figure a shows the enhanced compositional diversity of the generated data set when compared with the literature-mined data set by Kononova et al.
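The final filtering step (dropping incomplete routes and applying the minimum temperature thresholds) might look like the sketch below. The field names are hypothetical, the example recipes are placeholders, and treating the thresholds as inclusive is an assumption.

```python
from typing import List, Optional


def is_plausible(calcination_c: Optional[float],
                 sintering_c: Optional[float],
                 min_calcination_c: float = 300.0,
                 min_sintering_c: float = 500.0) -> bool:
    """Keep a recipe only if both temperatures exist and meet the minimum thresholds."""
    if calcination_c is None or sintering_c is None:
        return False  # incomplete route
    # Assumption: thresholds are inclusive lower bounds.
    return calcination_c >= min_calcination_c and sintering_c >= min_sintering_c


# Placeholder generated recipes; keys are hypothetical.
generated: List[dict] = [
    {"target": "X1", "calcination_c": 800.0, "sintering_c": 1200.0},
    {"target": "X2", "calcination_c": 250.0, "sintering_c": 900.0},   # calcination too low
    {"target": "X3", "calcination_c": 700.0, "sintering_c": None},    # incomplete
]
kept = [r for r in generated
        if is_plausible(r["calcination_c"], r["sintering_c"])]
```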
5.4. SyntMTE Model Architecture and Training
SyntMTE is a transformer-based model derived from the MTEncoder framework, pretrained on the Alexandria DFT database across 12 materials properties (Table S5). It encodes each reaction by processing the target composition and all precursor materials with shared MTEncoder weights; the resulting embeddings are mean-pooled and concatenated, then passed to a two-layer MLP head for multitask regression of calcination and sintering temperatures. We fine-tune all weights. Training uses Adam (learning rate 4.39 × 10⁻⁵) with L1 loss, a batch size of 25, and 200 epochs; the encoder hidden dimension is 512. Experiments were run five times each on two NVIDIA RTX A6000 GPUs.
5.5. LLZO Case Study Methodology
We study processing temperatures for LLZO garnet solid electrolytes. Reference LLZO literature is drawn from the corpus compiled by Mahbub et al. To ensure strict extrapolation, we exclude from training and validation any record whose target mentions LLZO or a commonly doped variant (Al, Ga, Ta, Nb, and W). Calcination and sintering temperatures were extracted using OpenAI’s o3 model, followed by manual spot checks. The SyntMTE model used for evaluation was fine-tuned sequentially on our synthetic recipes and then on the literature-mined corpus, after which it was applied to the mined LLZO data set. In Figure 6b, error bars aggregate across distinct literature routes and precursor choices; they reflect the across-route variability and do not represent model uncertainty.
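The across-route aggregation behind such error bars can be sketched as a per-dopant mean and standard deviation. The temperatures below are placeholders and not values from the LLZO data set, and using the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def aggregate_by_dopant(routes: List[Tuple[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Group sintering temperatures by dopant and return (mean, std) per dopant."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for dopant, temperature_c in routes:
        grouped[dopant].append(temperature_c)
    # Assumption: population standard deviation across literature routes.
    return {d: (mean(t), pstdev(t)) for d, t in grouped.items()}


# Placeholder (dopant, sintering temperature in degrees Celsius) pairs.
routes = [("Ta", 1100.0), ("Ta", 1150.0), ("Bi", 900.0)]
stats = aggregate_by_dopant(routes)
```

A dopant with a single reported route gets a standard deviation of zero here, which is one reason such bars describe the spread of reported routes rather than model confidence.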
Supplementary Material
The source code underlying this project is available at the GitHub repository https://github.com/Thorben010/llm_synthesis.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsami.5c11229.
Extended methods and data set curation; prompting templates and in-context examples; evaluation protocols; additional benchmarking tables and ablations; and prompt listings and reproducibility notes (PDF)
The authors declare no competing financial interest.
This paper was published ASAP on November 26, 2025, with an older version of the Supporting Information. The corrected version was reposted on December 10, 2025.
Published as part of ACS Applied Materials & Interfaces special issue “Machine Learning for Materials Chemistry”.
References
- Zhou H., Chen Q., Li G., Luo S., Song T.-b., Duan H.-S., Hong Z., You J., Liu Y., Yang Y. Interface engineering of highly efficient perovskite solar cells. Science. 2014;345:542–546. doi: 10.1126/science.1254050.
- de Jong M., Chen W., Angsten T., Jain A., Notestine R., Gamst A., Sluiter M., Krishna Ande C., van der Zwaag S., Plata J. J. Charting the complete elastic properties of inorganic crystalline compounds. Sci. Data. 2015;2:150009. doi: 10.1038/sdata.2015.9.
- Whitesides G. M. The 'right' size in nanobiotechnology. Nat. Biotechnol. 2003;21:1161–1165. doi: 10.1038/nbt872.
- Goodenough J. B., Kim Y. Challenges for rechargeable batteries. J. Power Sources. 2011;196:6688–6694. doi: 10.1016/j.jpowsour.2010.11.074.
- Cui Y., Zhong Z., Wang D., Wang W. U., Lieber C. M. High performance silicon nanowire field effect transistors. Nano Lett. 2003;3:149–152. doi: 10.1021/nl025875l.
- Jain A., Ong S. P., Hautier G., Chen W., Richards W. D., Dacek S., Cholia S., Gunter D., Skinner D., Ceder G. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 2013;1:011002. doi: 10.1063/1.4812323.
- Schmidt J., Cerqueira T. F., Romero A. H., Loew A., Jäger F., Wang H.-C., Botti S., Marques M. A. Improving machine-learning models in materials science through large datasets. Mater. Today Phys. 2024;48:101560. doi: 10.1016/j.mtphys.2024.101560.
- McDermott M. J., McBride B. C., Regier C. E., Tran G. T., Chen Y., Corrao A. A., Gallant M. C., Kamm G. E., Bartel C. J., Chapman K. W., et al. Assessing thermodynamic selectivity of solid-state reactions for the predictive synthesis of inorganic materials. ACS Cent. Sci. 2023;9:1957–1975. doi: 10.1021/acscentsci.3c01051.
- Karpovich C., Pan E., Jensen Z., Olivetti E. Interpretable machine learning enabled inorganic reaction classification and synthesis condition prediction. Chem. Mater. 2023;35:1062–1079. doi: 10.1021/acs.chemmater.2c03010.
- Abed J., Kim J., Shuaibi M., Wander B., Duijf B., Mahesh S., Lee H., Gharakhanyan V., Hoogland S., Irtem E. Open Catalyst Experiments 2024 (OCx24): Bridging Experiments and Computational Models. arXiv. 2024:arXiv:2411.11783. doi: 10.48550/arXiv.2411.11783.
- Malik S. A., Goodall R. E., Lee A. A. Predicting the outcomes of material syntheses with deep learning. Chem. Mater. 2021;33:616–624. doi: 10.1021/acs.chemmater.0c03885.
- Szymanski N. J., Rendy B., Fei Y., Kumar R. E., He T., Milsted D., McDermott M. J., Gallant M., Cubuk E. D., Merchant A., et al. An autonomous laboratory for the accelerated synthesis of novel materials. Nature. 2023;624:86–91. doi: 10.1038/s41586-023-06734-w.
- Szczypiński F. T., Bennett S., Jelfs K. E. Can we predict materials that can be synthesised? Chem. Sci. 2021;12:830–840. doi: 10.1039/D0SC04321D.
- Woods-Robinson R., Stevanović V., Lany S., Heinselman K. N., Horton M. K., Persson K. A., Zakutayev A. Role of disorder in the synthesis of metastable zinc zirconium nitrides. Phys. Rev. Mater. 2022;6:043804. doi: 10.1103/PhysRevMaterials.6.043804.
- Chen J., Cross S. R., Miara L. J., Cho J.-J., Wang Y., Sun W. Navigating phase diagram complexity to guide robotic inorganic materials synthesis. Nat. Synth. 2024;3:606–614. doi: 10.1038/s44160-024-00502-y.
- Kononova O., Huo H., He T., Rong Z., Botari T., Sun W., Tshitoyan V., Ceder G. Text-mined dataset of inorganic materials synthesis recipes. Sci. Data. 2019;6:203. doi: 10.1038/s41597-019-0224-1.
- Olivetti E. A., Cole J. M., Kim E., Kononova O., Ceder G., Han T. Y.-J., Hiszpanski A. M. Data-driven materials research enabled by natural language processing and information extraction. Appl. Phys. Rev. 2020;7:041317. doi: 10.1063/5.0021106.
- Kononova O., He T., Huo H., Trewartha A., Olivetti E. A., Ceder G. Opportunities and challenges of text mining in materials research. iScience. 2021;24:102155. doi: 10.1016/j.isci.2021.102155.
- Sun W., David N. A critical reflection on attempts to machine-learn materials synthesis insights from text-mined literature recipes. Faraday Discuss. 2025;256:614–638. doi: 10.1039/D4FD00112E.
- Kim E., Huang K., Saunders A., McCallum A., Ceder G., Olivetti E. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 2017;29:9436–9444. doi: 10.1021/acs.chemmater.7b03500.
- Huo H., Rong Z., Kononova O., Sun W., Botari T., He T., Tshitoyan V., Ceder G. Semi-supervised machine-learning classification of materials synthesis procedures. npj Comput. Mater. 2019;5:62. doi: 10.1038/s41524-019-0204-1.
- Kim E., Jensen Z., van Grootel A., Huang K., Staib M., Mysore S., Chang H.-S., Strubell E., McCallum A., Jegelka S., et al. Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks. J. Chem. Inf. Model. 2020;60:1194–1201. doi: 10.1021/acs.jcim.9b00995.
- Kim S., Noh J., Gu G. H., Chen S., Jung Y. Element-wise formulation of inorganic retrosynthesis. AI for Accelerated Materials Design, NeurIPS 2022 Workshop. OpenReview; 2022.
- He T., Huo H., Bartel C. J., Wang Z., Cruse K., Ceder G. Precursor recommendation for inorganic synthesis by machine learning materials similarity from scientific literature. Sci. Adv. 2023;9:eadg8180. doi: 10.1126/sciadv.adg8180.
- Lee N., Na G. S., Noh H., Park C. Retrieval-Retro: Retrieval-based Inorganic Retrosynthesis with Expert Knowledge. Adv. Neural Inf. Process. Syst. 2024;37:25375–25400. doi: 10.52202/079017-0799.
- Prein T., Pan E., Haddouti S., Lorenz M., Jehkul J., Wilk T., Moran C., Fotiadis M. P., Toshev A. P., Olivetti E. Retro-Rank-In: A Ranking-Based Approach for Inorganic Materials Synthesis Planning. arXiv. 2025:arXiv:2502.04289. doi: 10.48550/arXiv.2502.04289.
- Huo H., Bartel C. J., He T., Trewartha A., Dunn A., Ouyang B., Jain A., Ceder G. Machine-learning rationalization and prediction of solid-state synthesis conditions. Chem. Mater. 2022;34:7323–7336. doi: 10.1021/acs.chemmater.2c01293.
- Prein T., Rahmanian F., Arul K. P., El-Wafi J., Fotiadis M. P., Heimann J., Weinmann P., Duan Y., Pan E., Olivetti E., Rupp J. L. M. Reaction Graph Networks for Inorganic Synthesis Condition Prediction of Solid State Materials. AI for Accelerated Materials Design, NeurIPS 2024 Workshop; 2024.
- Pan E., Kwon S., Liu S., Xie M., Duan Y., Prein T., Sheriff K., Roman-Leshkov Y., Moliner M., Gomez-Bombarelli R., Olivetti E. A Chemically-Guided Generative Diffusion Model for Materials Synthesis Planning. AI for Accelerated Materials Design, NeurIPS 2024 Workshop; 2024.
- Pan E., Kwon S., Jensen Z., Xie M., Gómez-Bombarelli R., Moliner M., Román-Leshkov Y., Olivetti E. ZeoSyn: A comprehensive zeolite synthesis dataset enabling machine-learning rationalization of hydrothermal parameters. ACS Cent. Sci. 2024;10:729–743. doi: 10.1021/acscentsci.3c01615.
- Zhu Y., Chon M., Thompson C. V., Rupp J. L. Time-temperature-transformation (TTT) diagram of battery-grade Li-garnet electrolytes for low-temperature sustainable synthesis. Angew. Chem., Int. Ed. 2023;62:e202304581. doi: 10.1002/anie.202304581.
- Rupp J. L., Scherrer B., Schäuble N., Gauckler L. J. Time–temperature–transformation (TTT) diagrams for crystallization of metal oxide thin films. Adv. Funct. Mater. 2010;20:2807–2814. doi: 10.1002/adfm.201000377.
- Mirza A., Alampara N., Kunchapu S., Ríos-García M., Emoekabu B., Krishnan A., Gupta T., Schilling-Wilhelmi M., Okereke M., Aneesh A. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat. Chem. 2025;17:1027. doi: 10.1038/s41557-025-01815-x.
- Gottweis J., Weng W.-H., Daryin A., Tu T., Palepu A., Sirkovic P., Myaskovsky A., Weissenberger F., Rong K., Tanno R. Towards an AI co-scientist. arXiv. 2025:arXiv:2502.18864. doi: 10.48550/arXiv.2502.18864.
- Penadés J. R., Gottweis J., He L., Patkowski J. B., Shurick A., Weng W.-H., Tu T., Palepu A., Myaskovsky A., Pawlosky A. AI mirrors experimental science to uncover a novel mechanism of gene transfer crucial to bacterial evolution. bioRxiv. 2025:639094. doi: 10.1101/2025.02.19.639094.
- Cohrs K.-H., Diaz E., Sitokonstantinou V., Varando G., Camps-Valls G. Large language models for causal hypothesis generation in science. Mach. Learn.: Sci. Technol. 2025;6:013001. doi: 10.1088/2632-2153/ada47f.
- Taylor R., Kardas M., Cucurull G., Scialom T., Hartshorn A., Saravia E., Poulton A., Kerkez V., Stojnic R. Galactica: A large language model for science. arXiv. 2022:arXiv:2211.09085. doi: 10.48550/arXiv.2211.09085.
- Antunes L. M., Butler K. T., Grau-Crespo R. Crystal structure generation with autoregressive large language modeling. Nat. Commun. 2024;15:10570. doi: 10.1038/s41467-024-54639-7.
- Rubungo A. N., Li K., Hattrick-Simpers J., Dieng A. B. LLM4Mat-bench: benchmarking large language models for materials property prediction. arXiv. 2024:arXiv:2411.00177. doi: 10.48550/arXiv.2411.00177.
- Chen R., Miller B., Sriram A., Wood B. FlowLLM: Flow matching for material generation with large language models as base distributions. Adv. Neural Inf. Process. Syst. 2024;37:46025–46046. doi: 10.52202/079017-1464.
- Miller B. K., Chen R. T., Sriram A., Wood B. M. FlowMM: Generating materials with Riemannian flow matching. In Forty-First International Conference on Machine Learning; 2024.
- Kim S., Jung Y., Schrier J. Large language models for inorganic synthesis predictions. J. Am. Chem. Soc. 2024;146:19654–19659. doi: 10.1021/jacs.4c05840.
- Yanguas-Gil A., Dearing M. T., Elam J. W., Jones J. C., Kim S., Mohammad A., Thang Nguyen C., Sengupta B. Benchmarking large language models for materials synthesis: The case of atomic layer deposition. J. Vac. Sci. Technol. A. 2025;43:032406. doi: 10.1116/6.0004319.
- Kim K. J., Balaish M., Wadaguchi M., Kong L., Rupp J. L. Solid-state Li–metal batteries: challenges and horizons of oxide and sulfide solid electrolytes and their interfaces. Adv. Energy Mater. 2021;11:2002689. doi: 10.1002/aenm.202002689.
- Weinmann S., Quincke L., Winkler L., Hinricher J. J., Kurnia F., Kim K. J., Rupp J. L. M. Sustainable functional ceramics. Nat. Nanotechnol. 2025.
- Balaish M., Kim K. J., Chu H., Zhu Y., Gonzalez-Rosillo J. C., Kong L., Paik H., Weinmann S., Hood Z. D., Hinricher J., Miara L. J., Rupp J. L. M. Emerging Processing Guidelines for Solid Electrolytes in the Era of Oxide-Based Solid-State Batteries. Chem. Soc. Rev. 2025;54:8925–9007. doi: 10.1039/d5cs00358j.
- Wu Y., Liu L., Xie Z., Bae J., Chow K.-H., Wei W. Promoting high diversity ensemble learning with EnsembleBench. In 2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI); 2020; pp 208–217.
- Cormack G. V., Clarke C. L., Buettcher S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval; 2009; pp 758–759.
- Kingery W. D., Bowen H. K., Uhlmann D. R. Introduction to Ceramics. John Wiley & Sons; 1976.
- Reed J. S. Principles of Ceramics Processing. Wiley: New York; 1995.
- Zausinger J., Pennig L., Chlodny K., Limbach V., Ketteler A., Prein T., Singh V. M., Danziger M. M., Born J. Regress, Don’t Guess–A Regression-like Loss on Number Tokens for Language Models. arXiv. 2024:arXiv:2411.02083. doi: 10.48550/arXiv.2411.02083.
- Cobbe K., Kosaraju V., Bavarian M., Chen M., Jun H., Kaiser L., Plappert M., Tworek J., Hilton J., Nakano R. Training Verifiers to Solve Math Word Problems. arXiv. 2021:arXiv:2110.14168. doi: 10.48550/arXiv.2110.14168.
- Pavlovic V. P., Stojanovic B. D., Pavlovic V. B., Marinkovic-Stanojevic Z., Zivkovic L., Ristic M. M. Synthesis of BaTiO3 from a mechanically activated BaCO3-TiO2 system. Sci. Sinter. 2008;40:21–26. doi: 10.2298/sos0801019p.
- Pan E., Kwon S., Liu S., Xie M., Hoffman A. J., Duan Y., Prein T., Sheriff K., Roman-Leshkov Y., Moliner M. DiffSyn: A Generative Diffusion Approach to Materials Synthesis Planning. arXiv. 2025:arXiv:2509.17094. doi: 10.48550/arXiv.2509.17094.
- Chiang W.-L., Zheng L., Sheng Y., Angelopoulos A. N., Li T., Li D., Zhu B., Zhang H., Jordan M., Gonzalez J. E. Chatbot arena: An open platform for evaluating LLMs by human preference. In Forty-First International Conference on Machine Learning; 2024.
- Xu B., Wang Q., Lyu Y., Dai D., Zhang Y., Mao Z. S2ynRE: Two-stage self-training with synthetic data for low-resource relation extraction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics; 2023; pp 8186–8207.
- ValizadehAslani T., Shi Y., Wang J., Ren P., Zhang Y., Hu M., Zhao L., Liang H. Two-stage fine-tuning with ChatGPT data augmentation for learning class-imbalanced data. Neurocomputing. 2024;592:127801. doi: 10.1016/j.neucom.2024.127801.
- Pieper T., Ballout M., Krumnack U., Heidemann G., Kühnberger K.-U. Enhancing Small Language Models via ChatGPT and Dataset Augmentation. In International Conference on Applications of Natural Language to Information Systems; 2024; pp 269–279.
- Tran M., Pang Y., Paul D., Pandey L., Jiang K., Guo J., Li K., Zhang S., Zhang X., Lei X. A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data. arXiv. 2025:arXiv:2501.12501. doi: 10.48550/arXiv.2501.12501.
- Prein T., Pan E., Doerr T., Olivetti E., Rupp J. L. MTEncoder: A multi-task pretrained transformer encoder for materials representation learning. AI for Accelerated Materials Design, NeurIPS 2023 Workshop; 2023.
- Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017;30.
- Li X., Grandvalet Y., Davoine F., Cheng J., Cui Y., Zhang H., Belongie S., Tsai Y.-H., Yang M.-H.. Transfer learning in computer vision tasks: Remember where you come from. Image Vis Comput. 2020;93:103853. doi: 10.1016/j.imavis.2019.103853. [DOI] [Google Scholar]
- Gopalakrishnan K., Khaitan S. K., Choudhary A., Agrawal A.. Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Constr. Build. Mater. 2017;157:322–330. doi: 10.1016/j.conbuildmat.2017.09.110. [DOI] [Google Scholar]
- Wang A. Y.-T., Kauwe S. K., Murdock R. J., Sparks T. D.. Compositionally restricted attention-based network for materials property predictions. npj Comput. Mater. 2021;7:77. doi: 10.1038/s41524-021-00545-1. [DOI] [Google Scholar]
- Bauer C., Burkhardt S., Dasgupta N. P., Ellingsen L. A.-W., Gaines L. L., Hao H., Hischier R., Hu L., Huang Y., Janek J.. et al. Charging sustainable batteries. Nat. Sustainability. 2022;5:176–178. doi: 10.1038/s41893-022-00864-1. [DOI] [Google Scholar]
- Balaish M., Gonzalez-Rosillo J. C., Kim K. J., Zhu Y., Hood Z. D., Rupp J. L.. Processing thin but robust electrolytes for solid-state batteries. Nat. Energy. 2021;6:227–239. doi: 10.1038/s41560-020-00759-5. [DOI] [Google Scholar]
- Sand S. C., Rupp J. L., Yildiz B.. A critical review on Li-ion transport, chemistry and structure of ceramic–polymer composite electrolytes for solid state batteries. Chem. Soc. Rev. 2025;54:178–200. doi: 10.1039/d4cs00214h. [DOI] [PubMed] [Google Scholar]
- Mahbub R., Huang K., Jensen Z., Hood Z. D., Rupp J. L., Olivetti E. A.. Text mining for processing conditions of solid-state battery electrolytes. Electrochem. Commun. 2020;121:106860. doi: 10.1016/j.elecom.2020.106860. [DOI] [Google Scholar]
- Hood Z. D., Zhu Y., Miara L. J., Chang W. S., Simons P., Rupp J. L.. A sinter-free future for solid-state battery designs. Energy Environ. Sci. 2022;15:2927–2936. doi: 10.1039/D2EE00279E. [DOI] [Google Scholar]
- Pfenninger R., Struzik M., Garbayo I., Stilp E., Rupp J. L.. A low ride on processing temperature for fast lithium conduction in garnet solid-state battery films. Nat. Energy. 2019;4:475–483. doi: 10.1038/s41560-019-0384-4. [DOI] [Google Scholar]
- Chu, H. ; Defferriere, T. ; Nandi, P. ; Kaiser, W. ; Wolz, L. M. ; Kurnia, F. ; O’Leary, W. ; Altantzis, T. ; Verbeeck, J. ; Egger, D. A. ; Bals, S. ; Eichhorn, J. ; Tuller, H. L. ; Rupp, J. L. . Manuscript in Revision.
- Li, S. ; Weinmann, S. ; Prein, T. ; Chu, H. ; Rupp, J. L. ; others Understanding the defect chemistry and Li+ transportation of Ta-doped Li7La3Zr2Ta0. 5O12-δ by active ML learning Raman spectroscopy image. Proceedings of 24th International Conference on Solid State Ionics (SSI24). 2024. [Google Scholar]
- Morozov A. V., Paik H., Boev A. O., Aksyonov D. A., Lipovskikh S. A., Stevenson K. J., Rupp J. L., Abakumov A. M.. Thermodynamics as a Driving Factor of LiCoO2 Grain Growth on Nanocrystalline Ta-LLZO Thin Films for All-Solid-State Batteries. ACS Appl. Mater. Interfaces. 2022;14:39907–39916. doi: 10.1021/acsami.2c07176. [DOI] [PubMed] [Google Scholar]
- Han F., Zhu Y., He X., Mo Y., Wang C.. Electrochemical stability of Li10GeP2S12 and Li7La3Zr2O12 solid electrolytes. Adv. Energy Mater. 2016;6:1501590. doi: 10.1002/aenm.201501590. [DOI] [Google Scholar]
- Kim S., Jung C., Kim H., Thomas-Alyea K. E., Yoon G., Kim B., Badding M. E., Song Z., Chang J., Kim J.. et al. The Role of Interlayer Chemistry in Li-Metal Growth through a Garnet-Type Solid Electrolyte. Adv. Energy Mater. 2020;10:1903993. doi: 10.1002/aenm.201903993. [DOI] [Google Scholar]
- Schwanz D. K., Villa A., Balasubramanian M., Helfrecht B., Marinero E. E.. Bi aliovalent substitution in Li7La3Zr2O12 garnets: Structural and ionic conductivity effects. AIP Adv. 2020;10:035204. doi: 10.1063/1.5141764. [DOI] [Google Scholar]
- Wagner R., Rettenwander D., Redhammer G. J., Tippelt G., Sabathi G., Musso M. E., Stanje B., Wilkening M., Suard E., Amthauer G.. Synthesis, crystal structure, and stability of cubic Li7–x La3Zr2–x Bi x O12. Inorg. Chem. 2016;55:12211–12219. doi: 10.1021/acs.inorgchem.6b01825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Košir J., Mousavihashemi S., Suominen M., Kobets A., Wilson B. P., Rautama E.-L., Kallio T.. Supervalent doping and its effect on the thermal, structural and electrochemical properties of Li 7 La 3 Zr 2 O 12 solid electrolytes. Mater. Adv. 2024;5:5260–5274. doi: 10.1039/D4MA00119B. [DOI] [Google Scholar]
- Samson A. J., Hofstetter K., Bag S., Thangadurai V.. A bird’s-eye view of Li-stuffed garnet-type Li 7 La 3 Zr 2 O 12 ceramic electrolytes for advanced all-solid-state Li batteries. Energy Environ. Sci. 2019;12:2957–2975. doi: 10.1039/C9EE01548E. [DOI] [Google Scholar]
- Brown T., Mann B., Ryder N., Subbiah M., Kaplan J. D., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A.. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020;33:1877–1901. doi: 10.18653/v1/2021.mrl-1.1. [DOI] [Google Scholar]
- Bai S.. et al. Qwen 2.5-VL Technical Report. arXiv. 2025:arXiv:2502.13923. doi: 10.48550/arXiv.2502.13923. [DOI] [Google Scholar]
- LMArena LMArena Leaderboard Overview. Online Resource, 2025. (accessed 24 may, 2025). [Google Scholar]
- AI, M. Mistral Small 3.1, 2025. https://mistral.ai/news/mistral-small-3-1 (accessed 18 may, 2025).
- Zhang D.. et al. DeepSeek-V3 Technical Report. arXiv. 2025:arXiv:2412.19437. doi: 10.48550/arXiv.2412.19437. [DOI] [Google Scholar]
- DeepSeek DeepSeek-V3–0324 Release, 2025. (accessed 24 may, 2025).
- DeepMind, G. I ntroducing Gemini 2.0: Our New AI Model for the Agentic Era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#gemini-2-0-flash (accessed 18 May, 2025).
- OpenAI GPT-4: A Sneak Peek at OpenAI’s Next Generation Model, 2024. https://openai.com/index/gpt-4-1/ (accessed 18 May, 2025).
- AI, M. Llama-4: Multimodal Intelligence, 2024. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ (accessed 18 may, 2025).
- xAI Grok-3, 2024. https://x.ai/news/grok-3 (accessed 18 may, 2025).
- Goodall R. E., Lee A. A.. Predicting materials properties without crystal structure: deep representation learning from stoichiometry. Nat. Commun. 2020;11:6280. doi: 10.1038/s41467-020-19964-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen, T. ; Guestrin, C. . Xgboost: A scalable tree boosting system. In Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, 2016; pp 785–794 [Google Scholar]
Data Availability Statement
The source code underlying this project is available at the GitHub repository https://github.com/Thorben010/llm_synthesis.