Abstract
Training machine learning interatomic potentials that are both computationally and data-efficient is a key challenge for enabling their routine use in atomistic simulations. To this effect, we introduce franken, a scalable and lightweight transfer learning framework that extracts atomic descriptors from pre-trained graph neural networks and transfers them to new systems using random Fourier features — an efficient and scalable approximation of kernel methods. It also provides a closed-form fine-tuning strategy for general-purpose potentials such as MACE-MP0, enabling fast and accurate adaptation to new systems or levels of quantum mechanical theory with minimal hyperparameter tuning. On a benchmark dataset of 27 transition metals, franken outperforms optimized kernel-based methods in both training time and accuracy, reducing model training from tens of hours to minutes on a single GPU. We further demonstrate the framework’s strong data-efficiency by training stable and accurate potentials for bulk water and the Pt(111)/water interface using just tens of training structures. Our open-source implementation (https://franken.readthedocs.io) offers a fast and practical solution for training potentials and deploying them for molecular dynamics simulations across diverse systems.
Subject terms: Condensed-matter physics, Physical chemistry, Theoretical chemistry
Introduction
Molecular dynamics (MD) simulations are a powerful tool for obtaining atomistic insights into a wide range of processes, from material properties to chemical and catalytic reactions1. These simulations integrate the equations of motion using a model for atomic interactions — the potential energy surface (PES). Traditionally, the PES is derived from quantum mechanical calculations such as density functional theory (DFT) or from simplified empirical models fitted to experimental data or first-principles calculations. However, quantum mechanical methods, while accurate, are computationally expensive, and empirical force fields often lack transferability and accuracy. In recent years, machine learning interatomic potentials (MLIPs) have emerged as a compelling solution, striking a balance between the accuracy of DFT and the efficiency of empirical force fields. These models learn the PES by regression on large datasets of quantum mechanical calculations, achieving quantum-level accuracy while improving inference speed by orders of magnitude.
For extended systems, MLIPs are typically constructed by expressing the total energy as a sum of local atomic contributions that depend on the surrounding environment. To ensure physical consistency, the PES must be invariant to relevant symmetries, such as roto-translation and permutations of atoms of the same chemical species. Early MLIPs relied on handcrafted descriptors to encode the local atomic environments in a way that preserves physical symmetries. Two widely used approaches are Behler-Parrinello symmetry functions2,3 and the Smooth Overlap of Atomic Positions (SOAP)4. Symmetry functions capture radial and angular correlations using predefined functional forms designed to respect rotational and translational symmetries. Although protocols to automatically select informative symmetry functions from large candidate pools exist5, this procedure becomes increasingly difficult as the number of chemical species grows, since the number of distinct atomic interactions that must be described increases combinatorially with the number of species. On top of this, the model architecture must also be carefully optimized to achieve optimal results6. SOAP represents the local atomic environment around a central atom using a smoothed atomic density projected onto a basis of radial and spherical harmonic functions. The overlap of these densities is then used to compare environments, typically in a kernel-based7,8 framework such as Gaussian Approximation Potentials9. This approach, however, suffers from two key limitations: its dimensionality grows quadratically with the number of chemical species, as separate channels are typically used for each pairwise element combination, and its three-body nature limits the accuracy in representing highly complex interactions. More recently, the Atomic Cluster Expansion (ACE)10 has been proposed as a systematic and general framework for constructing many-body descriptors.
Compared to earlier approaches, ACE offers improved scalability and flexibility by organizing atomic interactions into a hierarchy of body-order terms and employing a shared basis across chemical species. This leads to a more favorable, typically linear, scaling with the number of species. However, the number of basis functions can still become large for high body orders or complex systems, and the method relies on careful hyperparameter tuning and sparsification to remain computationally tractable.
For the class of methods outlined above, training of MLIPs can be decomposed into two main tasks: selecting a suitable representation and regressing energy and forces as functions of that representation. Following the deep learning approach, methods such as DeepMD11 are fully data-driven, using one neural network to learn the representation and another to predict target properties. Geometric graph neural networks (GNNs)12, based on message-passing schemes, go a step further by integrating representation learning directly into the model architecture. Atoms are treated as nodes in a graph, with edges connecting atoms whose pairwise distances fall within a specified cutoff. Node features are then updated iteratively by aggregating information from neighboring atoms. Repeating this process over multiple message-passing layers allows the model to construct expressive representations that capture complex many-body interactions and chemical environments, thereby tightly coupling representation learning and regression. In terms of the scaling with the number of chemical species, GNNs typically encode chemical identity via a fixed-dimensional embedding at the node level. Hence, the representation size remains constant with respect to the number of species, making GNNs particularly appealing for modeling chemically diverse systems. Moreover, GNNs offer a natural framework for encoding physical symmetries, as permutation and translational invariance are built into the graph structure and aggregation operations. Furthermore, recent advances have enabled explicit rotational equivariance through specialized architectures13. These allow modeling of vector and tensor-valued properties, and even for invariant quantities such as the potential energy, equivariant models have demonstrated improved data-efficiency and generalization14. 
Altogether, these features make GNN-based MLIPs a powerful and flexible class of models, unifying representation learning and regression while incorporating the symmetries of atomic systems.
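To make the message-passing idea concrete, the following minimal NumPy sketch builds a cutoff graph and performs two toy aggregation updates. The distance-based weights and tanh update rule are illustrative stand-ins, not any specific GNN architecture; the point is that the resulting node features depend only on interatomic distances, hence are permutation-, translation-, and rotation-invariant.

```python
import numpy as np

rng = np.random.default_rng(5)

def build_edges(R, cutoff):
    """Edges connect atom pairs closer than the cutoff (no periodicity here)."""
    d = np.linalg.norm(R[:, None] - R[None, :], axis=-1)
    i, j = np.where((d < cutoff) & (d > 0))
    return i, j, d[i, j]

def message_passing(R, h, cutoff, n_layers=2):
    """Toy invariant message passing: each layer updates a node's features
    with a distance-weighted sum over its neighbors (invariant since it
    only uses interatomic distances)."""
    i, j, dij = build_edges(R, cutoff)
    w = np.exp(-dij)                       # simple distance-based edge weight
    for _ in range(n_layers):
        msg = np.zeros_like(h)
        np.add.at(msg, i, w[:, None] * h[j])   # aggregate neighbor messages
        h = np.tanh(h + msg)                   # nonlinear node update
    return h

R = rng.uniform(0.0, 5.0, size=(10, 3))        # random atomic positions
h0 = rng.normal(size=(10, 8))                  # initial per-atom embeddings
h = message_passing(R, h0, cutoff=2.5)
```

Because the update uses only distances, rotating all positions by the same matrix leaves the output features unchanged, which is the invariance property discussed above.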
Despite these advances, constructing an MLIP for a specific system remains a non-trivial task. A critical challenge is assembling a training dataset that spans all the relevant phase space15, an unattainable requirement through ab initio molecular dynamics alone. This challenge often requires the use of active learning strategies16,17 to iteratively refine training datasets, and of enhanced sampling techniques to capture rare events. Such events include phase transitions18,19 and chemical reactions20–23 that occur on time scales inaccessible to unbiased simulations. For MLIPs to become routinely applicable, we also need models that ensure stable and physically accurate molecular dynamics simulations. Indeed, while the quality of an MLIP is primarily evaluated according to the prediction error it attains on a held-out dataset, the ultimate goal is to produce reliable MD simulations14,21,24. However, models with comparable force predictions on a test dataset can exhibit vastly different behaviors when deployed “in the wild”. Recently, equivariant neural networks such as Nequip25, Allegro26 or MACE27 have been shown to improve the data-efficiency for molecular dynamics, allowing stable simulations already when trained on a few hundred samples14.
A recent paradigm shift has occurred with the emergence of large datasets of DFT calculations for specific domains, such as OpenCatalyst28–30 for catalysis and Materials Project31,32 or Alexandria33 for materials. GNNs trained on these datasets, such as MACE-MP034, SevenNet035, CHGNet32, M3GNet36, and MatterSim37, introduced the paradigm of general-purpose or “universal” potentials. Unlike previous approaches, where each potential was typically trained and used only for a specific system, these large models, trained over millions of configurations, have proven adaptable, offering stable simulations even for systems not explicitly included in the training set. This has prompted many efforts to assess their reliability in a range of zero-shot applications38–41. Even if these large MLIPs provide stable and qualitatively correct molecular dynamics, they often lack quantitative reliability for predictive applications. This can occur for different reasons. For instance, the target system may lie outside the training distribution, leading to poor generalization. Additionally, the level of DFT theory used to generate the reference data may not be sufficiently accurate for a specific system, requiring additional fine-tuning to provide a physically meaningful model. These factors highlight the fundamental need for strategies to adapt large-scale MLIPs to specific systems. Fine-tuning the weights of a universal model is a natural approach, although the optimal strategy remains unclear42. Options range from using pre-trained weights and continuing training on a new dataset38,43, to selectively freezing model layers and optimizing the rest44,45, or employing multi-head architectures with replay training44. The latter strategy requires training with (part of) the original dataset, resulting in a significant computational cost.
The question of how large models can be adapted to specific systems can be framed in the broader context of transfer learning and knowledge distillation, two complementary strategies that share the overarching goal of improving efficiency by reusing pre-trained models. The former exploits information from pre-trained models to accelerate training and improve data efficiency on new tasks46, whereas the latter focuses on compressing large “teacher” models into smaller “student” models to achieve faster inference47.
In the field of MLIPs, transfer learning has been exploited to retrain models at a higher level of quantum mechanical theory with fewer calculations48–50, or to adapt models trained on a given system to a related one51. Such strategies often involve fine-tuning pre-trained models48,49 or replacing and retraining specific components such as the readout layers of a GNN50.
In our previous work52, we explored a more general transfer learning approach by first extracting intermediate representations from a pre-trained GNN on the OC20 dataset28 and then using kernel mean embeddings53 to transfer this information to new systems. This approach provided a simple and efficient solution for learning the potential energy surface in a data-limited regime and also demonstrated good transferability. However, kernel methods suffer from poor scalability in both training and inference, making this method impractical for MD simulations. To overcome these limitations, one can resort to large-scale kernel techniques. For instance, the Nyström approximation54,55 allows one to scale to very large datasets56 by approximating the kernel matrix through low-rank decompositions. Alternatively, random features57 (RFs) efficiently approximate the kernel by projecting the infinite-dimensional kernel map into a randomized finite-dimensional feature space, effectively turning kernel-based learning into a linear regression on the random features. In the specific case of MLIPs, the scalar potential can be learned alongside the forces58, exploiting automatic differentiation. Both RFs and Nyström have been leveraged to accelerate the training and inference time of kernel-based interatomic potentials9,59. However, the scalability of such approaches remains constrained by the reliance on explicit physical descriptors, which can become computationally demanding as the complexity of the system increases.
In this manuscript, we build on these ideas and propose franken, an extremely efficient and scalable transfer learning framework to build MD-capable machine learning potentials. franken focuses on training and data efficiency, enabling the rapid adaptation of pre-trained GNN representations to new systems via random features. Through extensive experiments on different systems, we demonstrate that franken achieves accuracy comparable to MLIPs based on physical descriptors, eliminates the need for extensive hyperparameter tuning thanks to careful design choices such as the multiscale approach for the kernel length-scales, and enables stable and data-efficient MD simulations with as few as tens of training samples.
Results
The transfer learning algorithm
franken’s MLIPs are created via “model surgery” between the inner layers of a pre-trained GNN and additional yet-to-train neural network blocks. Our open-source code executes this procedure seamlessly, and the end-user works with a single final MLIP, which can be easily deployed to perform molecular dynamics simulations. In particular, the architecture of franken can be decomposed into three functionally different blocks (Fig. 1). In the following, we highlight their main features, referring to the Methods section for a more detailed description.
Fig. 1. The components of franken.
1) The backbone (left, red box) utilizes a pre-trained message-passing neural network to extract invariant descriptors hn(R) from atomic environments, where R represents atomic coordinates. These representations capture local chemical environments through pre-trained node features. 2) The head component (middle, blue box) introduces non-linearity through random Fourier features, transforming the GNN descriptors using randomly sampled weights W and bias b to generate features ϕn(R) via a cosine transformation. This random feature mapping approximates a Gaussian kernel. 3) The output module (right, green box) computes atomic energies through a learnable weighted sum of these features, with the total energy E being the sum across all atoms. Forces F are calculated as the negative gradient of energy with respect to atomic positions. In the bottom row, we highlight three properties enabled by franken. (A) Constant training times irrespective of the number of chemical species in the system (unlike methods based on physical descriptors, which scale at least linearly). (B) Near-ideal scalability of franken across multiple GPUs, with training time decreasing linearly as GPU count increases. (C) Faster inference compared to the original backbone GNN model, with up to 1.32 × speedup (in the case of MACE-MP0-L0, see detailed timings in Table S1).
The first block is responsible for extracting the representation of the chemical environment for each atom in a given input configuration R. Specifically, the n-th atom is associated with a vector hn(R) of SO(3)-invariant node-level descriptors from the inner layers of a pre-trained GNN (see Methods). This vector provides an effective representation of the local environment that respects relevant symmetries, with a computational and memory footprint not depending on the number of chemical species, yielding an efficient strategy even for multicomponent systems (Fig. 1A).
The second block of franken models is the core of our transfer learning approach, which transforms the pre-trained GNN descriptors hn into the final features used to predict energy and forces. To this end, we employ random Fourier features57,60,61 (RF) maps ϕn(R) = ϕ(hn(R)), a class of transformations introduced to scale kernel methods to large datasets while still preserving their key learning guarantees62,63. Specifically, random features approximate a kernel function k as a dot product in a suitable feature space:

k(x, x′) ≈ ϕ(x)⊤ϕ(x′).
In franken we use the trigonometric RF map (Eq. (3) and Fig. 1) that approximates the Gaussian kernel57,61 with a given length-scale (see Methods). Notably, at variance with other kernel approximation techniques55, the RF maps can be implemented as a simple neural network layer with cosine nonlinearity, making its integration into existing deep-learning pipelines extremely simple.
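The trigonometric RF map can be sketched in a few lines of NumPy. The descriptor dimension, number of features, and length-scale below are illustrative values, not franken's defaults; the sanity check verifies that the dot product of two feature vectors approximates the Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_rf_map(h, D, sigma, rng):
    """Trigonometric random-feature map approximating a Gaussian kernel.

    h:     (n, d) array of descriptors, one row per atom.
    D:     number of random features.
    sigma: kernel length-scale.
    """
    d = h.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))  # frequencies ~ N(0, sigma^-2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    return np.sqrt(2.0 / D) * np.cos(h @ W + b)

# Sanity check: phi(x) . phi(y) ~= exp(-||x - y||^2 / (2 sigma^2))
x = rng.normal(size=(1, 16))
y = rng.normal(size=(1, 16))
sigma = 4.0
phi = gaussian_rf_map(np.vstack([x, y]), D=50_000, sigma=sigma, rng=rng)
approx = float(phi[0] @ phi[1])
exact = float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2)))
```

As noted above, the map is just an affine transform followed by a cosine, so it drops into a deep-learning pipeline as an ordinary layer with fixed random weights.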
Thanks to this approach, the atomic energy for the n-th atom can be simply modeled as the scalar product between the RFs ϕn(R) and a learnable vector of coefficients, En(R; w) = w⊤ϕn(R). The total energy is obtained by summing the atomic contributions, while the forces are calculated as the negative gradient of the total energy, making it a conservative force field. As detailed in the Methods, the coefficients w are optimized based on a convex combination of energy and force errors (Eq. (6)), whose gradient can be computed analytically. This, together with the convexity of RF models, allowed us to write a closed-form expression that globally minimizes the error without the need to perform gradient descent. Furthermore, the evaluation of the minimizer can be easily parallelized over multiple GPUs, resulting in virtually linear scaling (Fig. 1B). The resulting model can then be easily deployed for molecular dynamics using the Atomistic Simulations Environment (ASE)64 suite as well as the large-scale atomic/molecular massively parallel simulator (LAMMPS)65 MD engine. As a byproduct, since it is not necessary to perform a full forward pass through the GNN backbone but only up to the layers used for constructing the representation, franken models attain slightly higher inference rates compared to the full backbone GNN model (Fig. 1C).
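The closed-form solve can be illustrated with a minimal energy-only ridge regression (the actual loss of Eq. (6) also includes force errors; the features below are synthetic stand-ins for the per-structure sums of atomic RF features, since E = Σn w⊤ϕn implies total-energy features are sums over atoms):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: per-structure summed RF features Phi and reference
# energies y. In franken these would come from the GNN backbone + RF head.
n_structures, D = 2000, 512
Phi = rng.normal(size=(n_structures, D))
w_true = rng.normal(size=D)
y = Phi @ w_true + 0.01 * rng.normal(size=n_structures)  # noisy targets

lam = 1e-6  # L2 regularization strength
# Closed-form ridge solution w = (Phi^T Phi + lam I)^{-1} Phi^T y,
# obtained by a direct linear solve instead of gradient descent.
A = Phi.T @ Phi + lam * np.eye(D)
w = np.linalg.solve(A, Phi.T @ y)

rmse = np.sqrt(np.mean((Phi @ w - y) ** 2))
```

Since the normal-equations matrix A is only D × D, the dominant cost (accumulating Phi⊤Phi) can be split over batches of structures, which is why the multi-GPU parallelization mentioned above scales almost linearly.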
In the following, we focus mostly on backbones obtained from MACE-based potentials27, and particularly the MACE-MP034 invariant model (L = 0). This is a general-purpose potential trained on MPTraj36, containing extended (periodic) systems drawn from the Materials Project66, calculated at the DFT level using the Perdew-Burke-Ernzerhof (PBE)67 exchange and correlation functional. The training set covers 89 elements of the periodic table, which makes (at least in principle) its use possible in many different applications, from materials to aqueous systems and from interfaces to catalytic systems. Note that, in order to serve as a backbone, a potential must have been trained on at least some configurations containing the chemical species of interest. In the SI, we also compare with equivariant models (L = 1), other GNN architectures (SevenNet0), and training datasets (MACE-OFF).
TM23 dataset—speed and accuracy across the periodic table
To demonstrate the soundness of our approach, we applied it to every one of the 27 transition metals from the TM23 dataset68, comparing both training time and force prediction accuracy against the reported baselines. These are (i) FLARE69, a kernel method based on ACE descriptors, and (ii) NequIP25, an equivariant GNN architecture. The authors of the TM23 dataset poured considerable effort into training the baselines and performed extensive hyperparameter tuning to get the most accurate models possible, oftentimes sacrificing training speed for accuracy (see Fig. 3c). In the following, we report results obtained using the MACE-MP0 (L = 0)34 model as the backbone block. In the SI, we also report results obtained with the SevenNet-035 backbone (Fig. S1), a model based on the NequIP architecture which has been trained on a Materials Project-based dataset.
Fig. 3. Easy hyperparameter tuning, shown using selected elements (Ti, Au, Cu) of the TM23 dataset.
A shows how performance improves monotonically with increasing random features. In (B) we validate the effectiveness of multiscale random features that eliminate the need for precise length-scale selection. C illustrates the comparison between force predictions and training times vs model complexity for three different systems (Ti, Au, Cu), for which more detailed studies were reported in ref. 68. For each model, we report results by increasing the complexity of the model (denoted by the size of the circles). For NequIP, this corresponds to changing the tensor rank from invariant features (L = 0) to equivariant ones (L = 3 for Ti, Cu and L = 5 for Au); for FLARE, these are 2-body (B1) and 3-body (B2) representations; for franken, we changed the number of random features from 512 to 32k.
In Fig. 2, we report the training times (A), the accuracy on forces (B), and energies (C) for each one of the transition metals in the TM23 dataset. While the training times for franken are in the ballpark of a few minutes, including full hyperparameter tuning, FLARE’s and NequIP’s training times do not include hyperparameter tuning and range from a few hours up to weeks. Furthermore, the reported NequIP training times are conservative estimates based on the reported training procedure, and are realistic only if the models converged within the first epoch of training. These considerations put the cumulative training time of the FLARE and NequIP models at weeks, if not months.
Fig. 2. Accuracy vs efficiency on the TM23 dataset.
A Train time across the TM23 dataset; franken MLIPs require only minutes for training compared to hours (FLARE) or weeks (NequIP). B Force percent error, as defined in ref. 68, for each element. C Energy errors across the dataset. D Accuracy across the periodic table. Mean absolute force error across the 27 transition metals of TM2368 for franken with the MACE-MP0 backbone and for FLARE, a kernel method using physical descriptors. For each element, franken’s MAE is reported on the top left and FLARE’s on the bottom right; the lower of the two is shown in bold with a darker background. FLARE’s results are taken from ref. 68 and refer to the ACE B2 descriptors (three-body representation), whose parameters (angular and radial fidelity, cutoff) have been optimized for each elemental system. franken generally attains more accurate forces compared to FLARE B2, with an average 8.9% lower MAE across chemical species, while requiring no selection of physical parameters, providing an accurate representation across the periodic table.
In terms of energy/forces prediction, the NequIP baseline is unquestionably more accurate than both FLARE and franken. As discussed below, however, NequIP’s accuracy is highly correlated with the complexity (that is, the angular resolution L) of the model, and rapidly declines for more moderate, and arguably more practical, values of L (see Fig. 3C). FLARE and franken, on the other hand, attain similar accuracy on the forces, with franken showing an edge in 24 out of the 27 elements and force errors that are on average ~10% lower; see the detailed periodic table reported in Fig. 2D. It is important to note that the FLARE results were obtained by searching, for each system, for the set of ACE descriptors that most faithfully represents the local environments. This resulted in heterogeneous models with degrees of expansion of the radial basis ranging from 7 to 17, of the angular one from 3 to 6, and with a cutoff from 4 to 7 Å. In contrast, our transfer learning approach requires no choice of system-dependent parameters, providing a general representation that is as good as the best representation based on physical descriptors. This is a considerable advantage over traditional methods, especially in the case of multi-component systems, where it is even harder to search for the optimal parameters.
On the other hand, franken’s models have very few hyperparameters, considerably reducing the overall cost of training a production-ready MLIP. The “size” of franken models is solely determined by the total number of random features D, a parameter that can be generally set to be as large as possible within memory constraints, as we observed a monotonic increase in accuracy as a function of D, see Fig. 3A.
The loss hyperparameters (the energy-to-forces ratio α and the L2 regularization λ in Eq. (6)) can be quickly chosen with a grid search, since they do not require rerunning the backbone. The last parameter is the kernel length-scale σ, which can have a significant influence on the final model accuracy. Grid search is inefficient here, as any change in σ requires recomputing all features through the backbone. To avoid this, we devised a multiscale approach (see the Methods section for further details) that allows the user to specify a set of σ values instead of a single one. This approach is much less dependent on setting the length-scales precisely, since it simultaneously captures information at multiple resolutions. As shown in Fig. 3B, the multiscale approach achieves performance comparable to that of optimally tuned single-scale models. This allowed us to use the same set of values throughout all experiments on very different systems, effectively removing the need for manual tuning of σ.
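A minimal sketch of the multiscale construction: the feature budget is split evenly across a grid of length-scales (the σ values and dimensions below are illustrative, not franken's defaults), so the dot product of the concatenated features approximates an average of Gaussian kernels at different resolutions.

```python
import numpy as np

rng = np.random.default_rng(2)

def multiscale_rf_map(h, D_total, sigmas, rng):
    """Concatenate Gaussian RF blocks, one per length-scale in `sigmas`.

    Each block approximates a Gaussian kernel at its own sigma; the
    rescaled concatenation approximates the average of these kernels,
    so no single sigma has to be tuned precisely.
    """
    d = h.shape[1]
    D_block = D_total // len(sigmas)
    blocks = []
    for sigma in sigmas:
        W = rng.normal(scale=1.0 / sigma, size=(d, D_block))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D_block)
        blocks.append(np.sqrt(2.0 / D_block) * np.cos(h @ W + b))
    # Rescale so phi(x) . phi(y) approximates the *average* of the kernels.
    return np.hstack(blocks) / np.sqrt(len(sigmas))

h = rng.normal(size=(4, 16))                 # stand-in GNN descriptors
phi = multiscale_rf_map(h, D_total=6000, sigmas=[2.0, 4.0, 8.0], rng=rng)
```

Only the downstream linear solve sees the concatenated features, so adding scales changes nothing else in the pipeline.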
We conclude our analysis by highlighting that the performance of both NequIP and FLARE changes quite dramatically with the model complexity, unlike franken, which remains stable. For example, using smaller versions of the NequIP architecture to speed up the training can seriously impact the final force prediction error. This is clearly shown in Fig. 3C, where we study the trade-off between training time and force accuracy upon changing the model complexity. For franken, we investigated the effect of raising the number of random features from 512 up to 32k. For FLARE, we changed the ACE descriptors from B1 (two-body) to B2 (three-body), while for NequIP, we report the results from ref. 68 upon changing the angular resolution L. While franken’s training time and final force accuracy are fairly insensitive to model size, the force MAE of NequIP doubles going to L = 0, with training times still on the order of one day. In summary, all these results show that franken allows MLIPs to be trained extremely efficiently while providing accuracy on par with the best kernels built on physical descriptors, but without their unfavorable scaling and limitations, and without the need to manually tune the hyperparameters of the model.
Bulk water—data efficient potentials for molecular dynamics
Besides showing the accuracy and efficiency of franken in predicting energies and forces, the real benchmark for an MLIP is the ability to run stable and physically meaningful MD simulations. In fact, a low error on force predictions is not enough to judge simulation stability or thermodynamic accuracy14. To test franken’s performance on molecular dynamics, we started with bulk water. The motivation for choosing this system is twofold. On the one hand, it serves to test the performance of the transfer learning approach outside the training domain, as it is out of distribution with respect to the systems (inorganic crystals) used to train the MACE-MP0 backbone, unlike the TM23 dataset tested previously. On the other hand, water is highly sensitive to the details of the electronic structure method used. Indeed, the PBE exchange-correlation functional, which was used to generate the MACE-MP0 training data, yields overstructured water, with a tendency to predict a stronger hydrogen-bonding network than experimentally observed70. This leads to radial distribution function (RDF) peaks higher than the experimental data and to reduced mobility, with a significantly underestimated diffusion coefficient.
We therefore explored the ability of franken to transfer MACE-based representations to a dataset of DFT calculations made with a level of theory that describes water more closely to the experiments. To this end, we used the dataset curated by Montero et al.71, which is characterized by high-quality DFT calculations (performed using RPBE + D3) and a thorough sampling. Here, the authors compared the performances of Behler-Parrinello neural networks (BPNN) and kernel-based MLIPs built from physical descriptors (two and three body), concluding that they both provided practically indistinguishable results, although the NN-based models trained on smaller datasets were unstable in some simulations.
For this system, our main goal is to assess the data efficiency of franken MLIPs. To this end, we trained franken models on different random subsets of the training data, containing as few as 8 samples, and calculated the root mean square error (RMSE) on the test set; see the left panel of Fig. 4. For completeness, we also report the zero-shot predictions, where the energy is calculated with respect to a common reference. Comparing the errors on energy and forces against the results reported for the kernel method in ref. 71, we observe that with as few as 8 training samples franken obtains a high accuracy (force RMSE of 30 meV/Å), comparable to what was obtained by the reference kernel and BPNN baselines using the whole training dataset (27 meV/Å). In the case of energy predictions, franken attains an RMSE of 0.4 meV/atom with 8 samples, and already with 32 samples surpasses the kernel baseline trained on the full training dataset (0.3 meV/atom). Energy and force RMSEs decrease monotonically when training franken’s models on larger training sets. Comparing the learning curves of the force predictions as a function of the number of samples (top left panel of Fig. 4), we observe that franken is more data efficient than training the reference kernel models from scratch. Furthermore, since our approach can be seen as a form of fine-tuning of the MACE architecture, which replaces the original readout with an RF-based head, we compared it to a strategy in which the GNN features are kept frozen and the original readout layers are optimized (Fig. S3). This strategy achieves a force RMSE of approximately 35 meV/Å but saturates quickly as the number of training samples increases, likely due to the limited expressivity of the linear or shallow NN typically used as readout layers of the GNN.
In contrast, franken consistently achieves lower errors across all dataset sizes, as the RFs provide a richer functional form while inheriting the closed-form solution (no gradient descent needed) and the favorable generalization properties of kernel methods in the low-data regime.
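The gap between a linear readout and an RF head can be illustrated on synthetic data (all quantities below are stand-ins, not franken's actual descriptors): fitted with the same closed-form ridge solve, a linear model on frozen descriptors saturates on a target that is nonlinear in those descriptors, while the random-feature head does not.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Stand-in 'frozen GNN descriptors' h and a target nonlinear in h."""
    h = rng.normal(size=(n, 2))
    y = np.sin(h[:, 0]) + 0.5 * np.cos(2.0 * h[:, 1])
    return h, y

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1e-3):
    """Closed-form ridge regression: fit on (X_tr, y_tr), predict on X_te."""
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    return X_te @ w

h_tr, y_tr = make_data(512)
h_te, y_te = make_data(512)

# 1) Linear "readout" on the frozen descriptors.
lin_rmse = np.sqrt(np.mean((ridge_fit_predict(h_tr, y_tr, h_te) - y_te) ** 2))

# 2) Random-feature head on the same descriptors.
D, sigma = 2048, 1.0
W = rng.normal(scale=1.0 / sigma, size=(2, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
rf = lambda h: np.sqrt(2.0 / D) * np.cos(h @ W + b)
rf_rmse = np.sqrt(np.mean(
    (ridge_fit_predict(rf(h_tr), y_tr, rf(h_te)) - y_te) ** 2))
```

On this toy problem the RF head reaches a much lower test error than the linear readout, mirroring the saturation behavior discussed above.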
Fig. 4. Data efficiency on bulk water.
A Root mean squared error (RMSE) in force (top) and energy (bottom) predictions for MACE-MP0 (L = 0) zero-shot model, franken with MACE-MP0 (L = 0) backbone, and the reference kernel optimized in ref. 71. franken shows rapid improvement with only a few training samples, outperforming the kernel baseline trained on the full dataset. In the center we show a snapshot of water MD simulation produced with franken, from which structural and dynamical properties are computed. B Accuracy of molecular dynamics simulations assessed via the error on the radial distribution function (RDF, top) and diffusion coefficient (bottom). The RDF error is evaluated against the DFT data, while the diffusion from the results obtained with the kernel in ref. 71. Inset: oxygen-oxygen RDF from simulations with MACE-MP0 zero-shot and franken trained on 8 samples. The black dotted line represents the DFT reference line as reported in ref. 71. franken delivers stable and accurate RDFs with as few as 8 configurations, demonstrating strong out-of-distribution generalization and correction of the MACE-MP0 overstructured water prediction.
A competitive accuracy on energy and forces, however, is only part of the story, as we are mostly interested in the ability to produce stable and accurate molecular dynamics simulations. For this reason, we ran multiple 1 ns-long MD simulations for each model, which we used to evaluate static (radial distribution function) and dynamic (diffusion coefficient) properties. The first result is that all simulations, even those trained on 8 samples, are stable throughout their duration. On the right side of Fig. 4, we show how these MD-related observables change as a function of the number of training samples. It is remarkable that already with only 8 samples, franken is able to match the results reported in ref. 71 even for molecular dynamics, obtaining results indistinguishable from the reference. All this despite the fact that the zero-shot description of both properties was significantly off, overestimating the radial distribution function and underestimating the atomic diffusivity. Nevertheless, with only a handful of samples, our transfer learning approach achieves an excellent result. In the SI, we report the full partial radial distribution functions (Fig. S5), contrasted with the DFT reference data and the experimental data.
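Both observables can be estimated from a trajectory along standard lines; the sketch below uses a single ideal-gas frame and a random walk as stand-ins for actual MD data (in practice the RDF is averaged over many frames, and units follow the simulation).

```python
import numpy as np

rng = np.random.default_rng(4)

def rdf(positions, box, r_max, n_bins=100):
    """Radial distribution function g(r) for one cubic-box frame,
    using the minimum-image convention (valid for r_max <= box/2)."""
    n = len(positions)
    diff = positions[:, None, :] - positions[None, :, :]
    diff -= box * np.round(diff / box)                 # minimum image
    r = np.linalg.norm(diff, axis=-1)[np.triu_indices(n, k=1)]
    hist, edges = np.histogram(r, bins=n_bins, range=(0.0, r_max))
    rc = 0.5 * (edges[1:] + edges[:-1])                # bin centers
    shell = 4.0 * np.pi * rc**2 * (edges[1] - edges[0])
    rho = n / box**3
    return rc, hist / (shell * rho * n / 2)            # normalize to ideal gas

def diffusion_coefficient(traj, dt):
    """Einstein relation D = (slope of MSD) / 6, from unwrapped coords."""
    msd = np.mean(np.sum((traj - traj[0]) ** 2, axis=-1), axis=1)
    t = dt * np.arange(len(traj))
    half = len(t) // 2                                 # fit the linear regime
    slope = np.polyfit(t[half:], msd[half:], 1)[0]
    return slope / 6.0

# Stand-in data: random positions (g(r) ~ 1) and a random-walk trajectory.
box, n_atoms = 12.4, 64
rc, g = rdf(rng.uniform(0.0, box, size=(n_atoms, 3)), box, r_max=box / 2)
traj = np.cumsum(rng.normal(scale=0.1, size=(2000, n_atoms, 3)), axis=0)
D = diffusion_coefficient(traj, dt=1.0)
```

For the ideal-gas frame g(r) fluctuates around 1, and for the random walk D approaches (step variance per unit time)/6, which is what the assertions below check.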
The data shown in Fig. 4 refer to franken with the MACE-MP0 backbone at L = 0. In the SI, we extend our analysis to different GNN backbones, both invariant (L = 0) and equivariant (L = 1), using MACE-MP0 as well as MACE-OFF72 pre-trained models. The latter were trained on the SPICE73 dataset of organic molecules, which also includes water clusters. Studying franken’s learning curves as a function of the number of random features and the number of samples, we show that they are fairly independent of the chosen backbone, reflecting the generality of our transfer learning approach (Fig. S2). In the case of MACE-MP0, the gain in accuracy from using equivariant models to predict energies and forces is marginal, while it is more significant for MACE-OFF. In ref. 72, the authors noted that although the MACE-OFF models were trained on small clusters of water to improve the accuracy of solvated systems, the description of water by the invariant MACE-OFF-small model was not excellent, and larger equivariant models had to be used to obtain a good description. In terms of molecular dynamics results, however, when using MACE-OFF-small as the backbone of franken we are able to cure these shortcomings with a handful of samples, obtaining a result indistinguishable from that obtained with the heftier MACE-MP0 (Fig. S4). Our transfer learning approach thus allows the use of cheaper architectures which are more computationally efficient at inference without sacrificing accuracy. Indeed, using the backbone from MACE-MP0, franken achieves inference times of approximately 80 μs/(atom ⋅ step), which decrease to about 47 μs/(atom ⋅ step) with MACE-OFF (see Tab. S1). Although these values are slower than the 5-10 μs/(atom ⋅ step) reported for the kernel and BPNN baselines in ref. 71, it is important to note that inference speed depends strongly on system characteristics.
In particular, the computational cost of kernel and BPNN approaches scales at least linearly with the number of chemical species, whereas in franken it remains independent of this factor, making it particularly advantageous for multi-component systems.
Modeling water-metal interfaces: the case of Pt(111)
To further increase the complexity of our benchmark, we extend the evaluation of our approach from isolated transition metals and bulk aqueous systems to the more challenging case of water-metal interfaces. These interfaces play a pivotal role in a broad range of scientific and technological domains, including (electro)catalysis, batteries, and corrosion processes74,75. However, accurately capturing their behavior remains a significant challenge due to the complex interplay between water structuring, surface chemistry, and long-timescale dynamics. To investigate this regime, we selected the Pt(111)/water interface as a representative test case, using the dataset developed in ref. 76. Using Behler-Parrinello neural network potentials, the authors reported that the Pt(111)/water interface features a double adsorption layer, with a tightly bound primary layer and a more weakly bound secondary one. The primary layer is anchored by strong Pt-O interactions, while hydrogen bonds link it to the secondary one, and an effective repulsion between adsorbed molecules creates a semi-ordered structure. To achieve these results, they created a training set of 50,000 structures obtained through an active learning process combining the NN potential and MD simulations. The RMSEs of the network in predicting energies and forces are ~1 meV/atom and ~70 meV/Å, respectively. Interestingly, the authors reported that training with a few hundred structures was not sufficient to generate stable molecular dynamics simulations and that they had to use the entire dataset.
As in the previous example, we evaluated the sample efficiency of franken’s approach, starting with an assessment of the model’s accuracy in reproducing energies and forces using the MACE-MP0 invariant model. In Fig. 5, we report the learning behavior as a function of the number of training samples and the number of random features. It is worth observing that energies are learned very rapidly and are rather insensitive to the number of random features, with RMSE values immediately converging to ~1 meV/atom. For forces, more considerations are in order. On the one hand, we observe high accuracy and data efficiency: already with 10 samples, the RMSE is lower than the one obtained by the reference model trained on the whole dataset. On the other hand, we see a monotonic improvement as the number of random features increases, saturating around ~45 meV/Å with 16k RFs. We also observe that the number of samples needed to converge the accuracy increases, as expected, with the number of parameters to be optimized: with 1024 RFs, 50 points already achieve a result within 2 meV/Å of that obtained on the whole dataset (63 vs 61 meV/Å), while with 24k RFs one needs to go up to 1000 points to get a similar result (46 vs 44 meV/Å), which at the same time is also considerably more accurate in absolute terms.
Fig. 5. Data efficiency for the Pt(111)/water interface.
Learning curves for (A) energy and (B) force predictions as a function of the number of samples (x-axis) and the number of random features (y-axis) for the franken model with the MACE-MP0 backbone. The first row contains the results reported in ref. 76 using the Behler-Parrinello neural network potential (which starts from 50 samples). Each cell is colored according to the RMSE on the energy or forces, with the value reported within the cell.
More importantly, we assessed franken’s performance in capturing structural properties via molecular dynamics. In the top panel of Fig. 6, we report the density profile of oxygen along the direction perpendicular to the surface as generated with MACE-MP0 zero-shot. This profile has some correct qualitative features (such as the presence and location of the two peaks) but misses many quantitative ones (such as the height of the peaks and the structure of water at medium range). franken, already when trained with 10 samples, captures most of the features of the density profile, while from 100 samples onwards it correctly reproduces the reference data, including the characteristic layering of water near the platinum surface into two distinct peaks, corresponding to strong and weak adsorption, respectively. These results again underscore the exceptional data efficiency of our transfer learning approach, not only in learning energies and forces but also in producing stable molecular dynamics simulations and accurate physical observables from a minimal dataset. Hence, it paves the way for the modeling of more complex and realistic interfaces, which require the inclusion of many different ingredients, from interactions with adsorbates to defects and ions in water, all key aspects of many relevant applications.
Fig. 6. Structural characterization of the Pt(111)/water interface.

A Oxygen density profile along the z direction normal to the Pt(111) surface, comparing predictions from a zero-shot MACE-MP0 model (solid line) to a reference obtained with an independent model trained on the full dataset (dotted line). The MACE-MP0 model captures basic features but deviates in peak heights and fine structure. B Density profiles predicted by franken models trained using 24k RFs on an increasing number of samples (solid lines, from light to dark colors); with as few as 100 samples, these models accurately match the reference. In the inset, we report a snapshot from a molecular dynamics simulation showing the structure of the two layers composing the interface, the first strongly chemisorbed and the second weakly physisorbed.
Discussion
Designing efficient MLIPs across the dimensions of data efficiency, training speed, and inference speed is essential to enable their widespread use. In this work, we introduced franken, an efficient and scalable transfer learning framework that focuses on data and training efficiency by leveraging invariant representations learned from pre-trained graph neural networks.
The reasons for the success of this approach are twofold. On the one hand, we exploit the ability of GNNs to learn, in a data-driven way, an effective representation of local environments. On the other hand, the random features approach allows this representation to be transferred to new systems in an extremely efficient way. By effectively reusing information that has already been learned, we need only a few training samples to adapt it to new systems. Of course, these two aspects are closely related, since without a good representation, the transfer learning approach would be less effective. Although in this manuscript we have mainly focused on the MACE architecture, it would be interesting to study the impact of alternative backbones on the latent representations in future work. Similarly, access to models trained on larger and diverse datasets will allow us to explore how the size and diversity of the pre-training data affect transferability.
Although franken represents a general framework for constructing MLIPs, one practical scenario is the fine-tuning of large pre-trained models to higher levels of quantum mechanical theory or to systems outside their original training scope. In this context, franken offers a global, closed-form solution to the fine-tuning problem which leverages the data-efficient properties of kernel methods while requiring only a single pass over the data, in contrast to traditional fine-tuning requiring multiple epochs of gradient descent-based training. Another promising application is the realistic modeling of complex catalytic systems or materials featuring numerous distinct chemical environments, where franken’s computational efficiency and adaptability become particularly valuable. Moreover, its sample and computational efficiency make it ideally suited for high-throughput screening workflows, enabling rapid and accurate prediction of properties across large libraries of candidate materials or catalysts. Finally, although this work focused on materials and condensed matter systems, franken’s features, such as its independence from the number of chemical species and its high training efficiency, make it suitable for biomolecular systems as well. With an appropriate backbone, it can be extended to systems such as small peptides, proteins, and metalloenzymes, offering a promising route toward data-efficient force fields for biochemistry and drug discovery.
Moving forward, several developments could enhance franken’s capabilities and broaden its applications. For more complex processes such as chemical reactions, where most of the information lies in the tails of the distribution, sampling and selecting the relevant structures becomes as crucial as having data-efficient architectures77. Combining it with active learning strategies would enable efficient exploration of relevant configurations, in turn facilitating accurate modeling of rare events such as chemical reactions, phase transitions, and complex dynamical processes. Another interesting direction is exploring alternative pre-trained backbones derived from self-supervised or meta-learning techniques78,79. Such methods could yield more convenient representations, expanding the scope of applicability to different and more complex physical and chemical systems.
While in this work we focused on data efficiency and training speed, these aspects could be complemented by approaches that specifically target inference speed. For example, knowledge distillation techniques80–82 could be used to replace the backbone GNN with a lighter model trained to mimic the pre-trained “teacher”. Another complementary strategy could involve few-shot fine-tuning of the backbone prior to transfer learning to improve the representation, similar to the distillation strategy used in ref. 83, although care must be taken not to degrade the learned representations42.
Finally, extending franken’s framework beyond the prediction of energies and forces represents an exciting frontier84,85. Applying our transfer learning approach to chemical properties prediction or reaction coordinate identification86 could provide substantial advancements in computational chemistry and materials science. By adapting pre-trained representations to these distinct but related predictive tasks, franken could become a versatile tool, bridging the gap between accurate quantum mechanical calculations and practical, large-scale molecular simulations.
Methods
Extracting GNN representations
franken is able to transfer knowledge from large pre-trained GNN MLIPs to new physical systems by extracting invariant features from the inner layers of these pre-trained models. As GNN architectures usually consist of multiple hidden layers stacked on top of each other, a basic strategy to obtain a GNN representation is to stop the forward pass at the chosen layer ℓ and take the invariant node features of that layer as the representation. This is the approach introduced in ref. 52, which we followed for SevenNet-0 on the TM23 dataset. For backbones from the MACE family27, instead, we extracted a representation consisting of the concatenation of the node features up to the chosen layer ℓ. This strategy mimics the structure of the MACE architecture itself, where the node features from every hidden layer are first concatenated and then fed to a readout layer for atomic energy prediction. Although we focus on invariant descriptors, in cases where the GNN backbone employs equivariant message-passing schemes25,27, the invariant features of the inner layers are determined by the underlying equivariant information from previous layers. This enables franken to leverage equivariant properties indirectly while maintaining the computational simplicity of invariant features.
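As an illustration, the layer-concatenation strategy can be sketched with a toy model; the classes and names below are illustrative stand-ins, not part of the franken API, and the tanh layers merely mimic the invariant node-feature updates of a message-passing backbone:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyLayer:
    """Stand-in for one message-passing block mapping node features to node features."""
    def __init__(self, dim_in, dim_out):
        self.W = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)

    def __call__(self, h):
        return np.tanh(h @ self.W)

def extract_descriptors(layers, h0, ell):
    """Run the forward pass up to layer ell and concatenate the invariant
    node features of every visited layer (the MACE-style strategy)."""
    feats, h = [], h0
    for layer in layers[:ell]:
        h = layer(h)
        feats.append(h)
    return np.concatenate(feats, axis=-1)

layers = [ToyLayer(16, 32), ToyLayer(32, 32), ToyLayer(32, 32)]
h0 = rng.standard_normal((20, 16))           # 20 atoms, 16 input features per atom
H = extract_descriptors(layers, h0, ell=2)   # concatenated features of layers 1 and 2
assert H.shape == (20, 64)
```

For the single-layer strategy used with SevenNet-0, one would simply return `feats[-1]` instead of the concatenation.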
Before being fed to the RF block (see the second panel of Fig. 1), we standardize the GNN features by subtracting their mean and scaling them by their standard deviation. In systems with multiple chemical elements, we found that per-species standardization works particularly well – here, the mean and standard deviation used to scale the features of an atom of species s are computed only over atoms of the same element s. This standardization process is quite important in practice, as RF models work best when their inputs are properly centered and scaled. Aside from the standardization of the GNN features, the backbone is kept frozen throughout, and the only learnable parameters are the coefficients w that are linearly combined with the random features ϕ(h), see below.
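The per-species standardization described above amounts to computing feature statistics within each element separately; a minimal NumPy sketch (function name illustrative):

```python
import numpy as np

def per_species_standardize(H, species, eps=1e-9):
    """Standardize atomic features using the mean/std computed per chemical species."""
    H = H.astype(float).copy()
    for s in np.unique(species):
        mask = species == s
        mu = H[mask].mean(axis=0)
        sigma = H[mask].std(axis=0) + eps   # eps guards against constant features
        H[mask] = (H[mask] - mu) / sigma
    return H

rng = np.random.default_rng(0)
H = rng.normal(loc=[5.0, -2.0], scale=[3.0, 0.5], size=(100, 2))  # toy GNN features
species = rng.integers(0, 2, size=100)      # two elements, e.g. O and H
Hs = per_species_standardize(H, species)

# each species' features are now zero-mean and unit-variance
assert np.allclose(Hs[species == 0].mean(axis=0), 0.0, atol=1e-8)
assert np.allclose(Hs[species == 0].std(axis=0), 1.0, atol=1e-6)
```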
Random features
Random Fourier Features offer a framework to approximate kernel functions by expressing them as scalar products in a finite-dimensional feature space:
k(x, x′) ≈ ϕ(x)⊤ϕ(x′) | 1 |
To compute such feature map, Rahimi and Recht57 proposed to leverage the Bochner theorem87, which states that any continuous, stationary kernel can be represented as the Fourier transform of a non-negative measure. Hence, we can construct the kernel starting from its Fourier transform. For instance, the Gaussian kernel,
k(x, x′) = exp(−∥x − x′∥²/(2σ²)) | 2 |
has a Fourier transform that is itself Gaussian, and so it can be approximated by averaging cosine functions with frequencies sampled from a standard normal distribution. In practice, this leads to a randomized feature map of the form57:
ϕ(x) = √(2/D) cos(Wx/σ + b) | 3 |
where each row of the frequency matrix W is drawn from a standard multivariate normal distribution, and the components of the offset vector b are sampled uniformly from [0, 2π)57,61. Here, D is the number of random features, which defines the model complexity and controls the quality of the kernel approximation. In this work, we used orthogonal random features61, which yield an improved approximation of the Gaussian kernel by slightly modifying the sampling scheme for W. Importantly, W and b are sampled once at initialization and do not require training, making the method simple and efficient to implement.
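The kernel approximation of Eqs. (1)-(3) is easily checked numerically; the sketch below uses plain i.i.d. frequency sampling rather than the orthogonal variant used in practice, with illustrative toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 8, 8192, 1.5                  # input dim, number of RFs, length-scale

W = rng.standard_normal((D, d)) / sigma     # frequencies with sigma absorbed into W
b = rng.uniform(0.0, 2.0 * np.pi, size=D)   # random offsets in [0, 2*pi)

def phi(x):
    """Random Fourier feature map: sqrt(2/D) * cos(W x + b)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))   # Gaussian kernel
approx = phi(x) @ phi(y)                                     # RF approximation
assert abs(exact - approx) < 0.1   # Monte Carlo error shrinks as O(1/sqrt(D))
```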
Two observations are now in order. First, replacing the exact kernel with its finite-dimensional approximation enables learning using only D × D matrices. This avoids the quadratic or quartic scaling associated with exact kernel methods, especially in the presence of gradients. For example, computing the whole kernel matrix for both energy and force labels requires evaluating up to 9(NT)² elements, where N is the number of atoms and T the number of atomistic configurations in the dataset. In contrast, the RF-based model scales linearly with the number of training points and enables constant-time inference, making it practical even for large datasets and systems. Second, the random feature map can be interpreted as a fixed, one-layer neural network with a cosine activation, applied to the GNN-derived descriptors hn(R). This observation allows for seamless integration into existing deep learning frameworks.
Multiscale random features
In kernel-based models such as franken, the most critical hyperparameter is the kernel length-scale σ, which governs the locality of the Gaussian kernel and directly influences the resulting random features ϕn(R). Since these features are the most computationally expensive part of the model to evaluate during training, tuning σ via grid search can be costly and inefficient. To address this, we introduced multiscale random features, a simple yet effective strategy that allows the model to operate over a range of length-scales within a single representation. Rather than relying on a single, optimal σ, we construct random features that span multiple scales, thereby eliminating the need for explicit hyperparameter tuning. Specifically, given a total budget of D random features, we define a set of d length-scales {σ1, σ2, …, σd}, sampled uniformly within a predefined range. For each σj, we allocate D/d features by sampling a corresponding frequency matrix Wj whose rows are drawn from a normal distribution with variance σj−2; that is, we absorb the length-scale σj inside Wj itself. The final feature map is constructed by concatenating the individual components:
ϕ(x) = [ϕ1(x), ϕ2(x), …, ϕd(x)] | 4 |
where each block is given by:
ϕj(x) = √(2/D) cos(Wjx + bj) | 5 |
Empirically, this multiscale approach achieves comparable performance to optimally tuned single-scale models, as shown in panel (B) of Fig. 3, while significantly reducing the tuning effort.
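Equations (4)-(5) translate into a short sketch; the dimensions and length-scale grid below are illustrative toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 8, 4096                       # input dim, total random-feature budget
sigmas = [8.0, 16.0, 24.0, 32.0]     # the set of length-scales {sigma_1, ..., sigma_d}
Dj = D // len(sigmas)                # features allocated to each scale

# one (frequency matrix, offset) pair per length-scale, sigma_j absorbed into W_j
blocks = [(rng.standard_normal((Dj, d)) / s,
           rng.uniform(0.0, 2.0 * np.pi, size=Dj))
          for s in sigmas]

def phi_multiscale(x):
    """Concatenate the per-length-scale RF blocks into a single feature map."""
    return np.concatenate([np.sqrt(2.0 / D) * np.cos(W @ x + b) for W, b in blocks])

x = rng.standard_normal(d)
z = phi_multiscale(x)
assert z.shape == (D,)
```

With the sqrt(2/D) normalization shared across blocks, the inner product ϕ(x)⊤ϕ(x′) approximates the average of the Gaussian kernels at the d length-scales.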
Model optimization
To train franken, we minimize a convex combination of ℓE and ℓF, the standard squared errors on energies and forces, together with an L2 regularization term on the norm of the weights ∥w∥:
ℓα(w) = (1 − α) ℓE(w) + α ℓF(w) + λ∥w∥² | 6 |
where α is a parameter which determines the relative importance of energy and forces in the loss function, and λ sets the regularization strength.
The convexity properties of RF models ensure that there exists a vector of coefficients w* which globally minimizes the loss function ℓα(w). We now compute the global minimizer in the energy-only case (α = 0), as the same steps (with slightly more cumbersome calculations) lead to a similar result when forces are present (α > 0). The energy-only loss over a dataset of reference configurations {(Rt, Et)}t=1,…,T is
ℓE(w) = ∑t (Ê(Rt; w) − Et)² + λ∥w∥² | 7 |
To minimize ℓE(w) let’s first recall that for a single configuration R,
Ê(R; w) = ∑n w⊤ϕ(hn(R)) = w⊤Φ(R) | 8 |
where we have defined Φ(R) = ∑n ϕ(hn(R)). Plugging the definition of the energy back into the loss function and taking the gradient of ℓE(w) with respect to w, one gets ∇wℓE(w) = 2(C + λI)w − 2c,
where the covariance matrix and the coefficient vector are, respectively, C = ∑t Φ(Rt)Φ(Rt)⊤ and c = ∑t Et Φ(Rt).
Since the loss function ℓE(w) is strongly convex in w, it has a unique global minimizer corresponding to the solution of ∇wℓE(w) = 0
w* = (C + λI)⁻¹c | 9 |
The linear system in Eq. (9) has a dimension equal to the number of random features D, for which a reasonable size is on the order of a few thousand, making its solution extremely fast. As it turns out, the computational bottleneck in computing w* lies in the evaluation of the random features ϕn(R) for every configuration in the training dataset. Indeed, when forces are taken into account, the descriptor calculation involves differentiating through the GNN backbone and usually accounts for more than 99% of the total training time. Fortunately, this process can be carried out in parallel over multiple GPUs, by assigning to each GPU a subset of the training points to evaluate. This yields an almost linear speed-up of the training process with the number of GPUs, see panel (B) of Fig. 1.
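The energy-only closed-form solution can be verified on synthetic data; a minimal NumPy sketch assuming noiseless toy labels and pre-computed random features (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D, lam = 500, 10, 64, 1e-8   # configurations, atoms, random features, regularization

# toy per-atom random features phi(h_n(R)), as if already computed by the RF map
phi_atoms = rng.standard_normal((T, N, D))
Phi = phi_atoms.sum(axis=1)        # Phi(R_t) = sum_n phi(h_n(R_t)), one row per config

w_true = rng.standard_normal(D)
E = Phi @ w_true                   # synthetic energy labels

# covariance matrix and coefficient vector, then the cheap D x D solve of Eq. (9)
C = Phi.T @ Phi
c = Phi.T @ E
w_star = np.linalg.solve(C + lam * np.eye(D), c)

assert np.allclose(w_star, w_true, atol=1e-6)   # recovers the generating weights
```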
Dataset and training details
We report here the training procedure for franken’s models. Unless otherwise stated, we trained the MLIPs using the invariant MACE-MP0-L0 GNN backbone, using as atomic descriptors the node features concatenated up to the second interaction block. A multiscale Gaussian kernel was employed, using four length-scales ranging from 8 to 32. Loss hyperparameters were automatically optimized via an efficient grid search (which does not require recalculating the covariance matrix). In particular, we tested force-energy weight ratios α from the values [0.01, 0.5, 0.99] and the L2 regularization λ from 10−6 to 10−11. In the following, we describe the datasets used and further training details.
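The grid-search trick mentioned above, reusing the covariance statistics across hyperparameter values, can be sketched as follows on toy energy-only data: the expensive accumulation of C and c happens once, and only the cheap D × D solve is repeated for each λ (names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 500, 64
w_true = rng.standard_normal(D)
Phi_tr = rng.standard_normal((T, D))
E_tr = Phi_tr @ w_true + 0.1 * rng.standard_normal(T)   # noisy training energies
Phi_val = rng.standard_normal((200, D))
E_val = Phi_val @ w_true

# accumulate the covariance statistics once (the expensive part)
C, c = Phi_tr.T @ Phi_tr, Phi_tr.T @ E_tr

def val_rmse(lam):
    """Re-solve the D x D linear system for a new regularization strength."""
    w = np.linalg.solve(C + lam * np.eye(D), c)
    return np.sqrt(np.mean((Phi_val @ w - E_val) ** 2))

lams = [10.0 ** (-k) for k in range(6, 12)]   # grid from 1e-6 down to 1e-11
best = min(lams, key=val_rmse)
assert val_rmse(best) < 0.1
```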
TM23. This dataset, taken from ref. 68, comprises 27 elemental systems, each obtained from AIMD simulations performed at three temperatures spanning the crystal to the liquid phase: 0.25 Tm, 0.75 Tm, and 1.25 Tm. From these trajectories, 1000 structures per system were selected and recalculated with the PBE exchange-correlation functional (non-spin-polarized calculations). Of these, 100 structures were used for validation. The number of atoms ranged from 31 to 80 depending on the chemical system. For franken, we used the same training (2700) and validation (300) splits as in the dataset paper. Additional training runs were performed on the melt (1.25 Tm) datasets for Ti, Au, and Cu to align with published benchmarks, varying the model complexity.
Water. The training dataset was adopted from ref. 71 and comprises 1495 curated structures generated through enhanced sampling simulations and active learning. Energies and forces were calculated with DFT using the RPBE+D3 exchange-correlation functional. Data acquisition involved replica exchange molecular dynamics and Bayesian on-the-fly learning as implemented in VASP88–93, resulting in a diverse set of configurations spanning a broad thermodynamic range. In addition to the main training dataset, a second dataset of 189 structures was independently generated using temperature-ramped simulations. To better assess model generalizability, we used the first dataset for training and the second one for validation. The number of random features used is 8192, unless otherwise specified for the studies as a function of the number of RFs. Sample complexity studies were also performed by training on subsets of the first dataset ranging from 8 to 1495 structures.
Pt/water interface. The training dataset for the Pt(111)/water interface was generated in ref. 76 and comprises 48,041 configurations of 144-atom systems (a 3 × 4 Pt(111) slab with 48 Pt atoms and 32 water molecules). These structures were generated iteratively, starting from an AIMD dataset and using active learning with a neural network-based MLIP, which was used to generate uncorrelated configurations from a series of 10 ns-long MD simulations at T = 350 K. Energies and forces were calculated using the PBE functional with D3 van der Waals corrections. For the training of franken, sample complexity studies were performed by training models on randomly selected subsets ranging from 10 to 30,000 configurations, with a validation set of 1000 structures. The effect of model complexity was explored by varying the number of random features, using values of 1024, 8192, 16,384, and 24,576.
Molecular dynamics settings
Bulk Water. Two sets of molecular dynamics (MD) simulations were performed to investigate the structural and dynamical properties of bulk water. For the calculation of radial distribution functions, simulations were carried out in the canonical (NVT) ensemble at a temperature of 325 K, using a Langevin integrator with a friction coefficient of 0.01 fs−1. The hydrogen atom mass was set to 2 atomic mass units (amu), and a timestep of 0.5 fs was used. Initial velocities were sampled from a Maxwell-Boltzmann distribution. Each system contained 64 water molecules, and five independent replicas of 100 ps were generated. As shown in ref. 71, this system size does not exhibit significant finite-size effects for RDF calculations. The RDF error is calculated as the square root of the integrated squared difference between the calculated RDF and the DFT reference reported in ref. 71. For the evaluation of the diffusion coefficient, larger systems consisting of 504 molecules were used to ensure convergence of the results. These simulations also employed a timestep of 0.5 fs, with hydrogen atoms assigned their physical mass (1 amu). Initial equilibration was performed in the NVT ensemble, followed by production runs in the microcanonical (NVE) ensemble for 200 ps. We report the absolute error between the calculated diffusion coefficient and the kernel result of ref. 71.
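The RDF error metric can be written compactly; the sketch below assumes the difference is taken squared before integrating, and uses toy Gaussian-peak RDFs with a hand-rolled trapezoidal rule:

```python
import numpy as np

def rdf_error(r, g, g_ref):
    """Square root of the integrated squared deviation between two RDFs
    (trapezoidal rule on the radial grid r)."""
    diff2 = (g - g_ref) ** 2
    integral = np.sum(0.5 * (diff2[1:] + diff2[:-1]) * np.diff(r))
    return np.sqrt(integral)

r = np.linspace(0.0, 6.0, 601)                           # radial grid in Angstrom
g_ref = 1.0 + np.exp(-((r - 2.8) ** 2) / 0.05)           # toy O-O reference RDF
g_over = 1.0 + 1.3 * np.exp(-((r - 2.8) ** 2) / 0.05)    # overstructured prediction

assert rdf_error(r, g_ref, g_ref) == 0.0                 # zero for identical RDFs
assert rdf_error(r, g_over, g_ref) > 0.0                 # positive otherwise
```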
Pt/Water Interface. Owing to the longer timescales involved in the behavior of water at the platinum interface, NVT simulations were carried out for 1 ns at a temperature of 350 K. A timestep of 0.5 fs was employed, with hydrogen atoms assigned a mass of 2 amu. The temperature was controlled using a Nosé-Hoover thermostat with a coupling constant of 2 ps, and a harmonic restraint was used to prevent the water molecules from diffusing to the other side of the metallic slab. To compute the sample efficiency of MD observables, for each number of random features and number of subsamples we performed 5 MD simulations of 1 ns each. To ensure a consistent computational setup when comparing with the reference density profile, we used an independent GNN model trained with MACE on the full training set, since we were not able to produce stable simulations with the NN potential released with the dataset.
Acknowledgements
We thank Michele Bianchi, Simone Perego, and Enrico Trizio for feedback and discussions. We thank Ruohan Wang for suggesting the multiscale RF approach. We acknowledge the support of the Data Science and Computation Facility at the Fondazione Istituto Italiano di Tecnologia and the CINECA award under the ISCRA initiative. This work was partially funded by the European Union—NextGenerationEU initiative and the Italian National Recovery and Resilience Plan (PNRR) from the Ministry of University and Research (MUR), under Project PE0000013 CUP J53C22003010006 “Future Artificial Intelligence Research (FAIR)”. L.R. acknowledges the financial support of the European Research Council (grant SLING 819789), the European Commission (Horizon Europe grant ELIAS 101120237), the Ministry of Education, University and Research (FARE grant ML4IP R205T7J2KP; grant BAC FAIR PE00000013 CUP J33C24000420007 funded by the EU—NGEU).
Author contributions
L.B., P.N., and M.Po. designed the project; P.J.B. prepared the prototype of the method; G.M., P.N. and P.J.B. and L.B. developed the software and performed the simulations; and all authors analyzed the results. L.B. and P.N. wrote the first draft of the manuscript, and all authors edited it.
Data availability
Data used to train interatomic potentials was obtained from https://archive.materialscloud.org/record/2024.48, https://zenodo.org/records/10723405 and https://figshare.com/articles/dataset/Dataset_and_training_files_for_Is_the_water_Pt_111_interface_ordered_at_room_temperature_/14791755?file=29141586.
Code availability
Our implementation of franken is open source and is available at the following link: https://github.com/CSML-IIT-UCL/franken. The software can be easily installed via pip (pip install franken). The documentation is available here: https://franken.readthedocs.io/, which also includes a tutorial that can be run on Google Colab to train the model and deploy it for molecular dynamics. The implementation is based on PyTorch94. To maximize performance and minimize the memory footprint, we wrote custom CUDA kernels for the computation of the covariance, and support multi-GPU training through torch.distributed. Currently, MACE and SevenNet backbones are supported. franken potentials are interfaced both with the Atomistic Simulation Environment (ASE)64, and, in the case of the MACE backbones, also with LAMMPS65.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41524-025-01779-z.
References
- 1. Frenkel, D. & Smit, B. Understanding Molecular Simulation: From Algorithms to Applications (Elsevier, 2023).
- 2. Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007).
- 3. Behler, J. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J. Chem. Phys. 134, 074106 (2011).
- 4. Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B Condens. Matter Mater. Phys. 87, 184115 (2013).
- 5. Imbalzano, G. et al. Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials. J. Chem. Phys. 148, 241730 (2018).
- 6. Kývala, L. & Dellago, C. Optimizing the architecture of Behler-Parrinello neural network potentials. J. Chem. Phys. 159, 094105 (2023).
- 7. Schölkopf, B. & Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002).
- 8. Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis (Cambridge University Press, 2004).
- 9. Bartók, A. P., Payne, M. C., Kondor, R. & Csányi, G. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 104, 136403 (2010).
- 10. Drautz, R. Atomic cluster expansion for accurate and transferable interatomic potentials. Phys. Rev. B 99, 014104 (2019).
- 11. Zhang, L., Han, J., Wang, H., Car, R. & E, W. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 120, 143001 (2018).
- 12. Reiser, P. et al. Graph neural networks for materials science and chemistry. Commun. Mater. 3, 93 (2022).
- 13. Batatia, I. et al. The design space of E(3)-equivariant atom-centred interatomic potentials. Nat. Mach. Intell. 1–12 (2025).
- 14. Fu, X. et al. Forces are not enough: benchmark and critical evaluation for machine learning force fields with molecular simulations. Transactions on Machine Learning Research (2023).
- 15. Kulichenko, M. et al. Data generation for machine learning interatomic potentials and beyond. Chem. Rev. 124, 13681–13714 (2024).
- 16. Jinnouchi, R., Miwa, K., Karsai, F., Kresse, G. & Asahi, R. On-the-fly active learning of interatomic potentials for large-scale atomistic simulations. J. Phys. Chem. Lett. 11, 6946–6955 (2020).
- 17. Schran, C., Brezina, K. & Marsalek, O. Committee neural network potentials control generalization errors and enable active learning. J. Chem. Phys. 153 (2020).
- 18. Bonati, L. & Parrinello, M. Silicon liquid structure and crystal nucleation from ab initio deep metadynamics. Phys. Rev. Lett. 121, 265701 (2018).
- 19. Abou El Kheir, O., Bonati, L., Parrinello, M. & Bernasconi, M. Unraveling the crystallization kinetics of the Ge2Sb2Te5 phase change compound with a machine-learned interatomic potential. npj Comput. Mater. 10, 33 (2024).
- 20. Yang, M., Bonati, L., Polino, D. & Parrinello, M. Using metadynamics to build neural network potentials for reactive events: the case of urea decomposition in water. Catal. Today 387, 143–149 (2022).
- 21. Guan, X., Heindel, J. P., Ko, T., Yang, C. & Head-Gordon, T. Using machine learning to go beyond potential energy surface benchmarking for chemical reactivity. Nat. Comput. Sci. 3, 965–974 (2023).
- 22. Bonati, L. et al. The role of dynamics in heterogeneous catalysis: surface diffusivity and N2 decomposition on Fe(111). Proc. Natl. Acad. Sci. USA 120, e2313023120 (2023).
- 23. David, R. et al. ArcaNN: automated enhanced sampling generation of training sets for chemically reactive machine learning interatomic potentials. Digit. Discov. 4, 54–72 (2025).
- 24. Stocker, S., Gasteiger, J., Becker, F., Günnemann, S. & Margraf, J. T. How robust are modern graph neural network potentials in long and hot molecular dynamics simulations? Mach. Learn. Sci. Technol. 3, 045010 (2022).
- 25. Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
- 26. Musaelian, A. et al. Learning local equivariant representations for large-scale atomistic dynamics. Nat. Commun. 14, 579 (2023).
- 27. Batatia, I., Kovacs, D. P., Simm, G., Ortner, C. & Csányi, G. MACE: higher order equivariant message passing neural networks for fast and accurate force fields. Adv. Neural Inf. Process. Syst. 35, 11423–11436 (2022).
- 28. Chanussot, L. et al. Open Catalyst 2020 (OC20) dataset and community challenges. ACS Catal. 11, 6059–6072 (2021).
- 29. Tran, R. et al. The Open Catalyst 2022 (OC22) dataset and challenges for oxide electrocatalysts. ACS Catal. 13, 3066–3084 (2023).
- 30.Barroso-Luque, L. et al. Open materials 2024 (omat24) inorganic materials dataset and models. arXiv preprint arXiv:2410.12771 (2024).
- 31.Jain, A. et al. Commentary: The materials project: a materials genome approach to accelerating materials innovation. APL Mater1, 011002 (2013). [Google Scholar]
- 32.Deng, B. et al. Chgnet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nat. Mach. Intell.5, 1031–1041 (2023). [Google Scholar]
- 33.Ghahremanpour, M. M. et al. The Alexandria Library, a quantum-chemical database of molecular properties for force field development. Sci. Data (2018). [DOI] [PMC free article] [PubMed]
- 34.Batatia, I. et al. A foundation model for atomistic materials chemistry. arXiv preprint arXiv:2401.00096 (2023).
- 35.Park, Y., Kim, J., Hwang, S. & Han, S. Scalable parallel algorithm for graph neural network interatomic potentials in molecular dynamics simulations. J. Chem. Theory Comput.20, 4857–4868 (2024). [DOI] [PubMed] [Google Scholar]
- 36.Chen, C.-H., Chen, C., Ong, S. P. & Ong, S. P. A universal graph deep learning interatomic potential for the periodic table. Nat. Comput. Sci.2, 718–728 (2022). [DOI] [PubMed] [Google Scholar]
- 37.Yang, H. et al. Mattersim: a deep learning atomistic model across elements, temperatures and pressures. arXiv preprint arXiv:2405.04967 arXiv: 2405.04967 (2024).
- 38.Póta, B., Ahlawat, P., Csányi, G. & Simoncelli, M. Thermal conductivity predictions with foundation atomistic models. arXiv.org arXiv: 2408.00755 (2024).
- 39.Peng, J., Damewood, J. K., Karaguesian, J., Lunger, J. R. & G’omez-Bombarelli, R. Learning ordering in crystalline materials with symmetry-aware graph neural networks. arXiv.org (2024).
- 40.Shiota, T., Ishihara, K., Do, T. M., Mori, T. & Mizukami, W. Taming multi-domain, -fidelity data: Towards foundation models for atomistic scale simulations. arXiv.org arXiv: 2412.13088 (2024).
- 41.Wang, Z. et al. Advances in high-pressure materials discovery enabled by machine learning. Matter Radiation Extremes10, 033801 (2025). [Google Scholar]
- 42.Kumar, A., Raghunathan, A., Jones, R., Ma, T. & Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. Int. Conf. Learn. Represent. (2022).
- 43.Kaur, H. et al. Data-efficient fine-tuning of foundational models for first-principles quality sublimation enthalpies. Faraday Discussions, 256, 120-138 (2025). [DOI] [PMC free article] [PubMed]
- 44.Radova, M., Stark, W. G., Allen, C. S., Maurer, R. J. & Bartók, A. P. Fine-tuning foundation models of materials interatomic potentials with frozen transfer learning. npj Comput Mater, 11, 237 (2025). [DOI] [PMC free article] [PubMed]
- 45.Deng, B. et al. Overcoming systematic softening in universal machine learning interatomic potentials by fine-tuning. arXiv.org arXiv: 2405.07105 (2024).
- 46.Zhuang, F. et al. A comprehensive survey on transfer learning. Proc. IEEE109, 43–76 (2020). [Google Scholar]
- 47.Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop (2015).
- 48.Chen, M. S. et al. Data-efficient machine learning potentials from transfer learning of periodic correlated electronic structure methods: Liquid water at AFQMC, ccSD, and ccSD (t) accuracy. J. Chem. Theory Comput.19, 4510–4519 (2023). [DOI] [PubMed] [Google Scholar]
- 49.Zaverkin, V., Holzmüller, D., Bonfirraro, L. & Kästner, J. Transfer learning for chemically accurate interatomic neural network potentials. Phys. Chem. Chem. Phys.25, 5383–5396 (2023). [DOI] [PubMed] [Google Scholar]
- 50.Bocus, M., Vandenhaute, S. & Van Speybroeck, V. The operando nature of isobutene adsorbed in zeolite h-ssz-13 unraveled by machine learning potentials beyond DFT accuracy. Angew. Chem. Int. Ed.64, e202413637 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Röcken, S. & Zavadlav, J. Enhancing machine learning potentials through transfer learning across chemical elements. J. Chem. Inf. Model.65, 7406–7414 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Falk, J., Bonati, L., Novelli, P., Parrinello, M. & Pontil, M. Transfer learning for atomistic simulations using GNNs and kernel mean embeddings. Adv. Neural Inf. Process. Syst.36, 29783–29797 (2023). [Google Scholar]
- 53.Muandet, K., Fukumizu, K., Sriperumbudur, B. & Schölkopf, B. Kernel mean embedding of distributions: a review and beyond. Found. Trends® Mach. Learn.10, 1–141 (2017). [Google Scholar]
- 54.Williams, C. & Seeger, M. Using the nyström method to speed up kernel machines. In Leen, T., Dietterich, T. & Tresp, V. (eds.) Advances in Neural Information Processing Systems (2000).
- 55.Rudi, A., Camoriano, R. & Rosasco, L. Less is more: Nyström computational regularization. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M. & Garnett, R. (eds.) Advances in Neural Information Processing Systems (NIPS, 2015).
- 56.Meanti, G., Carratino, L., Rosasco, L. & Rudi, A. Kernel methods through the roof: handling billions of points efficiently. In Advances in Neural Information Processing Systems 32 (NIPS, 2020).
- 57.Rahimi, A. & Recht, B. Random features for large-scale kernel machines. Advances in neural information processing systems20 (2007).
- 58.Shi, L., Guo, X. & Zhou, D.-X. Hermite learning with gradient data. J. Comput. Appl. Math.233, 3046–3059 (2010). [Google Scholar]
- 59.Dhaliwal, G., Nair, P. B. & Singh, C. V. Machine learned interatomic potentials using random features. npj Computational Mater.8, 7 (2022). [Google Scholar]
- 60.Pham, N. & Pagh, R. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 239–247 (Association for Computing Machinery, 2013).
- 61.Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann-Rice, D. N. & Kumar, S. Orthogonal random features. Advances in Neural Information Processing Systems29 (NIPS, 2016).
- 62.Rudi, A. & Rosasco, L. Generalization properties of learning with random features. Advances in Neural Information Processing Systems30 (NeurIPS Proceedings, 2017).
- 63.Mei, S. & Montanari, A. The generalization error of random features regression: precise asymptotics and the double descent curve. Commun. Pure Appl. Math.75, 667–766 (2022). [Google Scholar]
- 64.Larsen, A. H. et al. The atomic simulation environment-a python library for working with atoms. J. Phys. Condens. Matter29, 273002 (2017). [DOI] [PubMed] [Google Scholar]
- 65.Thompson, A. P. et al. LAMMPS—a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Comp. Phys. Comm.271, 108171 (2022). [Google Scholar]
- 66.Jain, A. et al. Commentary: The materials project: a materials genome approach to accelerating materials innovation. APL Mater.1, 011002 (2013). [Google Scholar]
- 67.Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett.77, 3865–3868 (1996). [DOI] [PubMed] [Google Scholar]
- 68.Owen, C. J. et al. Complexity of many-body interactions in transition metals via machine-learned force fields from the tm23 data set. npj Comput. Mater.10, 92 (2024). [Google Scholar]
- 69.Vandermause, J. et al. On-the-fly active learning of interpretable bayesian force fields for atomistic rare events. npj Comput. Mater.6, 20 (2020). [Google Scholar]
- 70.Lin, I.-C., Seitsonen, A. P., Tavernelli, I. & Rothlisberger, U. Structure and dynamics of liquid water from ab initio molecular dynamics: Comparison of blyp, pbe, and revpbe density functionals with and without van der waals corrections. J. Chem. theory Comput.8, 3902–3910 (2012). [DOI] [PubMed] [Google Scholar]
- 71.Montero de Hijes, P., Dellago, C., Jinnouchi, R., Schmiedmayer, B. & Kresse, G. Comparing machine learning potentials for water: Kernel-based regression and behler-parrinello neural networks. J. Chem. Phys.160, 114107 (2024). [DOI] [PubMed] [Google Scholar]
- 72.Kovács, D. P. et al. MACE-OFF: Short-Range Transferable Machine Learning Force Fields for Organic Molecules. J. Am. Chem. Soci.147, 17598–17611 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Eastman, P. et al. Spice, a dataset of drug-like molecules and peptides for training machine learning potentials. Sci. Data10, 11 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Hodgson, A. & Haq, S. Water adsorption and the wetting of metal surfaces. Surf. Sci. Rep.64, 381–451 (2009). [Google Scholar]
- 75.Nagata, Y., Ohto, T., Backus, E. H. & Bonn, M. Molecular modeling of water interfaces: From molecular spectroscopy to thermodynamics. J. Phys. Chem. B120, 3785–3796 (2016). [DOI] [PubMed] [Google Scholar]
- 76.Mikkelsen, A. E., Schiotz, J., Vegge, T. & Jacobsen, K. W. Is the water/pt (111) interface ordered at room temperature? J. Chem. Phys.155, 224701 (2021). [DOI] [PubMed]
- 77.Perego, S. & Bonati, L. Data efficient machine learning potentials for modeling catalytic reactivity via active learning and enhanced sampling. npj Comput. Mater.10, 291 (2024). [Google Scholar]
- 78.Hospedales, T., Antoniou, A., Micaelli, P. & Storkey, A. Meta-learning in neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intelligence44, 5149–5169 (2021). [DOI] [PubMed] [Google Scholar]
- 79.Falk, J. I. T., Ciliberto, C. & Pontil, M. Implicit kernel meta-learning using kernel integral forms. In Uncertainty in Artificial Intelligence, 652–662 (PMLR, 2022).
- 80.Kelvinius, F. E., Georgiev, D., Toshev, A. P. & Gasteiger, J. Accelerating molecular graph neural networks via knowledge distillation. Adv. Neural Inf. Process. Sys.36, 257611–25792 (2023). [Google Scholar]
- 81.Matin, S. et al. Teacher-student training improves accuracy and efficiency of machine learning interatomic potentials. Digital Discovery, Advance Article (2025).
- 82.Gardner, J. L. A. et al. Distillation of atomistic foundation models across architectures and chemical domains. arXiv.org arXiv: 2506.10956 (2025).
- 83.Amin, I., Raja, S. & Krishnapriyan, A. Towards fast, specialized machine learning force fields: Distilling foundation models via energy hessians. The Thirteenth International Conference on Learning Representations (2025).
- 84.Sipka, M., Erlebach, A. & Grajciar, L. Constructing collective variables using invariant learned representations. J. Chem. Theory Comput.19, 887–901 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Pengmei, Z., Lorpaiboon, C., Guo, S. C., Weare, J. & Dinner, A. R. Using pretrained graph neural networks with token mixers as geometric featurizers for conformational dynamics. J. Chem. Phys.162, 044107 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Trizio, E., Rizzi, A., Piaggi, P. M., Invernizzi, M. & Bonati, L. Advanced simulations with plumed: Opes and machine learning collective variables. arXiv preprint arXiv:2410.18019 (2024).
- 87.Rudin, W.Fourier analysis on groups (Courier Dover Publications, 2017).
- 88.Kresse, G. & Hafner, J. Ab initio molecular dynamics for liquid metals. Phys. Rev. B47, 558 (1993). [DOI] [PubMed] [Google Scholar]
- 89.Kresse, G. & Hafner, J. Ab initio molecular-dynamics simulation of the liquid-metal–amorphous-semiconductor transition in germanium. Phys. Rev. B49, 14251 (1994). [DOI] [PubMed] [Google Scholar]
- 90.Kresse, G. & Joubert, D. From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. b59, 1758 (1999). [Google Scholar]
- 91.Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B54, 11169 (1996). [DOI] [PubMed] [Google Scholar]
- 92.Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Computational Mater. Sci.6, 15–50 (1996). [DOI] [PubMed] [Google Scholar]
- 93.Kresse, G. & Hafner, J. Norm-conserving and ultrasoft pseudopotentials for first-row and transition elements. J. Phys.: Condens. Matter6, 8245 (1994). [Google Scholar]
- 94.Paszke, A. Pytorch: An imperative style, high-performance deep learning library. Adv. neural inf. process. sys.32, (2019).
Data Availability Statement
The data used to train the interatomic potentials were obtained from https://archive.materialscloud.org/record/2024.48, https://zenodo.org/records/10723405, and https://figshare.com/articles/dataset/Dataset_and_training_files_for_Is_the_water_Pt_111_interface_ordered_at_room_temperature_/14791755?file=29141586.
Our implementation of franken is open source and available at https://github.com/CSML-IIT-UCL/franken. The software can be installed via pip (pip install franken). The documentation, available at https://franken.readthedocs.io/, includes a tutorial that can be run on Google Colab to train the model and deploy it for molecular dynamics. The implementation is based on PyTorch94. To maximize performance and minimize the memory footprint, we wrote custom CUDA kernels for the covariance computation and support multi-GPU training through torch.distributed. Currently, MACE and SevenNet backbones are supported. franken potentials are interfaced with the Atomic Simulation Environment (ASE)64 and, in the case of the MACE backbones, also with LAMMPS65.