J Chem Inf Model. 2023 Nov 10;63(22):6998–7010. doi: 10.1021/acs.jcim.3c01030

Formulation Graphs for Mapping Structure-Composition of Battery Electrolytes to Device Performance

Vidushi Sharma 1,*, Maxwell Giammona 1, Dmitry Zubarev 1, Andy Tek 1, Khanh Nugyuen 1, Linda Sundberg 1, Daniele Congiu 1, Young-Hye La 1,*
PMCID: PMC10685446  PMID: 37948621

Abstract


Advanced computational methods are being actively sought to address the challenges associated with the discovery and development of new combinatorial materials, such as formulations. A widely adopted approach involves domain-informed high-throughput screening of individual components that can be combined to form a formulation. This accelerates the discovery of new compounds for a target application but still leaves identifying the right “formulation” from the shortlisted chemical space a largely laboratory experiment-driven process. We report a deep learning model, the Formulation Graph Convolution Network (F-GCN), that can map the structure–composition relationship of the formulation constituents to the property of the liquid formulation as a whole. Multiple GCNs assembled in parallel featurize the formulation constituents domain-intuitively on the fly. The resulting molecular descriptors are scaled based on each constituent’s molar percentage in the formulation and then integrated into a combined formulation descriptor that represents the complete formulation to an external learning architecture. The use case of the proposed formulation learning model is demonstrated for battery electrolytes by training and testing it on two exemplary data sets relating electrolyte formulations to battery performance: one data set is sourced from the literature on Li/Cu half-cells, while the other is obtained by lab experiments on lithium-iodide full-cell chemistry. The model is shown to predict performance metrics such as Coulombic efficiency (CE) and specific capacity of new electrolyte formulations with the lowest reported errors. The best-performing F-GCN model uses molecular descriptors derived from molecular graphs (GCNs) that are informed with HOMO–LUMO and electric moment properties of the molecules using a knowledge transfer technique.

1. Introduction

Machine Learning (ML) methods have transformed the way materials scientists approach and survey material design spaces. Driven by the rise of high-throughput electronic structure calculation-derived data sets,1–3 ML methods now offer fast and accurate prediction of the physicochemical properties of molecules,4–6 solid-state materials,7,8 and combinatorial spaces such as interfaces.9,10 In recent years, generative artificial intelligence (AI) has taken a further leap, hypothesizing novel molecules with well-defined functionalities for different applications.11–13 However, the successful application of any discovered novel molecule at the device level is not guaranteed, as the performance of a device goes beyond the sum of the properties of its individual components and relies mostly on the interplay of complex interactions between all its constituent components. There exists a gap between the discovery of new molecules with generative AI and their actual application, which can be bridged with a machine learning or deep learning method representing mixed material systems (Figure 1).

Figure 1.


Learning models for mixed material systems, such as formulations, bridge the gap between the discovery of new molecules and their actual application in the device.

Liquid formulations are one example of such mixed material systems and are a big part of many industrial sectors, including pharmaceuticals, automotive materials, food science, coatings, and personal care products.14–16 So far, high-throughput screening has been successful in accelerating the search for new individual compounds in multiconstituent systems.17,18 Yet, current methods fall short of directing the complete design of new material formulations. “Data” is a major prerequisite before scientists can leverage data-driven methods for formulation spaces. Despite advances in material simulation techniques, it remains computationally demanding to simulate the properties of whole liquid formulations. The viscosity of mixtures, salt solubilities in mixed solvents, and the density of liquid mixtures are a few well-known open problems that remain challenging to solve with simulations.19–23 Therefore, experimental data continue to be the major fuel for the ML engines used in formulation-based predictions.24,25 Gathering high-quality formulation data from experimental procedures to train ML models is a laborious process, especially when the problem statements cross over to device-level applications, as in the case of battery electrolytes. The discovery and optimization of battery electrolytes is of paramount importance to climate and sustainability technologies, as it will accelerate decarbonization across all economic sectors.
Liquid electrolytes in modern energy storage devices typically involve one or more organic solvents and one or more salt additives.26 The formulation of the constituent salts and solvents in an electrolyte has been shown to have significant impacts on many cell performance outcomes, such as capacity retention, rate performance, and cycle life.27–29 A straightforward approach to predicting the performance metrics of formulated products such as battery electrolytes is through supervised learning models like regression30–32 that take selected features of the constituent materials as input.33,34 These input features can be domain-intuitive properties of electrolyte constituents, such as redox potentials, dielectric constants, ionic conductivity, solubility, and viscosity. This process requires intelligent input feature selection and reliable methods of calculating the features,35 thus raising the critical issue of error transmission from the methods estimating the input features to the outcomes of the ML model predicting formulation properties. In the latest attempt by Kim et al.,33 a list of elemental composition features (primarily oxygen and fluorine) is consciously defined to predict the Coulombic efficiency (CE) of lithium metal battery electrolytes using regression models trained on empirical Li/Cu half-cell data. Though the study is an excellent example of a data-driven electrolyte design strategy, the method may not generalize to other formulation problems, as it skims over crucial structural and compositional information about the formulation constituents.

In the present work, we propose a formulation graph convolution network (F-GCN) model to predict properties of formulated products, such as battery electrolytes, based solely on the constituents’ molecular structures and compositions. Graphs present a natural framework for characterizing the attributes of materials. They offer feature engineering on the fly by learning from the structure, hence bypassing additional feature engineering steps, streamlining uncertainty propagation, and accounting directly for the geometric and structural foundations of the interactions in molecular systems. Molecular systems are represented as graphs to the learning algorithm of graph convolution networks (GCNs),36 which falls under the umbrella of graph neural networks (GNNs).37 GCNs have proven effective for molecular-level predictions and are used widely to map molecular fingerprints,36,38 study targeted interactions between two compounds,39,40 predict molecular properties,41–44 learn chemical reaction networks,45 and map atom networks in crystals.46,47 Graph models can be further extended to a global landscape either by incorporating dense layers or by structuring an external graph called a dual graph. Dual graphs successfully map chemical interactions in dense chemical spaces such as crystalline materials45,48 and therefore remain suited for dense solid-state systems. The framework of the proposed F-GCN model maps the structure of the formulants to the outcome metric based on their respective molar compositions.
The framework simplifies the process of input feature selection by using single-line molecular identifiers (SMILES) along with molar compositions as input, reduces the requirement for a large formulation data set by incorporating a two-stage learning process, opens the possibility of imparting domain knowledge to a deep learning model for application-based customization, and, most importantly, defines the concept of a formulation descriptor, i.e., a vector representation of liquid formulations for machine learning models. We use the Li/Cu half-cell data set from the study by Kim et al.,33 containing a total of 160 electrolytes, to benchmark the proposed methodology. Next, the model is demonstrated to predict the performance of electrolytes for the more intricate system of a lithium-iodide (Li–I) full battery, after training on variable electrolyte formulation data from the same system.

2. Formulation Graph Convolution Networks (F-GCNs)

To map the performance of a formulated material, such as a battery electrolyte, it is essential to take the composition of the constituent materials into account. Figure 2 illustrates the architecture of the F-GCN model used in this work to map the structure and composition of electrolyte formulants to the measured performance outcomes of the devices that use those formulations. The model comprises two learning architectures: graph convolution networks (GCNs) and deep neural networks (DNNs). The model takes SMILES (Simplified Molecular Input Line Entry System)49 of the electrolyte components as input along with their molar composition as a percentage of the total electrolyte formulation (denoted as Molar % in Figure 2). Formulant SMILES are converted to their molecular conformations using the RDKit package50 and then transformed into molecular graphs for the GCNs. A molecular graph contains N nodes, each represented by fa features, where N is the number of atoms in the molecule and fa is a numeric vector of size 100. A binary one-hot encoded matrix is used as fa for each node, where a nonzero entry (the atom’s electronegativity in the present work) is located at the nth position, with n being the atomic number of the atom (see Supporting Information SI-1). The vector size of fa is set to 100, assuming that the electrolyte formulants in the data sets will not contain any atom with an atomic number larger than 100. A symmetric adjacency matrix A designates the connections between the nodes within the graph, including the bond types.43 GCNs51 are analogous to the convolution neural networks (CNNs) used for images,52 where each layer convolves the previous layer’s features and produces a new set of features combining information from the neighboring pixels. GCNs extend this concept to graphs by performing convolution functions on each node until the neighborhood of the entire graph is represented in the feature set.
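The node-feature and adjacency construction described above can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: atomic numbers and Pauling electronegativities are supplied directly by the caller (the paper derives molecular graphs via RDKit), and placing the electronegativity at index n − 1 is an assumed 0-based indexing convention.

```python
F_SIZE = 100  # fixed length of the node feature vector fa (assumes atomic number <= 100)

def node_feature(atomic_number, electronegativity):
    """One-hot style feature vector: the atom's electronegativity sits at the
    position given by its atomic number (0-based here; an assumed convention)."""
    f = [0.0] * F_SIZE
    f[atomic_number - 1] = electronegativity
    return f

def molecular_graph(atoms, bonds):
    """atoms: list of (atomic number, electronegativity) tuples.
    bonds: list of (i, j, bond order) tuples.
    Returns the node feature matrix and a symmetric adjacency matrix A
    whose entries encode bond types via the bond order."""
    n = len(atoms)
    H = [node_feature(z, chi) for z, chi in atoms]
    A = [[0.0] * n for _ in range(n)]
    for i, j, order in bonds:
        A[i][j] = A[j][i] = order  # symmetric adjacency
    return H, A

# Example: water (O bonded to two H atoms), standing in for an RDKit-derived graph
H, A = molecular_graph(
    atoms=[(8, 3.44), (1, 2.20), (1, 2.20)],
    bonds=[(0, 1, 1.0), (0, 2, 1.0)],
)
```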
GCNs have previously been used as quantitative structure–activity relationship (QSAR) models due to their ability to extract trainable features from unstructured graphs without any prior feature engineering step.53–56 F-GCN uses GCNs for feature extraction from formulant structures into the molecular descriptors shown in red in Figure 2(c). Convolution operations are performed on each molecular graph by hidden convolution layers as per Kipf and Welling’s theory,51 to modify the features of the nodes (fa). During convolution, the model learns the chemical neighborhood of each atom and updates the respective nodes. As the number of convolutions increases, the chemical neighborhood of each atom expands. After the convolutions, each atom is represented by a modified fa that is informative of the chemical neighborhoods within the molecule. The modified fa of all the atoms comprising a molecule are averaged to derive a graph representation (GR), a 1-D descriptor (vector size = 100) for the whole molecule. This ultimate layer of the GCN model can be assumed to encode important chemical characteristics of the molecule and can be used as a molecular descriptor input for an external learning architecture.57
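One convolution-and-readout step in the Kipf–Welling style, H′ = tanh(ÂHW) with Â = A + I, can be sketched in pure Python. The row normalization and the weights here are illustrative assumptions; they are not the trained model's parameters, and the actual F-GCN runs four such layers inside Keras.

```python
import math

def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gcn_layer(A, H, W):
    """One graph convolution: H' = tanh(A_hat @ H @ W), where A_hat = A + I
    with simple row normalization (an illustrative normalization choice).
    A: n x n adjacency, H: n x f node features, W: f x f' weights."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    for i in range(n):
        s = sum(A_hat[i])
        A_hat[i] = [v / s for v in A_hat[i]]
    AH = matmul(A_hat, H)
    return [[math.tanh(v) for v in row] for row in matmul(AH, W)]

def readout(H):
    """Average node features over all atoms into a single graph representation (GR)."""
    n = len(H)
    return [sum(col) / n for col in zip(*H)]
```

Stacking several `gcn_layer` calls widens each atom's chemical neighborhood before `readout` collapses the node features into the molecular descriptor.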

Figure 2.


Description of the Formulation Graph Convolution Network (F-GCN) model for 3 component formulations: (a) SMILES strings of constituent molecules as input to the F-GCN, (b) Conversion of inputs to molecular graphs for the Graph Convolution Networks (GCNs), (c) Graph representations (GRs) as output from GCNs, also referred to as Molecular descriptors, (d) Molecular descriptors that are scaled based on the respective molar percentage in the formulation, (e) Formulation descriptor (DSfor) representing features of the complete formulation to the external learning network, (f) DNN, the external learning architecture training on formulation data, and (g) Formulation property or label of interest i.e. battery performance for electrolytes.

The distinctive feature of the F-GCN model relative to prior models58–60 is the ability to combine molecular descriptors into a formulation descriptor based on the composition (illustrated in Figure 2(c–e)). Considering the molecular descriptor (GR) to be a saturated representation (100%) of an individual molecule, we can alter the molecular representations in the model by scaling GR with each molecule’s molar percentage in the formulation as per Equation 1

GR′mol1 = ω · GRmol1    (1)

where GRmol1 is the GR matrix of molecule 1, ω is the fraction corresponding to the molecule’s molar percentage in the formulation (ω is 0.50 for 50%), and GR′mol1 is the scaled GR of molecule 1. The scaled GRs (GR′mol) of the molecular constituents are then summed together to create a unified descriptor for the formulation (DSfor) as shown in Equation 2:

DSfor = ∑i=1…j GR′mol,i    (2)

Here, j is the total number of formulants in the formulation. DSfor can be assumed to contain the cumulative features of all formulants. DSfor is input to an external learning architecture that maps it to the final formulation property, i.e., battery performance metrics here. We use dense feed-forward neural networks (DNNs) as the external learning architecture to map complex nonlinear relationships in the formulation space. One may instead choose a simpler learning algorithm, such as linear regression, kernel methods, or a random forest regressor, to map the formulation descriptor (DSfor) to the output label, depending on the intricacy of the problem.
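Equations 1 and 2 amount to a molar-fraction-weighted sum of the per-molecule graph representations. A minimal sketch, with GRs as plain Python lists and a function name of our choosing:

```python
def formulation_descriptor(grs, molar_percents):
    """Scale each molecular descriptor GR by its molar fraction (Eq 1)
    and sum the scaled descriptors into the formulation descriptor DSfor (Eq 2)."""
    assert len(grs) == len(molar_percents)
    d = len(grs[0])
    ds_for = [0.0] * d
    for gr, pct in zip(grs, molar_percents):
        omega = pct / 100.0  # e.g. omega = 0.50 for 50 molar %
        for i in range(d):
            ds_for[i] += omega * gr[i]
    return ds_for
```

For example, two equimolar constituents with descriptors `[1.0, 0.0]` and `[0.0, 1.0]` combine into `[0.5, 0.5]`.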

A Python version of the F-GCN model is developed using the Keras API61 with TensorFlow.62 The model has a customizable count of six parallel GCNs for complex electrolyte formulations that comprise up to 6 molecular constituents. Each GCN takes a molecular graph and performs a set of four convolutions before averaging the modified fa into GR. A nonlinear activation function, tanh, is applied to the output of each convolution layer before passing the new fa to the next layer. The GCN architecture stays consistent in the F-GCN model after optimization (see Supporting Information SI-2 for complete details of the GCN). Meanwhile, the DNN hyperparameters are custom-tuned for different performance metrics. Considering the high degrees of freedom in the model and the small formulation data set derived from experiments, F-GCN is trained with batch size 1 until convergence is reached. The model is trained to convergence using the Adam optimizer63 with a learning rate as small as 0.0001. A detailed description of the F-GCN framework in TensorFlow is given in Supporting Information SI-3.

3. Electrolyte Data Sets

3.1. Li/Cu Half-Cell Data Set from the Literature

The F-GCN model is trained and benchmarked with a data set of Li/Cu half-cell-based electrolyte formulations and their respective Coulombic efficiencies (CE) obtained from ref (33). Kim et al.33 curated the data set from the literature and transformed the CE to a logarithmic format (LCE) that numerically amplifies the change in the output with respect to changes in the electrolytes. CE is the ratio of the discharge capacity to the charge capacity of the battery and is an important metric of battery performance. A battery can suffer CE loss over time due to electrolyte and electrode decomposition. Therefore, electrolyte engineering has become a major strategy for improving CE.

There are 147 electrolyte formulations in the acquired training data set, along with an additional 13 electrolyte formulations that are used to evaluate the performance of the model as test data (available in the Supporting Information). The 147 training formulations each consist of 2 to 6 electrolyte components, described by SMILES. To structurally represent the maximum number of electrolyte components in the model, a set of 6 GCNs is applied in parallel. In electrolyte data points containing fewer than 6 formulants, the remaining formulants are defaulted to water, and their molar composition is set to 0 (dummy featurization). This ensures that the model can handle data points with variability in the formulant count. Figure 3 classifies the training data set based on the number of electrolyte components and presents the distribution of the LCE output in the data set with box plots. In Figure 3(a), it is evident that a major fraction of the data set has 3–4 formulants, while only a small number of formulations in the data set (2/147) contain all 6 components. This implies that some of the graphs (GCNs) in the model learn dummy molecules for most of the training unless the data set is heavily augmented. For a reliable representation of the formulation space, F-GCN must robustly predict the output label despite changes in the ordering of formulants (i.e., solvents and salts) in the data set. This can be achieved through the strategy of knowledge transfer using easily computable simulation data, as discussed in the next section.
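The dummy-featurization step above can be sketched as a simple padding routine. The water SMILES "O" and the fixed width of 6 follow the text; the function and constant names are ours:

```python
MAX_FORMULANTS = 6   # number of parallel GCNs in the model
DUMMY_SMILES = "O"   # water, used as the dummy formulant

def pad_formulation(smiles, molar_percents):
    """Pad a formulation to exactly MAX_FORMULANTS entries so every data point
    feeds the same six parallel GCNs; dummy formulants get 0 molar %."""
    assert len(smiles) == len(molar_percents) <= MAX_FORMULANTS
    n_pad = MAX_FORMULANTS - len(smiles)
    return (smiles + [DUMMY_SMILES] * n_pad,
            molar_percents + [0.0] * n_pad)

# e.g. a two-component electrolyte (ethanol-like and ethylene-carbonate-like SMILES,
# illustrative only) padded with four zero-weight water dummies
s, m = pad_formulation(["CCO", "C1COC(=O)O1"], [30.0, 70.0])
```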

Figure 3.


Analysis of training data from the Li/Cu half-cells. (a) Distribution of electrolyte formulation data points based on the formulation constituent count and LCE values. (b) Box plots showing the distribution of LCE outputs for training data based on the formulation constituent count. The outer whiskers represent the minimum and maximum values, the central line represents the median, and the colored box represents the 25th to 75th percentile of the data. The data points outside the outer whiskers are outliers observed in the data.

3.2. Simulation Data Set

Knowledge Transfer (also called Transfer Learning (TL)) is a general approach in deep learning in which a model trained for one task is reused for another. TL is deployed in deep learning models to overcome data scarcity, to embed domain knowledge, or to improve transferability for broad application. Knowledge transfer is an additional step in the F-GCN workflow to overcome the limitations of a small experimental data set. With a higher number of formulants (GCN = 6), the dimensionality and complexity of the F-GCN model increase, making a data set of 100–150 formulations insufficient for capturing the necessary physical, chemical, and synergistic influences of the formulation on the performance label. To overcome the limitations of scarce data and bring more transferability to the F-GCN model, we pretrain the GCN on a wider data set of molecules so that F-GCN more accurately captures the general features of chemical compounds. Chemical and physical data exist for over 1 billion molecules in open-sourced databases such as ZINC,64 PubChem,65 and QM9,66 which can be used successfully for pretraining structural models such as GCN.67–69 However, training models with this large pool of available open-sourced molecular data requires extensive computational time and a separate, focused development. Herein, we demonstrate the concept by pretraining the GCN with a smaller, in-house quantum chemical data set of 500 molecules consisting of solvents and salt additives commonly found in Li metal batteries. Density Functional Theory (DFT) calculations are performed with the GAMESS software using the B3LYP functional and the 6-311G(d,p) basis set for all 500 molecules to calculate their frontier orbitals (HOMO–LUMO levels) and electric dipole moments (EMs). The molecules and their simulated results are provided in the Supporting Information.
The distribution of the simulation data used for transfer learning is shown in Figure 4. Box plots in Figure 4 indicate the presence of some outliers in the LUMO and EM data. Note that these outliers were not excluded from the training. For knowledge transfer, the GCN is trained on the molecules’ physicochemical property labels (HOMO–LUMO energy levels and electric moment) as illustrated in Figure 5. The GCN architecture used for pretraining is the same as that described as part of F-GCN in Section 2. In the process of pretraining, the GCN learns the effects of local chemical neighborhoods in the molecule on the label. Thereby, molecular descriptors are created that encode molecular information about the targeted property (HOMO–LUMO energy levels and electric moment here) through the process of back-propagation.57 These informed molecular descriptors are used to represent formulants in the F-GCN framework and are subsequently combined into a formulation descriptor.
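In the Keras implementation, reusing the pretrained GCN with nontrainable weights corresponds to setting `trainable = False` on those layers. The mechanics of a frozen-versus-trainable update can be illustrated with a masked gradient step; this is a pure-Python sketch of the idea, not the actual training loop:

```python
def sgd_step(params, grads, lr, trainable):
    """Apply one gradient-descent update, but only to trainable parameters;
    frozen (pretrained) parameters pass through unchanged."""
    return [p - lr * g if t else p
            for p, g, t in zip(params, grads, trainable)]

# Two frozen "pretrained GCN" weights followed by one trainable "DNN head" weight
params = [0.8, -0.3, 0.1]
grads = [0.5, 0.5, 0.5]
updated = sgd_step(params, grads, lr=0.1, trainable=[False, False, True])
```

Only the last parameter moves; the pretrained descriptors the GCN produces therefore stay fixed while the external DNN keeps learning.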

Figure 4.


Box plots showing the distribution of simulation outputs used for creating informed molecular descriptors by using knowledge transfer. (a) Distribution of HOMO–LUMO energy levels for 500 salts and solvents. (b) Distribution of the electric moment of 500 salts and solvents. The simulation data is used for pretraining GCN.

Figure 5.


Knowledge transfer step to create informed molecular descriptors from molecular structures in the F-GCN workflow for electrolyte formulations: (a) GCN is trained on output labels HOMO–LUMO (HL) energy levels and electric moment (EM). (b) Pretrained GCN is used in the F-GCN network, with GCN weights set to nontrainable. GCN outputs uniformly represent formulants in the F-GCN model to external learning architecture that maps them to the overall battery performance.

3.3. Li–I Full-Cell Battery Data Set

A data set totaling 125 electrolyte formulations and their specific capacities is obtained experimentally for the Li–I battery coin cells70 represented in Figure 6(a) by performing cycling tests at 1 mA/cm2. While the theoretical capacity of the battery (211 mAh/g-cathode for the present Li–I battery70) is determined by the redox couple of the cathode and anode materials, the electrolyte plays a key role in maximizing the battery’s practical capacity, along with other performance metrics such as charge rate and cycle life. Especially for lithium–metal batteries based on conversion chemistries, such as the Li–I battery, electrolytes should be formulated carefully to prevent the shuttling of active species and deleterious electrolyte decomposition reactions. Of the 125, 111 electrolyte formulations are used for training the F-GCN, while a random 14 electrolyte formulations are retained as a test data set used to validate the performance of the trained F-GCN model. The box plots in Figure 6(b) show the distribution of capacity labels obtained from experiments for the training and test data sets. Complete details of the data collection procedure, cell assembly, and experimental setup are provided in Supporting Information SI-5 and SI-6. As seen in Figure 6(b), the training data are dominated by electrolyte formulations with either high (above 80 mAh/g) or poor (∼0 mAh/g) specific capacities, yet the average specific capacity noted in the training data is 59 mAh/g. In contrast, the average capacity noted for the test data set (14 electrolyte formulations) is 72 mAh/g. Note that all the cells had a small degree of variability in cathode formulation in terms of carbon additive, binder, and active material loading percentage (45–55 wt %). For the standard deviation in the experimental capacity resulting from cell-to-cell variability in a Li–I full battery coin cell, see Supporting Information SI-7.

Figure 6.


(a) Schematic representation of the Li–I battery cell that is used to collect electrolyte formulation vs specific capacity experimental data set in the laboratory. (b) The box plots depict the distribution of output labels, i.e., specific capacity obtained from experiments that are used for training and evaluating the performance of the F-GCN model. The outer whiskers represent the minimum and maximum values, the central line represents the median, and the colored box represents the 25th to 75th percentile of the data.

4. Results and Discussion

The goal of the F-GCN model is to featurize a formulation based on concerted representations of the molecular constituents and compositions in a mixture. To accomplish this with learning models that have thousands of trainable parameters, a variety of training samples is required. However, obtaining such varied training data is often not feasible, especially when relying on complex empirical procedures involving the manual fabrication and evaluation of individual devices. As a result, models can fail to learn generic concepts in the input space. To accomplish meaningful learning with a small formulation data set, the F-GCN is trained within two domains: learning the chemical underpinnings of molecules and learning the compositional correlations of molecules in formulations with the final performance. To demonstrate the effectiveness of this segmented learning (Figure 5), we train two separate F-GCN models on each of the two electrolyte data sets: the F-GCN with no Knowledge Transfer (no-TL F-GCN) and the F-GCN with pretrained graphs (TL F-GCN). The results from both models are presented in this section.

4.1. LCE Prediction

We train two F-GCN models (no-TL F-GCN and TL F-GCN) using the data set of electrolyte formulations vs LCE sourced from the prior literature.33 A version of the F-GCN model is optimized for predicting LCE by tuning hyperparameters such as the number of hidden layers in the DNN, the node count, and the activation function through a trial-and-error approach. The performance of the different hyperparameters considered is further detailed in Supporting Information SI-8. For hyperparameter tuning, 20% of the data set is split off for validation and testing. The final external learning model (DNN) is a 3-layered perceptron with 25–10–1 nodes per layer, the last being the output. Robust convergence during model training is noted when the rectified linear unit (relu) activation function is applied to the hidden layers of the DNN, while the last layer is connected to the output linearly. It took approximately 5,000 epochs for the model to reach convergence. The performance of the model is evaluated by calculating the mean squared error (MSE) of the predicted outputs (LCE).
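The tuned head amounts to a 100 → 25 → 10 → 1 feed-forward pass with relu on the hidden layers and a linear output. A forward-pass sketch in pure Python with randomly initialized weights (illustrative only; the real model is trained in Keras and its weights are learned):

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def dense(x, W, b, activation=None):
    """One fully connected layer: y = activation(W x + b)."""
    y = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [activation(v) for v in y] if activation else y

def init(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
ds_for = [rng.random() for _ in range(100)]  # stand-in formulation descriptor DSfor
W1, b1 = init(25, 100, rng)
W2, b2 = init(10, 25, rng)
W3, b3 = init(1, 10, rng)

h = dense(ds_for, W1, b1, relu)  # 100 -> 25, relu hidden layer
h = dense(h, W2, b2, relu)       # 25 -> 10, relu hidden layer
lce = dense(h, W3, b3)           # 10 -> 1, linear output (predicted LCE)
```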

Figure 7 summarizes the overall performance of the two F-GCN models (no-TL F-GCN and TL F-GCN). The trained models are tested on 13 test electrolyte formulations whose respective experiment-derived results are reported by Kim et al.33 Figure 7(a) compares the loss (MSE) calculated for the training and validation data during the learning process of the F-GCN model (with the best-developed network and no transfer learning) over 5000 epochs. Both loss curves slope toward zero, with the training and validation MSE being 0.0038 and 0.1704, respectively. This demonstrates a good learning curve for the model. The trained model is then used to predict LCE for the 13 unseen electrolyte formulations in the test data set. The parity plot in Figure 7(b) indicates the correlation between the LCE values predicted by the no-TL F-GCN model and the experimental results. The straight red line in the plots maps the benchmark experimental values, and the scatter points denote the predicted LCE values. The deviation of the scatter points from the straight line indicates the error in the predicted LCE. The MSE for the plot, shown in the figure, is 0.61.

Figure 7.


(a) Curves for the loss function (MSE) calculated for training and validation data during the learning of formulations by the F-GCN model (no TL) for 5000 epochs. (b) Parity plot showing predicted LCE values from the F-GCN where no Knowledge Transfer has been implemented (no-TL F-GCN) as scatter points with respect to the benchmark experimentally derived LCE values mapped on the y = x axis. (c) Curves for the loss function (MSE) calculated for training and validation data during the learning of formulations by the F-GCN model using the pretrained GCN (TL F-GCN) for 5000 epochs. (d) Parity plot showing predicted LCE values from the F-GCN using the GCN pretrained on energy levels and electric moment (TL F-GCN) as scatter points with respect to the benchmark experimentally derived LCE values mapped on the y = x axis.

The performance of the model improved with the prelearned GCN networks (TL F-GCN model). The validation loss curve for the TL F-GCN model in Figure 7(c) slopes downward to an MSE of 0.08. This value is slightly lower than that of the no-TL F-GCN model (validation loss MSE = 0.17). Additionally, the LCE predictions on the unseen test electrolytes demonstrate significant improvement in Figure 7(d). The predicted LCE values in Figure 7(d) closely approximate the actual experimental values with an MSE of 0.15, which outperforms the MSE of 0.33 reported by Kim et al.33 using regression models on the same data set. This result illustrates the superior scope and promise of the proposed F-GCN model in predicting the properties and performance of formulations compared to alternative methods based on feature selection and regression. Transfer learning improves the accuracy of the predictive deep learning model despite its being trained on a small data set. This reduced dependence on the quantity of data for deep learning is consistent with a previous study by Karpov et al.71 The predictive capability of a formulation model can be further enhanced by using large-scale pretrained foundational models69 in lieu of the presently pretrained GCN.

The major advantage of the F-GCN is its ability to featurize molecules into a molecular descriptor from SMILES on the fly, thereby obviating the input feature selection process. Furthermore, by compartmentalizing the learning of the model with specialized smaller data sets, we create the scope to impart selective domain knowledge to the model, making the presented approach a promising solution to formulation problems across broad industry sectors. For instance, we pretrain the GCN on the HOMO–LUMO (HL) energy levels of electrolyte solvents and salts to encode the graph representation (GR) (explained in Section 2) with molecular information on these quantum chemical properties through the process of back-propagation.57 Frontier orbitals (HL) are among the most critical properties of molecules, with far-reaching consequences for their organic and inorganic reactivity. As per frontier molecular orbital theory, molecules with lower LUMO levels undergo reduction more easily, while molecules with higher HOMO levels are oxidized first. Thus, the HOMO–LUMO levels of the electrolyte components present an intrinsic window for the working voltage range of a battery and determine the stability of the electrolytes over the electrodes, along with their subsequent chemical reactions.72–74 Having such a generic impact across all battery systems, HL have been widely adopted as an important criterion for screening solvents and salts in the development of new battery electrolytes.18,75–79 Based on this general importance of HL for battery electrolytes, we pretrain the GCN model with HL before utilizing it in the F-GCN framework for predicting electrolyte performance. However, if the electrolyte formulation data set targets a specialized concept such as high-entropy electrolytes or high-voltage cathode stabilization,73,80 one may pretrain the GCNs on a targeted simulated property such as Li+ solvation or oxidation potential.
The GCN can also be trained to encode information about more than one molecular property. We demonstrate this concept by using another physicochemical attribute, the electric moment (EM), alongside HL as a label in GCN pretraining. The EM measures the separation of positive and negative charges in a system, indicating the molecule's overall polarity. Recent studies have determined that the ratio of polar to nonpolar constituents in an electrolyte formulation strongly influences Li+ transport across the electrolyte and its subsequent desolvation at the electrode interface.81,82 By pretraining the GCN with more than one molecular property, we impart generalizability toward different categories of electrolyte formulations and enable explicit learning of the chemical character of a compound, thereby compensating for the lack of large molecular data sets during pretraining, as in the present case. Incorporation of additional attribute knowledge, however, may or may not improve the overall accuracy of the model: for the present data sets, the errors observed from TL F-GCN models pretrained with HL alone and with HL-EM were similar.
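Multi-property pretraining amounts to regressing a shared representation onto a two-column label matrix. A minimal sketch, assuming hypothetical 8-dimensional graph representations (stand-ins for pooled GCN outputs) and synthetic HL/EM labels:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pretraining set: 50 molecules whose graphs have already been
# pooled into 8-dimensional representations (stand-ins for the GCN output).
G = rng.normal(size=(50, 8))
true_W = rng.normal(size=(8, 2))
Y = G @ true_W + 0.01 * rng.normal(size=(50, 2))  # columns: [HL gap, EM]

# Multi-target readout trained jointly on both labels, so that one shared
# representation is shaped by more than one physicochemical property.
W = np.zeros((8, 2))
for _ in range(500):
    grad = G.T @ (G @ W - Y) / len(G)  # gradient of the mean squared error
    W -= 0.1 * grad

mse = float(np.mean((G @ W - Y) ** 2))
```

In the actual model the gradient would also flow into the GCN layers, so both labels jointly inform the molecular descriptor rather than just the readout.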

4.2. Li–I Battery Capacity Prediction

A full-cell battery system with a functional cathode and anode has much more complex dependencies than a half-cell system. Currently, most data-driven approaches for battery systems are developed and tested with simplified data sets curated from different literature sources or high-throughput experimentation.33,34,83 We assess the capability of the F-GCN model to predict another important battery performance metric, the specific capacity, based on variations in the electrolyte formulations. The electrolyte formulation vs specific capacity data set was collected during our development of the Oxygen-Assisted Lithium Iodine (OALI) battery, a heavy-metal-free next-generation battery that promises high power and fast-charging capabilities by forming a robust solid electrolyte interphase (SEI).70 The compiled data set effectively captures a realistic range of electrolytes for next-generation lithium-metal batteries based on conversion chemistries.

A separate version of the F-GCN model is optimized for predicting battery capacities (in mAh/g). Hyperparameters of the DNN are retuned for predicting the new output matrix (see Supporting Information SI-9). The optimized F-GCN has a 4-layer perceptron as the external learning architecture, with layers of 1000–100–10 nodes, the last of which is connected to an output label. The F-GCN took approximately 30,000 epochs to converge. The performance of the model is evaluated by calculating the root mean squared error (RMSE) of the predicted capacities with respect to the experimentally measured capacities. It is worth noting that the nodal architecture of the DNN became much more extensive for predicting the specific capacity retained by a full battery than for the simpler case of predicting CE in the Li/Cu half-cell in Section 4.1. This reflects the fact that the relationship between the electrolyte and battery performance becomes far more intricate in a full battery cell: the model must account for electrolyte-electrode interactions to accurately predict overall battery performance. The F-GCN model is applied to the electrolyte formulation problem under the assumption that the remaining battery components, including the electrodes, current collector, separator, and the volume of electrolyte, are largely consistent across the data set. The experimentally collected electrolyte formulation vs specific capacity data set implicitly encodes electrolyte-electrode interactions in the performance metric. For instance, capacity retention for electrolytes with high concentrations of polar protic solvents is consistently low, as these solvents react deleteriously with the electrodes. Similar trends in the interactions of electrolyte components with the electrode are transmitted to the performance metric used for learning here, provided that the electrode attributes are consistent across the data set.
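The core data flow, scaling each constituent's descriptor by its molar fraction, concatenating into a formulation descriptor, and passing it through the perceptron, can be sketched as follows (all sizes are illustrative assumptions: 64-dimensional descriptors, six constituents mirroring the six parallel GCN branches, and an input width of 384; the real descriptors come from the pretrained GCNs):

```python
import numpy as np

def formulation_descriptor(mol_descriptors, mol_percents):
    """Scale each constituent's descriptor by its molar fraction, then concatenate."""
    fracs = np.asarray(mol_percents, dtype=float) / 100.0
    return np.concatenate([f * d for f, d in zip(fracs, mol_descriptors)])

def mlp_forward(x, weights):
    """Plain multilayer perceptron: ReLU hidden layers, linear scalar output."""
    for W in weights[:-1]:
        x = np.maximum(0.0, x @ W)
    return float(x @ weights[-1])

rng = np.random.default_rng(0)
# Six constituents (e.g., three solvents and three salts), each represented by
# a hypothetical 64-dim descriptor in place of a GCN graph representation.
descriptors = [rng.normal(size=64) for _ in range(6)]
percents = [55, 30, 10, 2.5, 2, 0.5]      # molar %, summing to 100

x = formulation_descriptor(descriptors, percents)   # 6 * 64 = 384 features
# Layer widths 384 -> 1000 -> 100 -> 10 -> 1, echoing the tuned DNN in the text;
# the weights here are random placeholders, not trained parameters.
sizes = [384, 1000, 100, 10, 1]
weights = [rng.normal(scale=0.05, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
capacity = mlp_forward(x, weights)                  # predicted capacity (arbitrary units)
```

Scaling by molar fraction before concatenation is what lets the same architecture distinguish two formulations that share constituents but differ in composition.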

Figure 8 summarizes the performance of the two F-GCN models (no-TL F-GCN and TL F-GCN) trained with the electrolyte formulation data set derived from the Li–I full cell. The F-GCN model using informed molecular descriptors (TL F-GCN) outperforms the model with no knowledge transfer (no-TL F-GCN), demonstrating a relatively lower RMSE for the predicted battery capacities. The predicted values from both models are plotted against the experimental capacities in the parity plots shown in Figure 8, with the RMSE value for each plot indicated in the figure. The predicted capacities in Figure 8(b) closely track the actual experimental values, with an RMSE of 20.46 mAh/g for the TL F-GCN. Beyond predictive precision, the model successfully separates high-performing electrolytes from low-performing ones. An RMSE of 20–21 mAh/g in the capacity prediction is expected for the current data set, as experimental battery capacity values carry some uncertainty due to unintended variations during the battery assembly and material preparation processes. The standard deviation (cell-to-cell variation) of the experimental specific capacities is in the range of 15–30 mAh/g (SI-7) when the same electrolyte formulation is used in an OALI battery. This uncertainty in the experimental values originates mainly from slight variations in the active material loading of the cathode, as indicated in Section 3.3 and the Supporting Information. Thus, the model predicts the performance of the electrolyte formulations with errors that are within the bounds of experimental cell-to-cell variability.
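The acceptance argument above, that the model's error is acceptable because it sits inside the measured cell-to-cell scatter, reduces to a simple check. A sketch with entirely hypothetical capacity values (not the paper's data):

```python
import numpy as np

# Hypothetical predicted vs measured capacities (mAh/g), for illustration only.
measured  = np.array([120.0, 95.0, 40.0, 5.0, 110.0])
predicted = np.array([118.0, 101.0, 55.0, 12.0, 104.0])

rmse = float(np.sqrt(np.mean((predicted - measured) ** 2)))

# The paper's criterion: the prediction error should fall within the
# 15-30 mAh/g cell-to-cell variability measured for repeated OALI cells (SI-7).
within_experimental_band = rmse <= 30.0
```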

Figure 8.

Figure 8

Parity plots showing predicted battery capacities (in mAh/g) as scatterplots against the benchmark experimentally derived capacity values, with the y = x line indicating perfect agreement. (a) F-GCN with no knowledge transfer (no-TL F-GCN); (b) F-GCN with HOMO–LUMO (HL) and electric moment (EM) informed molecular descriptors. RMSE values for both predicted sets are indicated in the respective figures.

4.3. Li–I Data Variance and Predictive Uncertainties

It is evident in Figure 8(b) that the TL F-GCN model does an excellent job of distinguishing high-performing from low-performing electrolyte formulations despite the experimental uncertainty and the small training data set. However, notable errors are observed for electrolyte formulations in the low-to-mid capacity range (30–60 mAh/g). This indicates biased training data that provide insufficient coverage of the full range of battery capacity, particularly the low-to-mid capacity range. The limitations of the current data set are examined in Figure 9, which describes the distribution of the training data via heat maps of the solvent types used in the electrolyte formulations (Figure 9(a)) and the constituent compositions (solvent and salt compositions in Figure 9(b,c)). The solvent types are depicted in the heat map with labels 0–20 and are further detailed in the color bar beside the heat map in Figure 9(a). Figure 9(d) maps the corresponding specific capacities of the battery cells. The solvent classification map in Figure 9(a) shows that the primary, highest-concentration solvent (Solvent A) varies mostly among three classes of solvents (cyclic ether, ether nitrile, and heterocyclic acetal). In contrast, significant variation is seen in the category of the cosolvent, Solvent B. The third solvent, Solvent C, is mostly used as an additive, appearing in about 12% of the training data set. Figure 9(b) details the molar percentage of each solvent in its respective formulation. The highest concentrations are noted for Solvent A (45–70 mol %), followed by Solvent B (5–50 mol %) and Solvent C (0–20 mol %). In contrast to the solvent trends in the data set, the three salts in the formulations were fixed: lithium bis(trifluoromethanesulfonyl)imide (LiTFSI) as Salt A, lithium bis(oxalato)borate (LiBOB) as Salt B, and lithium nitrate (LiNO3) as Salt C. Salt C is added to all electrolyte formulations with no significant concentration variation (∼2 mol %).
Meanwhile, Salt A and Salt B are present in 40% of the training data set, with their concentrations varying from 2.5 to 7 mol %. Owing to these limited variations in the electrolyte formulations, the battery capacities of the training data are predominantly distributed either in a higher range (>80 mAh/g, well-performing) or close to zero (nonperforming), as shown in Figure 9(d). Consequently, this leads to relatively poor predictions in the low-to-mid capacity range. While there is reasonable diversity in the solvent categories of the training data, as summarized in Figure 9, the scope of the model could be further improved if the training data were more inclusive of different salts and better-diversified concentration ranges. This highlights the need for an ideal design of experiments (DOE) that enables data set sampling suited to a complex, multivariable materials problem such as optimizing the formulation of a materials mixture.

Figure 9.

Figure 9

Variations in the electrolyte formulation training data set. Labels on the y-axis are identical across the plots and encode the items in the data set. (a) Heat map depicting variations in the types of three electrolyte constituent solvents used in the training data. (b) Heat map depicting variations in molar percentage of 3 constituent solvents in the training data. (c) Heat map depicting variations in molar percentage of 3 constituent salts in the training data. (d) Heat map depicting battery capacities obtained from the experiments corresponding to the electrolyte formulations in the training data.

Lastly, we determine the uncertainty in the model predictions that arises from inherent randomness in the input observations. Being independent of the model's parameters, this uncertainty cannot be reduced by increasing the amount of training data.84 The uncertainty in the model's predictions is evaluated by training TL F-GCN ensembles using the popular bootstrapping strategy.85,86 The ensembles use multiple F-GCNs that have randomly initialized weights (except for the pretrained GCN weights) and are trained on different bootstrap samples of the original training data. Since the original Li–I electrolyte training data set (111 formulations) is too small for bootstrapping, an augmented data set of electrolyte formulations (see SI-5 for details) is bootstrapped into 4 random subsamples for training the F-GCNs in the ensemble. We note that random initialization of the DNN parameters, random shuffling of the training data points, and bootstrapping are sufficient for observing predictive uncertainties. The TL F-GCN ensembles are trained following the same procedure described in Section 4.2 and are then used to predict the capacities of 14 test electrolyte formulations. The average capacity predictions from the ensembles are plotted in Figure 10 along with the standard deviations of the predictions for each of the 14 test electrolytes. Large uncertainties are observed in the predictions for electrolytes in the low-to-mid capacity range (30–60 mAh/g; test electrolyte samples 3, 5, and 13 in Figure 10), which can be attributed to the limited coverage of such data in the training set, as described previously. The TL F-GCN model employing the HL-EM descriptor exhibits outstanding prediction accuracy for about 65% of the test data points (9 of 14), with an approximate standard deviation of 10 mAh/g.
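The bootstrap-ensemble procedure can be sketched with simple least-squares member models standing in for the F-GCNs (the data, the 5-feature input, and the linear surrogate are all illustrative assumptions; only the counts, 111 training formulations, 4 subsamples, 14 test formulations, follow the text):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy surrogate data: 111 "formulations" with 5 features and a capacity label.
X = rng.normal(size=(111, 5))
beta = rng.normal(size=5)
y = X @ beta + 5.0 * rng.normal(size=111)

def fit_member(Xb, yb):
    """Least-squares member model (stand-in for one F-GCN in the ensemble)."""
    return np.linalg.lstsq(Xb, yb, rcond=None)[0]

# Bootstrapping: each ensemble member sees a resampled-with-replacement copy
# of the training data; disagreement between members estimates uncertainty.
members = []
for _ in range(4):                    # 4 bootstrap subsamples, as in the text
    idx = rng.integers(0, len(X), size=len(X))
    members.append(fit_member(X[idx], y[idx]))

X_test = rng.normal(size=(14, 5))     # 14 test formulations
preds = np.stack([X_test @ m for m in members])   # shape (4, 14)
mean_pred = preds.mean(axis=0)        # ensemble average, as plotted in Figure 10
std_pred = preds.std(axis=0)          # error bars (STD) per test formulation
```

Test points whose neighborhoods are sparsely covered by the training data tend to produce larger member disagreement, which is exactly the behavior reported for the 30–60 mAh/g range.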

Figure 10.

Figure 10

Average predicted capacities (in mAh/g) from TL F-GCN ensembles for the 14 test electrolyte formulations. The error bars denote the standard deviation (STD) of the predictions from the ensembles. The mean STD for the 14 electrolyte formulations is indicated in the figure.

4.4. Future Scope

With advancements in computation and data-driven approaches, there is considerable optimism about the scope of these techniques for driving the discovery and optimization of new materials. However, as we dive deeper into solution-driven applications of these techniques, a significant gap emerges between the premise and actual practice. While simulation techniques struggle with shortcomings associated with computational requirements and theoretical assumptions, data-driven methods such as machine learning come at a high price owing to their prerequisite of structured, high-quality materials data. To address the most pressing predictive and discovery challenges in materials, a simulation-experiment-AI synergistic approach needs to be developed in which an expansive simulation data set is streamlined with limited experimental data to meet the 'data' and 'knowledge' requisites of an AI model, thereby making it a more reliable in silico solution. The F-GCN aims to incorporate these requisites to solve multidimensional materials problems, as demonstrated for battery electrolyte formulations. What makes this model generalizable to formulation problems across different applications is the core concept of featurizing the structures of the molecular constituents and their respective compositions into a formulation descriptor that is related to the output. This formulation descriptor could be combined with any learning algorithm to perform a variety of downstream tasks, such as prediction (as demonstrated in the present work) and composition optimization. The proposed framework can be used to find the right composition of formulation constituents, especially when the targeted chemical design space is vast (a large number of formulation constituents) and might otherwise require a brute-force experimental approach.18 This could be achieved by fixing the formulation constituents and evaluating the predicted performance over a range of constituent compositions.
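The composition-optimization use case, fix the constituents and sweep their compositions through a trained predictor, can be sketched as a grid search over the composition simplex. The `surrogate_capacity` function below is a hypothetical stand-in for a trained F-GCN, and its assumed optimum composition is invented for the demo:

```python
import numpy as np

def surrogate_capacity(fracs):
    """Hypothetical trained predictor standing in for the F-GCN: maps a
    composition vector (molar fractions of fixed constituents) to capacity.
    The 'optimum' composition here is an assumption made for this sketch."""
    target = np.array([0.55, 0.30, 0.10, 0.05])
    return 100.0 - 400.0 * float(np.sum((fracs - target) ** 2))

# Fix four constituents and sweep their compositions on a coarse simplex grid,
# keeping only candidates whose molar fractions sum to one.
best, best_fracs = -np.inf, None
grid = np.arange(0.0, 1.05, 0.05)
for a in grid:
    for b in grid:
        for c in grid:
            d = 1.0 - a - b - c
            if d < -1e-9:
                continue                      # infeasible: fractions exceed 100%
            fracs = np.array([a, b, c, max(d, 0.0)])
            cap = surrogate_capacity(fracs)
            if cap > best:
                best, best_fracs = cap, fracs
```

In practice the brute-force sweep would be replaced by a smarter search (e.g., Bayesian optimization) once the composition space grows, but the principle, formulation descriptor in, predicted performance out, is the same.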
A good example would be the integration of the F-GCN into an accelerated electrolyte discovery workflow in which electrolyte solvents and salts have been shortlisted by virtual high-throughput screening87 and require further compositional fine-tuning based on existing data for the battery. Because of the diversity of battery chemistries in the field, electrolyte formulation data sets are mostly nongeneralizable: for each set of cathodes and anodes, electrolyte formulations require specialized development and are highly guarded trade secrets. Therefore, a framework that can accelerate electrolyte formulation design with limited experimental effort would be most welcome in the field. As demonstrated for the OALI battery, the initial battery cell tests used to develop the battery chemistry can be utilized for learning and then for driving the optimization of new electrolyte formulations, reducing the number of actual electrolyte optimization experiments.

Because the underlying framework is very general, the F-GCN model can find use in applications beyond electrolytes, especially where formulation data sets are scarce and domain knowledge must be incorporated to enhance accuracy. By pretraining molecular structures on labels identified as critical to the subject matter, we couple the learning of molecular geometries with domain knowledge transfer; hence, the resulting molecular descriptors can be made unique to the problem system. We demonstrate this proof of concept by pretraining the molecular structure model with a small, specialized simulation data set containing the targeted electrolyte molecules. The accuracy of the formulation model could be further improved by describing molecules with pretrained foundational models such as MoLFormer69 in place of the GCN.

5. Conclusions

In conclusion, we propose a graph-based deep learning model, the F-GCN, that maps structure-composition-performance relationships within the formulation space. The F-GCN model consists of six graph networks for the formulation constituents and an external learning architecture. The model is trained in two stages: the learning of molecular graphs and the learning of formulations. The proposed approach is tested with two different electrolyte formulation vs performance data sets: Li/Cu half-cell data sourced from the literature and Li–I full-cell data acquired experimentally in our lab. The TL F-GCN model trained and validated with the Li/Cu half-cell data exhibited enhanced accuracy in predicting the logarithmic Coulombic efficiency, with an MSE of 0.15, surpassing the performance of previous ML approaches in the literature, including regression and kernel methods.33 The TL F-GCN model also demonstrated outstanding predictive performance for the battery capacities of Li–I full cells, achieving an RMSE of 20 mAh/g and an approximate standard deviation of 10 mAh/g within the ensembles. Given the inherent cell-to-cell variations resulting from experimental cell assembly and operation of these conversion batteries, the accuracy of the F-GCN prediction model depends more on the quality of the training data than on its quantity. The proposed model attempts to simplify the data-driven discovery and optimization of mixed materials such as formulations, where the data pose unstructured relationships.

Data Availability Statement

The experimental data and simulation data used in this paper can be found along with the Supporting Information. Code used in the study is presented in the Supporting Information. The simulation workflows and tools by Simulation Toolkit for Scientific Discovery (ST4SD) are used to generate simulation data in this study which are now available to the open-source community at https://st4sd.github.io/overview/. Restrictions may apply to the availability of the simulation codes which are protected under the software license.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.3c01030.

  • Molecular graph representation; Graph convolution networks (GCNs); F-GCNs; Li/Cu half-cell electrolyte data set; Li–I full-cell data set; Battery experiment details; Cell-to-cell variability in experiments; Hyperparameter tuning for LCE prediction; Hyperparameter tuning for capacity prediction (PDF)

  • Excel file 1 (XLSX)

  • Excel file 2 (XLSX)

  • Excel file 3 (XLSX)

Author Contributions

V.S., M.G., Y.H.L., and D.Z. devised the idea of the project. V.S. implemented the idea and performed primary coding, model training, validation, and manuscript preparation. M.G., A.T., K.N., and L.S. performed the battery experiments to collect the training and testing data for the study. V.S. collected the simulation data using open-sourced DFT simulation workflows on Simulation Toolkit for Scientific Discovery (ST4SD). D.C. curated data from literature sources for benchmarking.

The authors declare no competing financial interest.

Supplementary Material

ci3c01030_si_001.pdf (1.2MB, pdf)
ci3c01030_si_002.xlsx (34.1KB, xlsx)
ci3c01030_si_003.xlsx (20.4KB, xlsx)
ci3c01030_si_004.xlsx (28.1KB, xlsx)

References

  1. Blum L. C.; Reymond J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 2009, 131 (25), 8732–8733. 10.1021/ja902302h. [DOI] [PubMed] [Google Scholar]
  2. Ramakrishnan R.; Dral P. O.; Rupp M.; Von Lilienfeld O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1 (1), 140022. 10.1038/sdata.2014.22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Ruddigkeit L.; Van Deursen R.; Blum L. C.; Reymond J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 2012, 52 (11), 2864–2875. 10.1021/ci300415d. [DOI] [PubMed] [Google Scholar]
  4. Montavon G.; Rupp M.; Gobre V.; Vazquez-Mayagoitia A.; Hansen K.; Tkatchenko A.; Müller K.-R.; Von Lilienfeld O. A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 2013, 15 (9), 095003. 10.1088/1367-2630/15/9/095003. [DOI] [Google Scholar]
  5. Hansen K.; Biegler F.; Ramakrishnan R.; Pronobis W.; Von Lilienfeld O. A.; Müller K.-R.; Tkatchenko A. Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space. J. Phys. Chem. Lett. 2015, 6 (12), 2326–2331. 10.1021/acs.jpclett.5b00831. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Schütt K. T.; Arbabzadah F.; Chmiela S.; Müller K. R.; Tkatchenko A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 2017, 8 (1), 13890. 10.1038/ncomms13890. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Schütt K. T.; Glawe H.; Brockherde F.; Sanna A.; Müller K.-R.; Gross E. K. How to represent crystal structures for machine learning: Towards fast prediction of electronic properties. Phys. Rev. B 2014, 89 (20), 205118. 10.1103/PhysRevB.89.205118. [DOI] [Google Scholar]
  8. Schütt K. T.; Sauceda H. E.; Kindermans P.-J.; Tkatchenko A.; Müller K.-R. Schnet-a deep learning architecture for molecules and materials. J. Chem. Phys. 2018, 148 (24), 241722. 10.1063/1.5019779. [DOI] [PubMed] [Google Scholar]
  9. Eckhoff M.; Behler J. Insights into lithium manganese oxide-water interfaces using machine learning potentials. J. Chem. Phys. 2021, 155 (24), 244703. 10.1063/5.0073449. [DOI] [PubMed] [Google Scholar]
  10. Ghorbanfekr H.; Behler J. r.; Peeters F. M. Insights into water permeation through hBN nanocapillaries by ab initio machine learning molecular dynamics simulations. J. Phys. Chem. Lett. 2020, 11 (17), 7363–7370. 10.1021/acs.jpclett.0c01739. [DOI] [PubMed] [Google Scholar]
  11. Bian Y.; Xie X.-Q. Generative chemistry: drug discovery with deep learning generative models. J. Mol. Model. 2021, 27, 71. 10.1007/s00894-021-04674-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Gao W.; Coley C. W. The synthesizability of molecules proposed by generative models. J. Chem. Inf. Model. 2020, 60 (12), 5714–5723. 10.1021/acs.jcim.0c00174. [DOI] [PubMed] [Google Scholar]
  13. Bilodeau C.; Jin W.; Jaakkola T.; Barzilay R.; Jensen K. F. Generative models for molecular discovery: Recent advances and challenges. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2022, 12 (5), e1608 10.1002/wcms.1608. [DOI] [Google Scholar]
  14. Conte E.; Gani R.. Chemicals-based formulation design: virtual experimentations. Computer Aided Chemical Engineering; Elsevier: 2011; Vol. 29, pp 1588–1592, 10.1016/B978-0-444-54298-4.50096-9. [DOI] [Google Scholar]
  15. Taifouris M.; Martín M.; Martínez A.; Esquejo N. Challenges in the design of formulated products: multiscale process and product design. Curr. Opin. Chem. Eng. 2020, 27, 1–9. 10.1016/j.coche.2019.10.001. [DOI] [Google Scholar]
  16. Lee C.; Choy K. L.; Chan Y. A knowledge-based ingredient formulation system for chemical product development in the personal care industry. Comput. Chem. Eng. 2014, 65, 40–53. 10.1016/j.compchemeng.2014.03.004. [DOI] [Google Scholar]
  17. Cheng L.; Assary R. S.; Qu X.; Jain A.; Ong S. P.; Rajput N. N.; Persson K.; Curtiss L. A. Accelerating electrolyte discovery for energy storage with high-throughput screening. J. Phys. Chem. Lett. 2015, 6 (2), 283–291. 10.1021/jz502319n. [DOI] [PubMed] [Google Scholar]
  18. Benayad A.; Diddens D.; Heuer A.; Krishnamoorthy A. N.; Maiti M.; Cras F. L.; Legallais M.; Rahmanian F.; Shin Y.; Stein H.; et al. High-throughput experimentation and computational freeway lanes for accelerated battery electrolyte and interface development research. Adv. Energy Mater. 2022, 12 (17), 2102678. 10.1002/aenm.202102678. [DOI] [Google Scholar]
  19. Gupta J.; Nunes C.; Vyas S.; Jonnalagadda S. Prediction of solubility parameters and miscibility of pharmaceutical compounds by molecular dynamics simulations. J. Phys. Chem. B 2011, 115 (9), 2014–2023. 10.1021/jp108540n. [DOI] [PubMed] [Google Scholar]
  20. Ma Y.; Cao Y.; Yang Y.; Li W.; Shi P.; Wang S.; Tang W. Thermodynamic analysis and molecular dynamic simulation of the solubility of vortioxetine hydrobromide in three binary solvent mixtures. J. Mol. Liq. 2018, 272, 676–688. 10.1016/j.molliq.2018.09.130. [DOI] [Google Scholar]
  21. Sha J.; Yang X.; Ji L.; Cao Z.; Niu H.; Wan Y.; Sun R.; He H.; Jiang G.; Li Y.; et al. Solubility determination, model evaluation, Hansen solubility parameter, molecular simulation and thermodynamic properties of benflumetol in four binary solvent mixtures from 278.15 to 323.15 K. J. Mol. Liq. 2021, 333, 115867. 10.1016/j.molliq.2021.115867. [DOI] [Google Scholar]
  22. Aquing M.; Ciotta F.; Creton B.; Féjean C.; Pina A.; Dartiguelongue C.; Trusler J. M.; Vignais R.; Lugo R.; Ungerer P.; et al. Composition analysis and viscosity prediction of complex fuel mixtures using a molecular-based approach. Energy Fuels 2012, 26 (4), 2220–2230. 10.1021/ef300106z. [DOI] [Google Scholar]
  23. Srinivas G.; Mukherjee A.; Bagchi B. Nonideality in the composition dependence of viscosity in binary mixtures. J. Chem. Phys. 2001, 114 (14), 6220–6228. 10.1063/1.1354166. [DOI] [Google Scholar]
  24. Hezave A. Z.; Lashkarbolooki M.; Raeissi S. Using artificial neural network to predict the ternary electrical conductivity of ionic liquid systems. Fluid Ph. Equilib. 2012, 314, 128–133. 10.1016/j.fluid.2011.10.028. [DOI] [Google Scholar]
  25. Fatehi M.-R.; Raeissi S.; Mowla D. Estimation of viscosities of pure ionic liquids using an artificial neural network based on only structural characteristics. J. Mol. Liq. 2017, 227, 309–317. 10.1016/j.molliq.2016.11.133. [DOI] [Google Scholar]
  26. Jie Y.; Ren X.; Cao R.; Cai W.; Jiao S. Advanced liquid electrolytes for rechargeable Li metal batteries. Adv. Funct. Mater. 2020, 30 (25), 1910777. 10.1002/adfm.201910777. [DOI] [Google Scholar]
  27. Yoon H.; Howlett P.; Best A. S.; Forsyth M.; Macfarlane D. R. Fast charge/discharge of Li metal batteries using an ionic liquid electrolyte. J. Electrochem. Soc. 2013, 160 (10), A1629. 10.1149/2.022310jes. [DOI] [Google Scholar]
  28. Yu Z.; Rudnicki P. E.; Zhang Z.; Huang Z.; Celik H.; Oyakhire S. T.; Chen Y.; Kong X.; Kim S. C.; Xiao X.; et al. Rational solvent molecule tuning for high-performance lithium metal battery electrolytes. Nat. Energy 2022, 7 (1), 94–106. 10.1038/s41560-021-00962-y. [DOI] [Google Scholar]
  29. Lu Y.; Tikekar M.; Mohanty R.; Hendrickson K.; Ma L.; Archer L. A. Stable cycling of lithium metal batteries using high transference number electrolytes. Adv. Energy Mater. 2015, 5 (9), 1402073. 10.1002/aenm.201402073. [DOI] [Google Scholar]
  30. Hastie T.; Tibshirani R.; Friedman J. H.; Friedman J. H.. The elements of statistical learning: data mining, inference, and prediction; Springer: 2009. [Google Scholar]
  31. Koppel A.; Pradhan H.; Rajawat K. Consistent online gaussian process regression without the sample complexity bottleneck. Stat. Comput. 2021, 31 (6), 76. 10.1007/s11222-021-10051-5. [DOI] [Google Scholar]
  32. Shterev I. D.; Dunson D. B.; Chan C.; Sempowski G. D. Bayesian multi-plate high-throughput screening of compounds. Sci. Rep. 2018, 8 (1), 9551. 10.1038/s41598-018-27531-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Kim S. C.; Oyakhire S. T.; Athanitis C.; Wang J.; Zhang Z.; Zhang W.; Boyle D. T.; Kim M. S.; Yu Z.; Gao X.; et al. Data-driven electrolyte design for lithium metal anodes. Proc. Natl. Acad. Sci. U.S.A. 2023, 120 (10), e2214357120 10.1073/pnas.2214357120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Narayanan Krishnamoorthy A.; Wölke C.; Diddens D.; Maiti M.; Mabrouk Y.; Yan P.; Grünebaum M.; Winter M.; Heuer A.; Cekic-Laskovic I. Data-Driven Analysis of High-Throughput Experiments on Liquid Battery Electrolyte Formulations: Unraveling the Impact of Composition on Conductivity. Chemistry-Methods 2022, 2 (9), e202200008 10.1002/cmtd.202200008. [DOI] [Google Scholar]
  35. Czop P.; Kost G.; Sławik D.; Wszołek G. Formulation and identification of first-principle data-driven models. J. Achiev. Mater. Manuf. Eng. 2011, 44 (2), 179–186. [Google Scholar]
  36. Kearnes S.; McCloskey K.; Berndl M.; Pande V.; Riley P. Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des. 2016, 30 (8), 595–608. 10.1007/s10822-016-9938-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Xiong J.; Xiong Z.; Chen K.; Jiang H.; Zheng M. Graph neural networks for automated de novo drug design. Drug Discovery Today 2021, 26 (6), 1382–1393. 10.1016/j.drudis.2021.02.011. [DOI] [PubMed] [Google Scholar]
  38. Duvenaud D. K.; Maclaurin D.; Iparraguirre J.; Bombarell R.; Hirzel T.; Aspuru-Guzik A.; Adams R. P.. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems; 2015; Vol. 28.
  39. Huang Y.-a.; Hu P.; Chan K. C.; You Z.-H. Graph convolution for predicting associations between miRNA and drug resistance. Bioinform. 2020, 36 (3), 851–858. 10.1093/bioinformatics/btz621. [DOI] [PubMed] [Google Scholar]
  40. Sun M.; Zhao S.; Gilvary C.; Elemento O.; Zhou J.; Wang F. Graph convolutional networks for computational drug development and discovery. Brief. Bioinform. 2020, 21 (3), 919–935. 10.1093/bib/bbz042. [DOI] [PubMed] [Google Scholar]
  41. Khemchandani Y.; O’Hagan S.; Samanta S.; Swainston N.; Roberts T. J.; Bollegala D.; Kell D. B. DeepGraphMolGen, a multi-objective, computational strategy for generating molecules with desirable properties: a graph convolution and reinforcement learning approach. J. Cheminform. 2020, 12 (1), 53. 10.1186/s13321-020-00454-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Tsubaki M.; Mizoguchi T.. On the equivalence of molecular graph convolution and molecular wave function with poor basis set. Advances in Neural Information Processing Systems; 2020; Vol. 33, pp 1982–1993.

Associated Data


Supplementary Materials

ci3c01030_si_001.pdf (1.2MB, pdf)
ci3c01030_si_002.xlsx (34.1KB, xlsx)
ci3c01030_si_003.xlsx (20.4KB, xlsx)
ci3c01030_si_004.xlsx (28.1KB, xlsx)

Data Availability Statement

The experimental and simulation data used in this paper are provided in the Supporting Information, along with the code used in the study. Simulation data were generated with workflows and tools from the Simulation Toolkit for Scientific Discovery (ST4SD), which is now available to the open-source community at https://st4sd.github.io/overview/. Restrictions may apply to the availability of the simulation codes, which are protected under a software license.


Articles from Journal of Chemical Information and Modeling are provided here courtesy of American Chemical Society
