Journal of Cheminformatics. 2025 Nov 18;17:173. doi: 10.1186/s13321-025-01099-w

Beyond performance: how design choices shape chemical language models

Inken Fender 1,2,#, Jannik Adrian Gut 1,2,#, Thomas Lemmin 1
PMCID: PMC12625328  PMID: 41254742

Abstract

Chemical language models (CLMs) have shown strong performance in molecular property prediction and generation tasks. However, the impact of design choices, such as molecular representation format, tokenization strategy, and model architecture, on both performance and chemical interpretability remains underexplored. In this study, we systematically evaluate how these factors influence CLM performance and chemical understanding. We evaluated models by fine-tuning them on downstream tasks and by examining the structure of their latent spaces with probing predictors, vector operations, and dimensionality reduction techniques. Although downstream task performance was similar across model configurations, substantial differences were observed in the structure and interpretability of internal representations, highlighting that design choices meaningfully shape how chemical information is encoded. In practice, atomwise tokenization generally improved interpretability, and a RoBERTa-based model with SMILES input remains a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it. These results provide guidance for the development of more chemically grounded and interpretable CLMs.

Graphical Abstract


Supplementary Information

The online version contains supplementary material available at 10.1186/s13321-025-01099-w.

Keywords: Large language models, Chemical language models, Interpretability, Machine learning for chemistry, Explainable AI (XAI), SMILES, SELFIES, RoBERTa, BART

Scientific Contribution

This study provides a systematic evaluation of how core design choices shape chemical language models. Although different configurations often achieved similar downstream performance, they differed substantially in the structure and interpretability of their internal representations. For standard prediction tasks, the RoBERTa-based model with atomwise-tokenized SMILES input provides a practical and reliable setup. By elucidating the impact of molecular representation and tokenization strategy, our results offer actionable guidance for the development of more interpretable and chemically informed CLMs.


Introduction

The design and discovery of novel molecules with desired properties are crucial for advances in medicine, materials science, and agriculture. Traditional experimental methods are often time-consuming and expensive; as a result, efficient computational approaches have become essential tools for accelerating molecular innovation.

Recent advances in deep learning, particularly in natural language processing (NLP), have sparked growing interest in applying language models to molecular data. By representing molecules as sequences, most commonly using SMILES (Simplified Molecular Input Line Entry System) [1] strings, text-based cheminformatics models have demonstrated strong performance across a range of tasks, including molecular property prediction, reaction completion, retrosynthesis planning, and de novo molecule generation [2–5]. In several cases, these models have even outperformed earlier cheminformatics methods such as molecular fingerprint-based approaches or graph neural networks (GNNs) [4, 6, 7].

To improve the syntactic robustness of sequence-based molecular representations, SELFIES (Self-Referencing Embedded Strings) [8] was introduced as an alternative to SMILES. Unlike SMILES, which can produce invalid molecules due to syntax errors, every SELFIES string maps to a valid molecule by design. In addition, tokenization strategies play a crucial role in how models interpret molecular sequences. The most common approaches include atomwise tokenization [4], which segments the string based on atoms and bonds, and subword tokenization using SentencePiece [9, 10], which learns data-driven subword tokens that improve training efficiency.

Ultimately, the performance and interpretability of these models are influenced by several design choices, including molecular representation, tokenization strategy, and the architectural backbone. These factors collectively influence both downstream task performance and the internal representation of the model's chemical knowledge. However, a complete understanding of the individual and synergistic effects of these components is still lacking, motivating increased research in this area [11, 12]. Although some initial evidence suggests that the choice of molecular representation might not be a primary driver of property prediction performance [2, 13, 14], the impact of tokenization [12] and, crucially, the architectural backbone [15] are still poorly understood and require more in-depth investigation. While non-canonical SMILES have shown benefits in certain settings, prior work recommends canonicalization when computational resources are limited, as it improves training efficiency without sacrificing performance [16].

In this work, we therefore systematically investigate the impact of three core design choices in large chemical language models (CLMs): molecular representation (SMILES vs. SELFIES), tokenization strategy (atomwise vs. SentencePiece), and model architecture (RoBERTa vs. BART). Our goal is to understand how these choices influence downstream performance, the structure of the latent space, and the chemical interpretability of the learned embeddings. We evaluate models on predictive tasks, probe the organization of their latent representations using probing predictors, vector operations and dimensionality reduction, and examine atom-level embeddings in relation to chemical typing schemes. While downstream task performance is often comparable across configurations, we find that certain setups, particularly those using SMILES with atomwise tokenization, yield more chemically structured embeddings, potentially indicating a deeper internalization of chemical context (Fig. 1).

Fig. 1.

Fig. 1

Workflow of the systematic analysis of molecular representation, tokenization strategies, and architecture in chemical language models (CLMs)

Methods

We investigated the performance of Transformer-based models for chemical structure representation. We trained a series of BART and RoBERTa models, exploring the impact of varying tokenization strategies (atomwise [4] and SentencePiece [9, 10]) and molecular representations (SMILES and SELFIES).

Pretraining dataset

The initial dataset was derived from the PubChem-10 M dataset [3], a collection of 10 million chemical structures sourced from PubChem [17]. To ensure the quality and consistency of the SMILES strings, all molecules were canonicalized using RDKit [18]. For each molecule, RDKit was also used to generate one potential stereoisomer, with a maximum of ten attempts per molecule. These isomers model chirality explicitly and were added to the implicit-chirality molecules to form the explicit-chirality dataset. To generate the corresponding SELFIES representations, the canonicalized SMILES were converted to SELFIES using the SELFIES library [8]. To validate the conversion and ensure reversibility, the generated SELFIES strings were then back-translated to SMILES with the same library, and only molecules for which the back-translated SMILES matched the original canonicalized SMILES were retained. This procedure filtered out 10,818 (0.1%) of the ten million molecules in the base dataset, while 5,460,790 explicit-chirality isomers were added to create the explicit-chirality dataset.
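
For illustration, the canonicalization and round-trip filtering step can be sketched with RDKit and the selfies library as follows. This is a minimal sketch, assuming the back-translated SMILES is re-canonicalized before comparison; the exact implementation in our pipeline may differ.

```python
from rdkit import Chem
import selfies as sf

def canonicalize_and_check(smiles: str):
    """Canonicalize a SMILES string and verify the SELFIES round trip.

    Returns the canonical SMILES if the molecule survives both steps,
    otherwise None (the molecule would be filtered out).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    canonical = Chem.MolToSmiles(mol)              # RDKit canonical SMILES
    try:
        selfies_str = sf.encoder(canonical)        # SMILES -> SELFIES
        back_translated = sf.decoder(selfies_str)  # SELFIES -> SMILES
    except (sf.EncoderError, sf.DecoderError):
        return None
    # Assumption: the back-translated SMILES is re-canonicalized before matching.
    if Chem.MolToSmiles(Chem.MolFromSmiles(back_translated)) == canonical:
        return canonical
    return None
```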

Tokenization

Two different tokenization strategies were used: atomwise [4] and SentencePiece [9, 10] (Table 1). The atomwise strategy decomposes SMILES strings into individual atoms and bonds, treating each as a separate token identified by a regular expression pattern. SentencePiece, in contrast, applies subword tokenization, grouping characters within and across atoms based on their frequency of occurrence. To generate the SentencePiece vocabulary, we used the Hugging Face library [19], creating a vocabulary of 1,000 subword units. An example tokenization is shown in Table 1, followed by a minimal tokenizer sketch.

Table 1.

Example tokenization of a SMILES string. Vertical bars (|) delimit individual tokens

SMILES Cc1cc(=O)[nH]c(=S)[nH]1
Atomwise C | c | 1 | c | c | ( | = | O | ) | [nH] | c | ( | = | S | ) | [nH] | 1
SentencePiece C | c1c | c(=O)[nH] | c( | =S) | [nH]1
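
For illustration, atomwise tokenization can be sketched with a regular expression similar to the pattern introduced for the Molecular Transformer [4]; the exact pattern used in this work may differ.

```python
import re

# Regular expression for atomwise SMILES tokenization, similar to the pattern
# of Schwaller et al. [4]; the exact pattern used in this work may differ.
SMILES_ATOM_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atomwise_tokenize(smiles: str) -> list:
    """Split a SMILES string into atom- and bond-level tokens."""
    return SMILES_ATOM_PATTERN.findall(smiles)

print(" | ".join(atomwise_tokenize("Cc1cc(=O)[nH]c(=S)[nH]1")))
# C | c | 1 | c | c | ( | = | O | ) | [nH] | c | ( | = | S | ) | [nH] | 1
```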

Language model description

Two Transformer-based large language model architectures were employed: RoBERTa [20] and BART [21]. Both leverage the Transformer architecture [22] to capture long-range dependencies within sequential data. RoBERTa (Robustly Optimized BERT Pre-training Approach) is an encoder-only model that builds upon the BERT [23] architecture by optimizing its training procedure and hyperparameters. BERT (Bidirectional Encoder Representations from Transformers), introduced in 2019, consists solely of the encoder part of the Transformer architecture and was pre-trained on two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In contrast, RoBERTa is pre-trained with the masked language modeling objective only, in which the model predicts masked tokens within input sequences. RoBERTa outperformed BERT and many of its successors owing to, among other changes, longer training and an adapted token-masking schedule [20]. BART (Bidirectional and Auto-Regressive Transformers) is an encoder-decoder model designed for sequence-to-sequence tasks. During pre-training, it uses a denoising autoencoder objective to reconstruct corrupted input sequences. Unlike in RoBERTa, the corruptions can span longer contiguous sections of text whose lengths are hidden from the model. BART achieved state-of-the-art (SOTA) results in question answering, performed on par with RoBERTa on GLUE and SQuAD, and outperformed RoBERTa in other areas [21].

Both BART and RoBERTa models were implemented and trained using the fairseq library [24]. Detailed training parameters, including hyperparameters and optimization settings, are provided in Section S1 of the Supplementary Information.

Downstream tasks

To evaluate the performance of our pre-trained models, we conducted fine-tuning experiments on a series of downstream tasks sourced from the MoleculeNet benchmark suite [25, 26] (Table 2). The following classification tasks were used: BACE, BBBP, ClinTox (CT_TOX), HIV, and Tox21 (SR-p53). For regression tasks, we utilized the BACE, Clearance, Delaney, and Lipo tasks. We employed the default train-validation-test splits provided by MoleculeNet for each dataset. The data processing pipeline mirrored that of the pre-training dataset, with the exception of explicit isomer generation. We note that eight samples (0.18%) in the HIV test set did not pass the SELFIES filters and were omitted from all our tests.

Table 2.

Classification and regression downstream tasks and descriptions from the MoleculeNet benchmark suite [25, 26]

Task | Description | Train/Val/Test
Classification
 BACE | Binding result classes for a set of inhibitors of human beta-secretase (BACE-1) | 1210/151/152
 ClinTox (CT_TOX) | Classification of drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons | 1181/148/148
 BBBP | Classification results on blood-brain barrier penetration | 1631/204/204
 HIV | Classification results on ability to inhibit HIV replication | 32,874/4096/4105
 Tox21 (SR-p53) | Toxicity classification results of compounds as provided by the "Toxicology in the 21st Century" (Tox21) initiative | 6264/783/784
Regression
 BACE | Binding results for a set of inhibitors of human beta-secretase (BACE-1) | 1210/151/152
 Clearance | Results on the rate at which the human body removes unbound drugs from the blood | 669/84/84
 Delaney | Water solubility results of compounds | 902/113/113
 LIPO | Results of the octanol/water distribution coefficient (logD at pH 7.4) | 3360/420/420

Probing predictors

To analyze the latent spaces of the pre-trained models, we trained three types of probing predictors on frozen embeddings generated from different language model architectures: k-nearest neighbors (KNN) [27], linear support vector machines (Linear SVM), and radial basis function support vector machines (RBF SVM) [28]. All implementations were from scikit-learn [29].

K-nearest neighbours (KNN): We performed a grid search over n_neighbors = [1, 5, 11] and weights = [’uniform’, ’distance’].

Support vector machines (SVMs): Linear and RBF kernels were used. The regularization parameter C was tuned over [0.1, 1, 10]. Linear SVMs were trained with max_iter = 1000 and all classifiers used balanced class weights. SVMs were applied as support vector classifiers (SVCs) or regressors (SVRs) depending on the task.

For all atom-level predictors, data was split at the molecule level, so that all atoms from a given molecule remained in the same split.
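
A minimal sketch of the probing setup for a classification task is given below, assuming frozen embeddings X and labels y as NumPy arrays (hypothetical variable names); whether the linear probe was implemented as SVC with a linear kernel or as LinearSVC is an assumption, and regression tasks would use KNeighborsRegressor and SVR analogously.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def fit_probing_classifiers(X: np.ndarray, y: np.ndarray) -> dict:
    """Grid-search the three probing classifiers on frozen embeddings X."""
    candidates = {
        "knn": (KNeighborsClassifier(),
                {"n_neighbors": [1, 5, 11], "weights": ["uniform", "distance"]}),
        "linear_svm": (SVC(kernel="linear", max_iter=1000, class_weight="balanced"),
                       {"C": [0.1, 1, 10]}),
        "rbf_svm": (SVC(kernel="rbf", class_weight="balanced"),
                    {"C": [0.1, 1, 10]}),
    }
    results = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=3)
        search.fit(X, y)
        results[name] = (search.best_params_, search.best_score_)
    return results
```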

Molecule-level prediction tasks

Two sets of molecule-level downstream tasks were considered:

MoleculeNet tasks [25, 26]: We evaluated the performance of each probing predictor for each pre-trained embedding configuration and compared it to the corresponding fine-tuned language model. Performance was quantified using z-scores on the area under the receiver-operating characteristic curve (ROC-AUC). For regression tasks, z-scores of the rectified root mean squared error (RMSE) were inverted so that higher values consistently indicate better performance.

RDKit descriptor tasks: Seven tasks were considered. Two were binary classification tasks, predicting whether a molecule contains at least one heterocycle or one hydrogen-bond donor. The other five were regression tasks, predicting Chi0v [30], Kappa1 [30], the octanol–water partition coefficient (MolLogP) [31], molar refractivity (MolMR) using the Crippen method [31], and the quantitative estimate of drug-likeness (QED) [32].

We constructed balanced datasets from the pre-training dataset for each probing task. For the binary classification tasks, 100,000 molecules were sampled per class (presence/absence). For the regression tasks, 50,000 molecules were sampled. All molecules were annotated with the respective features using the RDKit [18]. Latent embeddings were generated with the pre-trained models and subsequently split into equal-sized training and test sets. To estimate variability, three-fold cross-validation was performed, stratified for the classification tasks [27].
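
The descriptor labels can be reproduced with standard RDKit calls; the exact functions used in the original pipeline are not specified, so the following is a sketch based on the descriptors named above.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED, rdMolDescriptors

def descriptor_labels(smiles: str) -> dict:
    """Compute the seven probing-task labels for one molecule with RDKit."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "has_heterocycle": rdMolDescriptors.CalcNumHeterocycles(mol) > 0,
        "has_h_donor": rdMolDescriptors.CalcNumHBD(mol) > 0,
        "Chi0v": Descriptors.Chi0v(mol),
        "Kappa1": Descriptors.Kappa1(mol),
        "MolLogP": Descriptors.MolLogP(mol),   # Crippen LogP
        "MolMR": Descriptors.MolMR(mol),       # Crippen molar refractivity
        "QED": QED.qed(mol),
    }

print(descriptor_labels("Cc1cc(=O)[nH]c(=S)[nH]1"))
```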

Dimensionality reduction

To visualize high-dimensional embeddings, we applied two dimensionality reduction techniques: principal component analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). PCA and UMAP were implemented using scikit-learn [29] and the umap-learn library [33], respectively.

PCA was used to compute a linear transformation projecting embeddings into two dimensions, capturing the directions of maximum variance.

UMAP was applied to capture potential non-linear structure in the embeddings, projecting them into two dimensions for visualization with n_neighbors = 15 and min_dist = 0.5.

PCA was chosen for its interpretability, while UMAP was included to provide insight into potential non-linear clustering in the latent space. Both techniques were used solely for visualization; no quantitative analysis was performed on the reduced embeddings.
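
A minimal sketch of the two projections, assuming a matrix of pooled molecule embeddings named `embeddings` (a hypothetical variable name):

```python
import numpy as np
import umap
from sklearn.decomposition import PCA

def project_2d(embeddings: np.ndarray):
    """Project molecule embeddings to 2D with PCA and UMAP for visualization."""
    pca = PCA(n_components=2)
    pca_coords = pca.fit_transform(embeddings)
    umap_coords = umap.UMAP(n_neighbors=15, min_dist=0.5).fit_transform(embeddings)
    return pca_coords, umap_coords, pca.explained_variance_ratio_
```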

Vector operations

We selected nine classes of molecules derived from hydrocarbon chains: alkanes, alcohols, aldehydes, ketones, carboxylic acids, ethers, esters, halocarbons with chloride, and amines. For each class, seven small molecules were chosen, ranging from the smallest molecule in the class up to structures containing a maximum of seven carbon atoms. Cosine similarity between embeddings was computed to assess the relative positions of molecules in the latent space. In addition, we performed vector arithmetic on molecule embeddings using the operation

E_A - E_B + E_C,

where E_A, E_B, and E_C are the embeddings of three selected molecules. The resulting vector was then compared to all other embeddings using cosine similarity to identify the closest matching molecule in the latent space. To avoid trivial results, vector operations in which the closest molecule was already part of the input set were excluded. Finally, a linear decomposition of one molecule from the vector operations was conducted using LASSO regression [34], limiting the decomposition to five components to quantify the contributions of individual component molecules to the embedding.
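
The analogy search and the sparse decomposition can be sketched as follows. The variable names, the exclusion logic, and the LASSO regularization strength `alpha` are illustrative assumptions; the original analysis limits the decomposition to five components, which is approximated here by tuning `alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics.pairwise import cosine_similarity

def analogy(emb: np.ndarray, names: list, a: str, b: str, c: str) -> str:
    """Return the molecule closest (by cosine similarity) to E_A - E_B + E_C,
    excluding the three input molecules."""
    idx = {n: i for i, n in enumerate(names)}
    query = emb[idx[a]] - emb[idx[b]] + emb[idx[c]]
    sims = cosine_similarity(query.reshape(1, -1), emb).ravel()
    for i in (idx[a], idx[b], idx[c]):
        sims[i] = -np.inf            # exclude trivial answers from the input set
    return names[int(np.argmax(sims))]

def lasso_decomposition(emb: np.ndarray, names: list, target: str,
                        alpha: float = 0.01) -> dict:
    """Sparse linear decomposition of one molecule embedding into the others.

    alpha is an illustrative value; tuning it until only five coefficients
    remain non-zero would correspond to the five-component limit used above.
    """
    idx = {n: i for i, n in enumerate(names)}
    t = idx[target]
    others = [i for i in range(len(names)) if i != t]
    model = Lasso(alpha=alpha).fit(emb[others].T, emb[t])
    return {names[i]: w for i, w in zip(others, model.coef_) if abs(w) > 1e-6}
```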

Atom-type assignment and embedding extraction

We randomly selected 4,000 molecules from the pre-training dataset and generated mol2-files for each. Antechamber was then used to assign General AMBER Force Field 2 (GAFF2) atom types [35] to each atom in these molecules, a standard procedure for preparing molecules for molecular dynamics simulations [36]. Atom type assignments were validated with the parmchk2 utility [36], and molecules with flagged atom types were discarded to avoid potential parametrization errors. Atom types were assigned for SELFIES by translating SMILES strings into SELFIES using the SELFIES library with the "attribute=True" flag, which preserves the correspondence between SMILES and SELFIES tokens.

For each molecule, atom-level embedding vectors were extracted from the pre-trained models. To ensure comparability, only atom types consistently assigned across both SMILES and SELFIES representations were retained. The final dataset comprised 365 molecules with consistent atom assignments and corresponding embeddings across models and representations.
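
A sketch of the attribute-aware conversion, assuming the selfies 2.x interface in which the encoder called with attribute=True returns the SELFIES string together with an attribution map; the exact structure of the returned attribution objects may differ between library versions.

```python
import selfies as sf

smiles = "Cc1cc(=O)[nH]c(=S)[nH]1"
# With attribute=True the encoder additionally returns an attribution map that
# links each SELFIES token back to the SMILES tokens it was derived from.
selfies_str, attributions = sf.encoder(smiles, attribute=True)
print(selfies_str)
for attribution in attributions:
    print(attribution)  # one entry per SELFIES token with its source SMILES tokens
```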

Atom-level prediction tasks

The probing predictors were assessed on two types of atom-level tasks, a classification task and five regression tasks. For the classification task, probing classifiers were trained to predict GAFF2 atom types from the embeddings. For regression, probing regressors were trained to predict quantum-mechanical properties from the DASH dataset [37], including Mulliken charges [38], dual descriptors of electro- and nucleophilicity [39], MBIS dipole strengths [40], and RESP partial charges [41]. Only the properties of the first conformation of each molecule, as provided in the DASH dataset, were used, since SMILES and SELFIES cannot encode specific conformations.

For each task, three cross-validation splits were generated, ensuring that all atoms from the same molecule were assigned to the same split. Stratification was applied for the classification task [27].
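
Such molecule-grouped splits can be obtained, for example, with scikit-learn's group-aware splitters; whether these particular classes were used in the original implementation is an assumption.

```python
from sklearn.model_selection import GroupKFold, StratifiedGroupKFold

def atom_level_splits(X_atoms, y, mol_ids, n_splits=3, classification=True):
    """Cross-validation splits that keep all atoms of a molecule together.

    X_atoms: atom-level embeddings, y: atom labels (GAFF2 types or DASH
    properties), mol_ids: molecule index for every atom (hypothetical names).
    Stratification by label is applied for the classification task only.
    """
    splitter = (StratifiedGroupKFold(n_splits=n_splits) if classification
                else GroupKFold(n_splits=n_splits))
    return list(splitter.split(X_atoms, y, groups=mol_ids))
```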

Results

Downstream task performance

We first pre-trained 16 distinct models, systematically varying key architectural and data representation parameters. Specifically, we explored combinations of two molecular representations (SMILES [1], SELFIES [8]), two model architectures (BART [21], RoBERTa [20]), two tokenization strategies (atomwise [4], SentencePiece [9]), and two chirality representations (implicit, explicit).

To evaluate the impact of these parameters on the models’ predictive capabilities, we performed fine-tuning experiments on a series of downstream tasks from the MoleculeNet datasets [25] (Table 2) of the DeepChem benchmark suite [26]. Each of the 16 pre-trained models was fine-tuned on five classification tasks (BACE, BBBP, ClinTox, HIV, Tox21) and four regression tasks (BACE, Clearance, Delaney, Lipo) and after hyperparameter tuning, fine-tuning was repeated five times using different random seeds to assess model robustness.

Given the varying difficulty and scales of these downstream tasks, direct comparison of raw performance metrics is challenging. Therefore, we present and discuss the z-scores [42] of the models’ performance (Fig. 2), which were computed to normalize the performance of each model across the different tasks. For classification tasks, z-scores were calculated based on the area under the receiver-operating characteristic curve (ROC-AUC). For regression tasks, z-scores were calculated for each rectified root mean squared error (RMSE) and then averaged. Raw performance metrics and comparisons to benchmark models for classification and regression tasks are available in Supplementary Information Table S3 and Table S4 respectively.
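
The per-task normalization amounts to a column-wise z-score over the model-by-task score matrix; a minimal sketch:

```python
import numpy as np

def task_z_scores(scores: np.ndarray, invert: bool = False) -> np.ndarray:
    """Column-wise z-scores of a (n_models, n_tasks) score matrix.

    For error metrics such as RMSE, set invert=True so that higher z-scores
    consistently indicate better performance.
    """
    z = (scores - scores.mean(axis=0, keepdims=True)) / scores.std(axis=0, keepdims=True)
    return -z if invert else z
```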

Fig. 2.

Fig. 2

Hierarchical clustering of z-scores of downstream tasks of the 16 different model configurations using cosine similarity. Z-scores were calculated from ROC-AUC, except for the aggregated regression score which is based on RMSE. Model configurations are listed, starting from left to right, by molecule embedding, tokenisation, language model architecture and chirality representation (“explicit” as enriched in isomers, “implicit” as default)

Although no single model configuration consistently outperformed all others, hierarchical clustering of the 16 model configurations based on the cosine similarity of their z-score performance profiles revealed meaningful groupings. The first-level clustering mainly separated models by tokenization strategy. Notably, models employing atomwise tokenization generally exhibited superior performance compared to those using SentencePiece, with the exception of the Tox21 task.
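
A sketch of this clustering step, assuming SciPy's hierarchical clustering with average linkage (the linkage method is an assumption; only the cosine metric is specified above):

```python
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

def cluster_configurations(z, labels):
    """Hierarchically cluster model configurations by their z-score profiles.

    z: (n_configurations, n_tasks) matrix of z-scores; labels: configuration
    names (hypothetical variable names).
    """
    distances = pdist(z, metric="cosine")   # pairwise cosine distances
    tree = linkage(distances, method="average")
    order = leaves_list(tree)               # leaf ordering of the dendrogram
    return tree, [labels[i] for i in order]
```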

At the next level of clustering, models further grouped according to the molecular representation used. SMILES representations tended to yield better performance on the BBBP and HIV tasks as well as aggregated regression scores, whereas SELFIES representations showed advantages on the ClinTox task. Explicit chirality representation was associated with improved performance on the BACE, ClinTox, HIV, Tox21, and BBBP tasks.

Model architecture also influenced performance. BART models demonstrated better results on the BACE, HIV, and Tox21 tasks, while RoBERTa models achieved higher performance on the ClinTox and BBBP tasks.

Based on Wilcoxon signed-rank tests [43] (Table S5 in the Supplementary Information), we prioritized atomwise tokenization due to its statistically significant better performance (p = 0.020) and implicit chirality representation, which showed no significant detriment (p = 0.123), for further analysis. We further investigated BART and RoBERTa architectures with both SMILES and SELFIES.

When comparing the z-scores of probing predictors on frozen embeddings with those of fine-tuned models, the impact of fine-tuning was found to be highly task-dependent (Fig. 3a and Table S6 and Table S7 in the Supplementary Information). For some tasks, fine-tuning yielded only marginal improvements over the best probing predictors (e.g. BACE classification). Nevertheless, the highest score for each task was consistently obtained with a fine-tuned configuration, although in several cases the margin over the probing baseline was minimal (e.g. ClinTox). In contrast, other tasks showed clear improvements with fine-tuning, likely reflecting distributional shifts between the pre-training and downstream data or differences in training objectives (e.g. HIV). The strongest effect was observed for the BART architecture with SELFIES (implicit chirality) and SentencePiece tokenization on the Clearance task. Notably, this task is based on the smallest dataset, where individual outliers are likely to exert a disproportionate influence on performance.

Fig. 3.

Fig. 3

Performance of three probing predictors trained on embeddings of pre-trained language model architectures and molecule representation with the atomwise tokeniser and implicit chirality representation. a) Comparison of best probing predictor per pre-trained embedding configuration versus fine-tuned results of language models on MoleculeNet z-scores. Regression scores have been inverted, therefore a higher z-score is better for every task. b) Results of probing predictors of four RDKit descriptor tasks; classification if there is at least one assigned heterocycle or H-Donor in the molecule and regression on the assigned Kappa1 and MolLogP values. Bars indicate the standard deviation across three cross validation splits

Latent space analysis

To evaluate whether the pre-trained models captured chemically meaningful features in their latent spaces, we examined how well they encoded seven molecular properties. Two classification tasks were considered: detecting the presence of at least one heterocycle and identifying molecules with at least one hydrogen-bond donor. In addition, five regression tasks were performed: predicting the topological indices Chi0v [30] and Kappa1 [30], the octanol–water partition coefficient (MolLogP) and molar refractivity (MolMR) using the Crippen method [31], and the quantitative estimate of drug-likeness (QED) [32], which is computed from the predicted LogP and related descriptors.

BART embeddings slightly outperformed RoBERTa on both classification tasks and the MolLogP regression task (Fig. 3b and Figure S2 in the Supplementary Information). In contrast, RoBERTa achieved higher average scores on Kappa1, Chi0v, and MolMR, although these differences were accompanied by larger standard deviations. For QED, performance was comparable, with similar overall scores and different predictors yielding the highest values.

SMILES representations slightly outperformed SELFIES on the classification tasks and the QED task, whereas SELFIES yielded lower errors for MolLogP, MolMR, Kappa1, and Chi0v. As previously, the larger apparent performance gains for SELFIES are accompanied by higher standard deviations, indicating greater variability.

The RBF-SVM consistently achieved the highest performance across all models and tasks, indicating that the latent space encodes complex, non-linear relationships that effectively separate molecules based on target features. The choice of the second-best predictor varied by task: KNN performed better for the topological descriptors and MolMR, whereas Linear SVM was superior for the classification tasks and MolLogP.

Based on observed performance patterns, we grouped the tasks into three sets. The first set, which includes the classification tasks and MolLogP, shows a weak preference for Linear SVM over KNN, BART over RoBERTa, and SMILES over SELFIES. The second set, consisting of the topological descriptors Chi0v, Kappa1, and MolMR, favors KNN over Linear SVM, RoBERTa over BART, and SELFIES over SMILES, although standard deviations are higher. The third set comprises QED, where performance is generally comparable across models and predictors.

Molecule embeddings

For a qualitative assessment of the learned embedding space, we selected 64 molecules from each of the following four chemical classes: steroids, beta-lactams, tropanes, and sulfonamides. These chemical classes were chosen because they all have some common use in pharmacology and yet have distinct structures and chemical features. The selected molecules were embedded using the atomwise tokenizer with implicit chirality representation.

SMILES embeddings exhibited the most distinct clustering in the PCA [27] (Fig. 4) and UMAP [33] (Figure S1 in the Supplementary Information) plots, with well-defined chemical families showing different degrees of separation; SELFIES embeddings also demonstrated structured clustering, but to a lesser degree. RoBERTa-based SMILES embeddings were the most clearly separated, followed by BART-based SMILES embeddings, whereas the difference between the SELFIES embeddings was less pronounced. In the SMILES-based embeddings, the first principal component alone already serves as a usable, although imperfect, classifier. The variance explained by the leading principal components is comparable across models, but the first principal component of the BART-based SMILES embedding is highest at 26%, which is in line with the previous results.

Fig. 4.

Fig. 4

PCA visualization of molecular embeddings; Scatter plot depicting the latent space organization of 64 molecules from four distinct chemical families each: steroids (purple diamonds), beta-lactams (green triangles), tropanes (teal squares), and sulfonamides (blue crosses). Embeddings were generated from SELFIES or SMILES representations using RoBERTa, BART, or untrained models with atomwise tokenization and implicit chirality representation

To ensure that the observed clustering patterns arise from meaningful learned features rather than artifacts of the embedding process or dimensionality reduction techniques, we evaluated untrained models as a control. In the untrained PCA projection, many sulfonamides form a distinct cluster in the SMILES embedding, likely due to the presence of sulphur, a feature absent in the other molecules, making it identifiable even without training. However, the remaining molecular classes do not exhibit clear separation, with their embeddings appearing mixed. Similarly, the UMAP projection shows weak clustering tendencies, but no well-defined clusters emerge.

Latent space vector operations

To further characterize the models and their representation of molecular patterns, we analyzed the cosine similarities between embeddings of molecules from nine classes of hydrocarbon chains and their derivatives: alkanes, alcohols, aldehydes, ketones, carboxylic acids, ethers, esters, halocarbons with chloride, and amines (Fig. 5). For each class, seven molecules were selected, from the smallest member of the class to structures with a maximum of seven carbon atoms.

Fig. 5.

Fig. 5

Heatmaps of cosine similarity of embeddings between molecules of different classes for 6 different models: RoBERTa, BART, as well as an untrained BART model with representations SMILES or SELFIES. Each molecule class consists of 7 different molecules

In general, BART embeddings, whether trained on SMILES or SELFIES, span a wider range of cosine similarities than RoBERTa embeddings. The RoBERTa models, even compared to the untrained models, show the least diverse similarity values, with most similarities close to 1 (yellow). The diagonal of self-similarity is less distinct, especially for RoBERTa trained on SELFIES; SMILES-trained RoBERTa displays a finer, more heterogeneous pattern. As expected, box patterns for molecules of the same class, e.g. between carboxylic acids, are visible for all models; they are faintest for the RoBERTa models, where the narrow range of similarity values makes them harder to spot. Patterns of higher intragroup similarity are already visible to some extent for the untrained models, highlighting the richness of the representation even before training.

The untrained embeddings, as expected, appear more random than the trained ones and show off-center diagonal lines of higher similarity for molecules of the same length rather than the same class. These patterns are most striking for untrained models, which is unsurprising, as molecule length is a feature that can be encoded without chemical understanding. More faintly, these patterns are also visible within groups in the pre-trained models, where a band of higher similarities runs from the top-left to the bottom-right corner of each box comparing different groups. Although slight, this pattern is more noticeable in the RoBERTa-based models.

The main pattern observed in the untrained model remains visible in both BART and RoBERTa models for both representations: for example, a box structure framed by lines of lower similarity extends from the alcohols to the halocarbons, clearly showing that all models register the pronounced difference between pure-carbon alkanes, oxygen-containing carbon groups, and halocarbons with chloride. As a difference between representations, SELFIES show similar regions of higher similarity between aldehydes and ketones, whereas SMILES show a larger box structure that additionally includes alcohols and carboxylic acids.

Notably, a cross pattern for formic acid, the smallest and simplest carboxylic acid, can be seen for all models, as it does not show high cosine similarity to any other molecule embedding, including the other carboxylic acids in its set.

To further explore how molecular relationships are encoded in the embedding space, we applied vector operations to the molecule embeddings. Such operations, commonly used in natural language processing, test whether semantic or structural relationships can be recovered through arithmetic in the latent space. Specifically, we used operations of the form

E_A - E_B + E_C,

where E_A, E_B, and E_C denote the embeddings of three selected molecules. The resulting vector was compared against all embeddings in the dataset, and the molecule with the highest cosine similarity was taken as the predicted outcome (Figure S3 in the Supplementary Information). We found that while simple operations, e.g. the straightforward but drastic exchange of a hydroxyl group for a chloride group, yielded the expected results for all models, even untrained ones, more difficult operations involving longer carbon chains resulted in fewer models returning the expected outcomes. To evaluate whether vector operations in embedding space could reflect systematic chemical transformations, we tested two cases in which an aldehyde should be converted into the corresponding carboxylic acid. In the first example, the operation was expected to yield butyric acid from butanal. All models retrieved a carboxylic acid as the closest embedding, but only the BART models produced the correct butyric acid, whereas untrained models failed to return acids altogether. In the second example, extending the chain length by one carbon should have produced pentanoic acid from pentanal. Here, the BART models again returned butyric acid. Interestingly, although the RoBERTa model with SELFIES predicted the wrong acids in both cases, its predictions nevertheless reflected the intended pattern: in the second case, the predicted acid was extended by one carbon compared to the first, consistent with the expected chain-length modification.

To further investigate how molecule embeddings capture structural relationships, we performed a linear decomposition of the butyric acid embedding (Figure S4 in the Supplementary Information). For BART, the decomposition identified the next longest carboxylic acid as the primary component and the next shorter acid as the secondary component. For RoBERTa trained on SMILES, this order was reversed, whereas the SELFIES-trained RoBERTa model returned a ketone of the same chain length as the target acid as its primary component. Aside from this ketone, all components suggested by the RoBERTa models remained within the carboxylic acid class. BART models also included minor contributions from other functional groups, such as aldehydes, amines, or esters, albeit with very low weights.

Atom type embeddings

To further evaluate the degree of chemical understanding acquired by our pre-trained models, we visualized the similarity of atom embeddings based on GAFF2 atom types [36] using PCA. The aim was to determine whether atoms sharing the same GAFF2 atom type are represented by similar embedding vectors and whether their clustering differs between the different models. For this analysis, we focussed on the most commonly occurring atom types.

For carbon atoms, PCA of the embeddings reveals distinct clustering patterns across molecular representations and architectures (Fig. 6). In the RoBERTa model trained on SMILES, aromatic carbons (GAFF2 atom type: ca) are effectively separated from sp2-hybridized aliphatic or ketone/thioketone carbons (GAFF2: c2 and c) and sp3 carbons (GAFF2: c3). A similar, albeit weaker, separation can be seen for BART trained on SMILES, which additionally separates the c2 atom type from the other atom types more clearly than the RoBERTa models. For SELFIES-trained models, in contrast to SMILES-trained models, there is no distinct separation and all atom types overlap. When comparing the untrained models as a baseline, we observe that SELFIES again show no distinct clustering, while aromatic carbon atom types (ca) are clearly clustered when using SMILES.

Fig. 6.

Fig. 6

PCA of carbon atom type embeddings of SMILES- or SELFIES-based models BART and RoBERTa with atomwise tokeniser and implicit chirality. The GAFF2 atom types have been determined by antechamber[36] and correspond to the following hybridizations: c: sp2 in C=O, C=S, c2: sp2 in aliphatic carbon, c3: sp3, ca: sp2 in aromatic carbon. Amount of examples in brackets

For nitrogen atoms, the embeddings reveal that the amide nitrogen with one attached hydrogen (GAFF2: ns) clusters distinctly in the RoBERTa SMILES model along the second principal component (Figure S5 in the Supplementary Information). For RoBERTa SELFIES and BART SMILES this clustering of atom type ns is visible, but less distinct. For all other atom types, and specifically for BART trained on SELFIES, dimensionality reduction of atom types does not yield clusters. Notably, since all the remaining atom types (na, nb, and nd) are sp2-hybridized, they are structurally quite similar, which may make further clustering either challenging or impractical. In contrast, the untrained models, regardless of whether they use SMILES or SELFIES, do not effectively cluster any nitrogen atom types.

For oxygen atoms, the sp2-hybridized oxygen typically found in carbonyl groups (GAFF2: o) is, surprisingly, largely separated only for the RoBERTa SELFIES model, which also clusters the sp3-hybridized atom types oh and os. Apart from this, less distinct clustering of the sp2-hybridized oxygen atom type 'o' is only seen for the BART SMILES model (Figure S6 in the Supplementary Information). In most other configurations all atom types overlap greatly, with some separation of oxygen in ethers and esters (GAFF2: os) visible only for the RoBERTa SMILES model. SMILES-based models show a slight tendency to group atom types along the second principal component. Compared to the untrained models, we again find that where SELFIES reveal no distinct clustering, SMILES enable the grouping of the sp2-hybridized oxygen 'o' even before training. This early clustering, especially for carbon atom types, likely reflects inherent signals present in SMILES strings, such as lowercase characters denoting aromatic atoms and symbols like "=" indicating bond order, which can implicitly encode hybridization or bonding environments.

We evaluated the impact of case sensitivity by modifying molecules to use only uppercase carbon atoms via the kekulize flag, thereby removing the distinction between aromatic (lowercase) and aliphatic (uppercase) carbons. As expected, given the loss of case-based cues, this led to less distinct clustering of carbon atom types in the embedding space (Figure S6 in the Supplementary Information): the models were no longer able to differentiate aromatic carbons (ca) from sp2-hybridized (c and c2) and sp3-hybridized (c3) carbons. As before, for nitrogen the RoBERTa model was still able to separate atom type ns from the other atom types, since kekulization only affects aromatic atoms. Surprisingly, for the BART model, the clustering of the nitrogen atom type ns disappeared. For oxygen atom types, the BART model showed a similar degree of clustering of the three atom types along the first principal component as it previously did along the second. The RoBERTa model again showed grouping only of the 'os' atom type.

Atom-level predictions

Finally, to quantify how well the embeddings capture atom-level chemical information, we applied the probing predictors to a series of atom-level tasks (Fig. 7 and Figure S8 and Figure S9 in the Supplementary Information). Specifically, we assessed the ability of KNN, Linear SVC, and RBF SVC to recover GAFF2 atom types, as previously assigned. In addition, we used the same predictors to estimate quantum-chemical properties, including Mulliken charges, the Dual descriptor, RESP partial charges, and MBIS dipole strengths, across all four model types.

Fig. 7.

Fig. 7

Probing predictors results of atom-level tasks. Top left is the combined GAFF2 annotation task across all elements of the previous dataset. The other three plots show RMSE of regression tasks based on the DASH properties dataset. Bars show standard deviation across three fold (stratified) cross validation

Across nearly all atom-level tasks, SMILES embeddings consistently outperformed SELFIES, achieving the highest accuracies for GAFF2 atom type classification and the lowest RMSEs for Mulliken charges, as well as for most Dual descriptor and MBIS dipole strength regressions. RoBERTa embeddings generally surpassed BART across all tasks and predictors. Notably, KNN emerged as the strongest predictor for these atom-level tasks, followed by RBF SVM and then Linear SVM. Overall, the performance differences between models and predictors were more pronounced at the atom level than at the molecule level.

The superior performance of KNN on atom-level tasks highlights the importance of local neighborhood information, in addition to the atom type itself. This is expected, as the embeddings capture both the element identity and its surrounding chemical context, which are critical features for predicting the atom-level properties considered here.

Discussion

Chemical language models encode molecular and atom-level information in complex latent spaces, but the extent to which design choices influence these embeddings remains unclear. To investigate this, we systematically compared multiple model configurations, varying molecular representations, tokenization strategies, and architectures, and examined their impact on both downstream predictive performance and the structure of latent embeddings.

Our comparative analysis systematically examined multiple aspects of model performance and latent space structure. We began by assessing the performance of each model configuration on downstream tasks following fine-tuning, establishing a baseline to explore how tokenization, molecular representation, and architecture influence embedding quality. SMILES and SELFIES representations produced largely comparable results, with SMILES showing marginally better performance in some tasks, a difference confirmed by Wilcoxon signed-rank tests. Models using atomwise tokenization and SentencePiece performed similarly well, with a slight advantage for atomwise tokenization. BART and RoBERTa architectures exhibited overall similar performance, with task-dependent variations but no configuration consistently surpassing all others. These findings suggest that the models learn robust internal representations that generalize across diverse input formats and preprocessing strategies, demonstrating the flexibility of these architectures in adapting to varied chemical representations. Task-specific factors, such as training on datasets enriched in isomers, further enhanced performance when directly relevant to a given task.

We next assessed the performance of probing predictors trained on frozen embeddings for the same downstream tasks. We observed that the benefit of fine-tuning is highly dependent on the specific task and, likely, the relationship between pretraining and downstream data. Importantly, correct fine-tuning was never found to be detrimental, even when the performance gains were modest.

The analysis of latent space embeddings reveals that different molecular representations and model architectures capture complementary aspects of chemical information. Tasks emphasizing global molecular features, such as MolLogP or the presence of heterocycles and hydrogen-bond donors, tend to benefit from architectures that can integrate information across the entire molecule, exemplified by the slight advantage of BART with SMILES. In contrast, predictions of descriptors that rely on local connectivity or atom-level contributions, including Kappa1, Chi0v, and MolMR, show improved performance with SELFIES representations and RoBERTa architectures when paired with KNN or RBF SVM predictors that emphasize local neighborhood information. The greater variability observed in these tasks may reflect sensitivity to subtle structural differences that strongly influence atom-level descriptors. Collectively, these patterns suggest that embeddings may encode complementary structural features depending on the representation and model, indicating that the choice of representation and architecture influences how chemical information is captured and interpreted.

The cosine similarities between molecules from different chemical classes showed that BART embeddings tended to produce more clearly separated clusters than RoBERTa. This confirms differences in how these architectures structure chemical space, with variations observed between SMILES and SELFIES representations. Across all models, including untrained ones, molecule length emerged as a feature encoded within the embeddings, appearing slightly more pronounced in RoBERTa. Vector operations further revealed that embeddings capture logical chemical relationships, although no single model or representation consistently outperformed the others. While these analyses are limited by the small set of vector operations and the constrained molecule space, they offer a glimpse into the types of structural and relational information that can be extracted from latent spaces. We encourage interested readers to explore further relationships on their own using our supplied code.

Examining atom-level embeddings highlights how different representations influence the granularity of chemical information captured by the models. The initial clustering observed even in untrained embeddings, particularly for distinctions such as lowercase versus uppercase carbon in SMILES, suggests that certain structural signals are inherently encoded by the representation itself. Training further sharpens these distinctions, allowing models to differentiate subtle variations in atomic environments. The combination of RoBERTa and SMILES appears particularly effective at capturing these fine-grained patterns.

The strong performance of KNN on atom-level tasks indicates that the pre-trained embeddings capture essential atomic features, including element identity and aspects of the local chemical environment. This suggests that these embeddings already encode much of the information necessary for atom type classification, allowing a relatively simple, locality-focused method like KNN to perform effectively. In comparison, linear SVMs, which rely on global decision boundaries, may struggle to reconcile the naturally clustered structure of atoms in embedding space. The slightly better performance observed with SMILES representations likely reflects the additional chemical information they encode, such as aromaticity, which enriches the embeddings and enhances the model's ability to distinguish subtle variations in atomic environments.

Finally, with respect to preprocessing, we observed that SentencePiece tokenization can reduce training time due to shorter sequence lengths. However, this efficiency comes at a cost: the resulting subword units are less directly interpretable than the atomwise tokens, making it harder to trace chemical information in the embeddings.

Conclusion

Our systematic evaluation of chemical language models highlights the nuanced ways in which design choices shape the structure and chemical fidelity of latent embeddings. By isolating and evaluating the effects of each of these variables across a range of molecule- and atom-level tasks, we aimed to clarify how architectural and preprocessing choices shape the ability of chemical language models to learn chemically meaningful embeddings. While downstream predictive performance was often similar across configurations, probing analyses revealed substantial differences in how atomic and molecular features are captured and organized.

Our results indicate that, for predictive tasks, embeddings generated from SMILES and SELFIES representations, as well as from BART and RoBERTa architectures, are largely comparable and interpretable. Subtle differences in the embedding space between the different models suggest that design choices can influence specific aspects of the learned chemical representations. Overall, these findings suggest that multiple configurations can yield robust and chemically meaningful embeddings, providing flexibility in model selection depending on the intended application.

Looking forward, several avenues could further advance this line of research. Exploring alternative molecular representations, such as t-SMILES [44] or DeepSMILES [45], could yield more robust and generalizable models. A direct comparison of BART’s generative capabilities with other large language model architectures, such as GPT [46], would clarify the relative strengths of these approaches. Investigating potential synergies between language models and graph-based architectures, like GROVER [47], could expand the methodological toolkit for molecular modeling. Finally, leveraging highly curated datasets from specialized chemical domains may reveal subtle structure–property relationships and drive progress in targeted applications.

Supplementary Information

Supplementary file 1. (3.9MB, pdf)

Acknowledgements

We thank Noah Kleinschmidt for careful review of the paper and valuable feedback on the project.

Author contributions

Inken Fender: conceptualisation, data curation, formal analysis, investigation, methodology, software, validation, visualisation, writing. Jannik Adrian Gut: conceptualisation, data curation, formal analysis, investigation, methodology, software, validation, visualisation, writing. Thomas Lemmin: conceptualisation, investigation, funding acquisition, methodology, supervision, visualisation, writing.

Funding

This work is supported by funds from the FreeNovation 2023 grant and the Swiss National Science Foundation (PCEFP3 194606).

Data availability

Pre-training data were taken from the PubChem dataset [2] and fine-tuning datasets from MoleculeNet [25, 26]. The atom-level dataset was supplied by DASH properties [37]. Implementation code is available on GitHub: https://github.com/ibmm-unibe-ch/SMILES_or_SELFIES. Pre-trained models and tokenizers can be found on Zenodo: https://zenodo.org/records/16926537.

Declarations

Competing interests

None declared.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Inken Fender and Jannik Adrian Gut have contributed equally to this work.

References

  • 1. Weininger D (1988) SMILES, a chemical language and information system. J Chem Inf Comput Sci 28:31–36
  • 2. Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885
  • 3. Ahmad W, Simon E, Chithrananda S, Grand G, Ramsundar B (2022) ChemBERTa-2: towards chemical foundation models. arXiv preprint arXiv:2209.01712
  • 4. Schwaller P et al (2019) Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent Sci 5:1572–1583
  • 5. Chilingaryan G et al (2024) BARTSmiles: generative masked language models for molecular representations. J Chem Inf Model 64:5832–5843
  • 6. Sadeghi S, Bui A, Forooghi A, Lu J, Ngom A (2024) Can large language models understand molecules? BMC Bioinformatics 25:225
  • 7. Ross J et al (2022) Large-scale chemical language representations capture molecular structure and properties. Nat Mach Intell 4:1256–1264. https://www.nature.com/articles/s42256-022-00580-7
  • 8. Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach Learn Sci Technol 1:045024
  • 9. Kudo T, Richardson J (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: Blanco E, Lu W (eds) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 66–71. Association for Computational Linguistics, Brussels, Belgium. https://aclanthology.org/D18-2012/
  • 10. Li X, Fourches D (2021) SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. J Chem Inf Model 61:1560–1569
  • 11. Krenn M et al (2022) SELFIES and the future of molecular string representations. Patterns 3
  • 12. Leon M, Perezhohin Y, Peres F, Popovič A, Castelli M (2024) Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. Sci Rep 14:25016
  • 13. Yüksel A, Ulusoy E, Ünlü A, Doğan T (2023) SELFormer: molecular representation learning via SELFIES language models. Mach Learn Sci Technol 4:025035
  • 14. Flam-Shepherd D, Zhu K, Aspuru-Guzik A (2022) Language models can learn complex molecular distributions. Nat Commun 13:3293
  • 15. Sultan A, Sieg J, Mathea M, Volkamer A (2024) Transformers for molecular property prediction: lessons learned from the past five years. J Chem Inf Model 64:6259–6280
  • 16. Kimber TB, Gagnebin M, Volkamer A (2021) Maxsmi: maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. Artif Intell Life Sci 1:100014
  • 17. Kim S et al (2023) PubChem 2023 update. Nucleic Acids Res 51:D1373–D1380
  • 18. Landrum G et al (2025) rdkit/rdkit: 2024_09_5 (Q3 2024) release. 10.5281/zenodo.14779836
  • 19. Wolf T et al (2020) Transformers: state-of-the-art natural language processing, pp 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
  • 20. Liu Y et al (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692
  • 21. Lewis M et al (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp 7871–7880
  • 22. Vaswani A et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
  • 23. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186
  • 24. Ott M et al (2019) fairseq: a fast, extensible toolkit for sequence modeling
  • 25. Wu Z et al (2017) MoleculeNet: a benchmark for molecular machine learning. CoRR abs/1703.00564
  • 26. Ramsundar B et al (2019) Deep learning for the life sciences. O'Reilly Media
  • 27. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press
  • 28. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
  • 29. Pedregosa F et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
  • 30. Hall LH, Kier LB (1991) The molecular connectivity chi indexes and kappa shape indexes in structure-property modeling. Reviews in Computational Chemistry, pp 367–422
  • 31. Wildman SA, Crippen GM (1999) Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 39:868–873
  • 32. Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL (2012) Quantifying the chemical beauty of drugs. Nat Chem 4:90–98
  • 33. Healy J, McInnes L (2024) Uniform manifold approximation and projection. Nat Rev Methods Primers 4:82. 10.1038/s43586-024-00363-x
  • 34. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58:267–288. 10.1111/j.2517-6161.1996.tb02080.x
  • 35. He X, Man VH, Yang W, Lee T-S, Wang J (2020) A fast and high-quality charge model for the next generation general AMBER force field. J Chem Phys 153:114502. 10.1063/5.0019056
  • 36. Wang J, Wang W, Kollman PA, Case DA (2006) Automatic atom type and bond type perception in molecular mechanical calculations. J Mol Graph Model 25:247–260
  • 37. Lehner MT, Katzberger P, Maeder N, Landrum GA, Riniker S (2024) DASH properties: estimating atomic and molecular properties from a dynamic attention-based substructure hierarchy. J Chem Phys 161
  • 38. Mulliken RS (1955) Electronic population analysis on LCAO-MO molecular wave functions. I. J Chem Phys 23:1833–1840
  • 39. Morell C, Grand A, Toro-Labbé A (2005) New dual descriptor for chemical reactivity. J Phys Chem A 109:205–212
  • 40. Verstraelen T et al (2016) Minimal basis iterative stockholder: atoms in molecules for force-field development. J Chem Theory Comput 12:3894–3912
  • 41. Bayly CI, Cieplak P, Cornell W, Kollman PA (1993) A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. J Phys Chem 97:10269–10280
  • 42. Kreyszig E, Stroud K, Stephenson G (2008) Advanced engineering mathematics. Integration 9:1014
  • 43. Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in Statistics: Methodology and Distribution, pp 196–202
  • 44. Wu J-N et al (2024) t-SMILES: a fragment-based molecular representation framework for de novo ligand design. Nat Commun 15:4993
  • 45. O'Boyle N, Dalke A (2018) DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. ChemRxiv preprint chemrxiv:7097960
  • 46. Achiam J et al (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774
  • 47. Rong Y et al (2020) Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst 33:12559–12571
