Communications Chemistry
. 2025 Apr 11;8:114. doi: 10.1038/s42004-025-01484-4

Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa

Georgios Kallergis 1,2, Ehsannedin Asgari 1,9, Martin Empting 3,4,5, Anna K H Hirsch 4,5,6, Frank Klawonn 7,8, Alice C McHardy 1,2,4
PMCID: PMC11992043  PMID: 40216964

Abstract

Computational techniques for predicting molecular properties are emerging as key components for streamlining drug development, optimizing time and financial investments. Here, we introduce ChemLM, a transformer language model for this task. ChemLM leverages self-supervised domain adaptation on chemical molecules to enhance its predictive performance. Within the framework of ChemLM, chemical compounds are conceptualized as sentences composed of distinct chemical ‘words’, which are employed for training a specialized chemical language model. On the standard benchmark datasets, ChemLM either matched or surpassed the performance of current state-of-the-art methods. Furthermore, we evaluated the effectiveness of ChemLM in identifying highly potent pathoblockers targeting Pseudomonas aeruginosa (PA), a pathogen that has shown an increased prevalence of multidrug-resistant strains and has been identified as a critical priority for the development of new medications. ChemLM demonstrated substantially higher accuracy in identifying highly potent pathoblockers against PA when compared to state-of-the-art approaches. An intrinsic evaluation demonstrated the consistency of the chemical language model’s representation concerning chemical properties. The results from benchmarking, experimental data and intrinsic analysis of the ChemLM space confirm the wide applicability of ChemLM for enhancing molecular property prediction within the chemical domain.


Subject terms: Cheminformatics, Computational chemistry


Computational prediction of molecular properties is emerging as a key component for streamlining drug development. Here, the authors report ChemLM, a transformer language model that leverages self-supervised domain adaptation on chemical molecules to enhance the predictive performance, demonstrating high accuracy in identifying highly potent pathoblockers against Pseudomonas aeruginosa.

Introduction

Approximately 12 years1 and $1.8 billion are typically required before a drug reaches the market2, and there is an overall failure rate of 96% for candidate compounds3. The discovery and development of novel anti-infectives, especially against bacterial pathogens, are challenging and prone to setbacks4. Despite unmet medical needs and the steadily increasing threat of antimicrobial resistance (AMR), the lack of new antibiotics with novel, resistance-breaking modes of action has resulted in an ‘innovation gap’, potentially leading to a ‘post-antibiotic era’5. In this scenario, the available treatment options for bacterial infections become ineffective, primarily due to the spread of multi- and pan-resistant strains. This is already evident with pathogens like Pseudomonas aeruginosa, frequently found with multiple drug resistances in clinical settings6. Consequently, the World Health Organization (WHO) has identified the need for new antibiotics targeting this bacterium as a critical priority.

Languages consist of sequences from finite elements7, making the distributional hypothesis applicable: “A word is characterized by the company it keeps”8. Thus, language processing techniques leverage contextual similarities9,10, aiding applications in protein, DNA, and chemical sequences11–16. SMILES, which stands for Simplified Molecular-Input Line-Entry System17, aligns with this linguistic framework18, enabling its use in language models such as Word2Vec9,19,20 and RNNs18,21 in the past, and, more recently, Transformers22. Transformers leverage large chemical sequence datasets through transfer learning23,24. This approach pretrains models on broad tasks with abundant data before fine-tuning them for specific applications, enhancing performance and convergence speed. Initially developed for supervised learning, transfer learning now extends to self-supervised tasks25–27, allowing model pretraining on massive datasets.

Here, we describe ChemLM, a language model for efficient transfer learning for chemical compounds. ChemLM treats the SMILES representations of molecules as sentences of the input language and uses a three-stage training process for predicting a specific molecular property of chemical compounds. This includes pretraining of a self-supervised language model on large datasets, self-supervised training on further domain-specific data, and subsequent model optimization in a supervised setting. With this, we aimed for a model that can be applied to real-world datasets of experimental compounds that comprise only limited numbers of training compounds. We assessed whether training the language model with domain adaptation, which adapts the pretrained model to further data from the target domain, enhances the model’s predictive ability. We performed extensive performance comparisons to state-of-the-art models. We furthermore investigated whether the model successfully captures the underlying chemical information and reproduces the chemical space. Moreover, we predicted the potency of candidate pathoblocker compounds against Pseudomonas aeruginosa from an experimental dataset encompassing just 219 compounds, demonstrating the value of ChemLM for this application in the drug discovery process.

Results

The ChemLM method

ChemLM is a transformer-based model that processes molecular SMILES as sentences representing chemical structures. It is trained in three stages: (i) self-supervised pretraining, (ii) domain-specific pretraining, and (iii) fine-tuning for molecular property prediction (Fig. 1a). Initially, a transformer-based language model learns chemical language from a large compound corpus (pretraining). Then, it undergoes further self-supervised training on domain-specific compounds, optionally using data augmentation. Finally, the model is fine-tuned through supervised training for specific tasks. Throughout, SMILES representations are processed into chemical “words” as input for ChemLM (Fig. 1b).

Fig. 1. The ChemLM training strategy.


a Training stages of the ChemLM model. All the trained models are represented by circular shapes; BBPE models are in purple and RoBERTa is in yellow. Procedures like training, augmentation, and prediction are indicated with rectangles. The dashed line indicates the flow of information within a training stage, whereas the solid line describes the transfer of knowledge from one training stage to another. b An example of how a SMILES string is processed and treated by the ChemLM transformer model. First, it is tokenized and special tokens are added to the sequence. Then, these are fed into the model and, at the end, the sum of weights from the hidden layers is used to make predictions.

(i) Language-model pretraining: Pretraining, a key step in transfer learning, involves training the model on millions of samples before fine-tuning for a specific task. Masked language modeling (MLM) randomly masks input tokens, training the model to predict them using the surrounding context. ChemLM was first trained on 10 million ZINC compounds using MLM, following BERT25. Unlabeled tokenized SMILES data were used to learn compound representations, creating the ChemLM base model that encodes the syntax and semantics of chemical compounds.
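The masking scheme described above can be sketched in a few lines of plain Python. This is an illustrative toy only: ChemLM tokenizes SMILES with a trained BBPE tokenizer, whereas here each character is treated as one token, and the small vocabulary below is invented for the example.

```python
import random

def mask_tokens(tokens, rng, mask_token="[MASK]", vocab=None, p=0.15):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% a random vocabulary token, 10% stay unchanged. Labels hold
    the original token at masked positions and None elsewhere."""
    vocab = vocab or ["C", "N", "O", "=", "(", ")", "1", "2"]
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: token is kept unchanged, but still predicted
    return masked, labels

rng = random.Random(42)
tokens = list("CC(=O)Oc1ccccc1")  # aspirin fragment, character tokens
masked, labels = mask_tokens(tokens, rng)
```

During training, the model only receives a loss signal at positions where the label is set (in practice an ignore index such as -100 rather than None), and must reconstruct the original token from the surrounding chemical context.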

(ii) Domain adaptation for the language modeling: In this stage, the pretrained model is further trained on domain-specific, unlabeled data, refining its ability to capture task-specific structures and improving performance28,29. Domain adaptation addresses domain shift, i.e., differences in data distributions between pretraining and target tasks, which can hinder generalization. This is crucial since ChemLM, trained on millions of diverse compounds, must perform well on structurally similar molecules. In natural language processing (NLP), domain adaptation resembles fine-tuning, achieved through continued pretraining or smaller task-specific datasets, as shown by Gururangan et al.30. Similarly, MLM-based domain adaptation has been effective in NLP31, demonstrating that unsupervised task-specific training enhances transformer models. To counter limited domain-specific data, we applied SMILES enumeration32, generating additional representations by reordering atoms (Supplementary Algorithm 1). This is a computationally efficient augmentation method to expand the whole dataset by several factors. Since the model is trained unsupervised using MLM, no information leaks into the evaluation phase.
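The idea behind SMILES enumeration can be illustrated with a toy example. The sketch below serializes a tree-shaped molecular graph by depth-first traversal with a random starting atom and random neighbor order, which is the mechanism that yields alternative, equivalent SMILES strings. It ignores bond orders, rings, stereochemistry, and charges, so it is only a conceptual stand-in for the cheminformatics-toolkit-based enumeration (Supplementary Algorithm 1) used in the actual pipeline.

```python
import random

def write_smiles(atom, parent, atoms, adj, rng):
    """Serialize a tree-shaped molecular graph to SMILES by DFS,
    visiting neighbors in random order (this is what creates the
    alternative, equivalent SMILES strings)."""
    neighbors = [n for n in adj[atom] if n != parent]
    rng.shuffle(neighbors)
    out = atoms[atom]
    # all but the last branch go in parentheses
    for n in neighbors[:-1]:
        out += "(" + write_smiles(n, atom, atoms, adj, rng) + ")"
    if neighbors:
        out += write_smiles(neighbors[-1], atom, atoms, adj, rng)
    return out

def enumerate_smiles(atoms, bonds, n_aug, seed=0):
    adj = {i: [] for i in atoms}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    rng = random.Random(seed)
    out = set()
    for _ in range(n_aug):
        root = rng.choice(list(atoms))
        out.add(write_smiles(root, None, atoms, adj, rng))
    return out

# ethanol as a labeled graph: C-C-O
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1), (1, 2)]
variants = enumerate_smiles(atoms, bonds, 50)
```

For ethanol this can yield up to four equivalent strings (CCO, OCC, C(C)O, C(O)C), all denoting the same molecule and all usable as additional unlabeled training samples.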

(iii) Supervised fine-tuning of the transformer language model network: In the final phase, the trained model undergoes supervised fine-tuning. To prevent overfitting, we deploy early stopping in addition to standard techniques in model development, e.g., L2 regularization. Instead of freezing the transformer’s layers and fine-tuning only the classification head, we chose to unfreeze all of them and fine-tune them further to optimize performance. The attention maps, spread across the various layers of a transformer model trained on chemical compounds, can be utilized to demonstrate how different chemical tokens interact in creating the final language model-based embedding of these compounds (Supplementary Fig. 1).

Architecture optimization

Hyperparameters significantly impact deep learning models; we therefore analyzed key parameters of ChemLM and transformers for molecular property prediction. Using Optuna, we optimized the augmentation number, hidden layers, attention heads, and embedding type (Supplementary Table 1) and assessed their influence via Optuna’s f-ANOVA test (Fig. 2). A crucial factor was the augmentation number in domain adaptation training, i.e., the number of alternative SMILES forms generated per compound. We tested values between 0 and 100 randomized SMILES and found that high values (80–100) were consistently selected, at the cost of a linear increase in training time (Supplementary Table 2). Following BERT25, we explored optimal embeddings by combining layer weights through summation or averaging, either in the last layer or across multiple layers. We also compared using the first token versus all tokens, as the first token encapsulates sequence information and receives the most attention25,33. The embedding type strongly influenced performance (Fig. 2), whereas the numbers of attention heads and layers had minimal impact. The final optimized hyperparameters for each task are detailed in Supplementary Table 3.
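The variance-based importance analysis can be mimicked with a small, self-contained sketch. The code below is not Optuna's f-ANOVA implementation; it is a crude stand-in that scores each hyperparameter by the fraction of trial-score variance explained by grouping trials on that parameter, using a synthetic objective in which the embedding type dominates (all names and numbers here are invented for illustration).

```python
import random
from statistics import mean, pvariance

def importance(trials, param):
    """Fraction of score variance explained by grouping trials on one
    hyperparameter (a crude, f-ANOVA-flavored importance measure)."""
    scores = [t["score"] for t in trials]
    total = pvariance(scores)
    groups = {}
    for t in trials:
        groups.setdefault(t[param], []).append(t["score"])
    overall = mean(scores)
    # between-group variance: how much the group means differ
    between = sum(len(g) * (mean(g) - overall) ** 2
                  for g in groups.values()) / len(scores)
    return between / total

rng = random.Random(0)
trials = []
for _ in range(200):
    emb = rng.choice(["first_token", "mean_pool", "sum_layers"])
    heads = rng.choice([4, 8, 12])
    # synthetic objective: embedding type dominates, heads barely matter
    score = {"first_token": 0.70, "mean_pool": 0.80, "sum_layers": 0.90}[emb]
    score += 0.01 * (heads == 8) + rng.gauss(0, 0.005)
    trials.append({"emb": emb, "heads": heads, "score": score})
```

Under this synthetic objective, the embedding-type importance comes out far higher than that of the number of attention heads, mirroring the qualitative pattern reported in Fig. 2.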

Fig. 2. Importance of hyperparameters in the model’s performance during hyperparameter optimization, using the validation data of each dataset.


The examined hyperparameters are: the embeddings type, the number of attention heads and hidden layers, and the augmentation number.

ChemLM identifies potent pathoblockers for P. aeruginosa

In drug discovery, oftentimes only a very limited number of compounds is available, substantially fewer than those included in commonly used benchmark datasets for chemical property prediction tasks. To assess the value of the ChemLM model for a real-world drug discovery problem, we employed it to identify potent pathoblocker compounds acting against P. aeruginosa (Fig. 3a), one of the priority pathogens identified by the World Health Organization, often characterized by multidrug resistance6. The class of compounds that we focused on disrupts the quorum-sensing (QS) machinery of P. aeruginosa34–38 (PqsR Inverse Agonists 2018 Ref. No. WO2020007938A1 (EP18181475), New PqsR Inverse Agonist 2020 (EP20150104), and Novel PqsR Inverse Agonists 2020 Ref. No. WO2021136805A1 (EP20150119)), using a compound library of 219 structures with varying potency. The drug target is the QS receptor and transcription factor PqsR39.

Fig. 3. Description of experimental data.


a Chemical structure and the number of compounds per class. b Performance comparison of ChemLM with graph neural networks and transformer-based approaches in 5-fold validation for experimental compounds on Pseudomonas aeruginosa. The graph neural networks (blue) are graph attention transformers (GAT)42, message-passing neural networks (MPNN)40, and graph convolutional neural networks (GCNN)41. MolBERT44, MolFormer43, and ChemBERTa23 are transformer-based approaches. The ChemLM model is noted in red. Gray dots represent the F1-scores achieved by the models across the five folds.

Small molecular compounds acting on PqsR via an inverse-agonistic mode of action reduce the production of several virulence factors such as the toxin pyocyanin. The initial hit already impaired pyocyanin production with a potency in the double-digit micromolar range and was characterized by a trifluoromethyl-pyridine fragment37. A lead generation campaign via structure-guided fragment growing was initiated, which yielded five QS inhibitor classes with substantially increased potency34–36 (Fig. 3a) while retaining this fragment motif. The inhibitor classes are described in more detail in peer-reviewed journals or patent applications (Supplementary Table 4). We use the IC50 to measure drug potency, i.e., the inhibitor concentration needed to inhibit a biological process in vitro by 50%. Highly potent compounds have an IC50 of <500 nM. For the five classes, the number of compounds and their potencies vary considerably; classes contain from 2 to 107 compounds, including between 0 and 71 highly potent ones.

To rigorously evaluate the performance of the ChemLM model, we devised a challenging scenario. Given the substantial variation in the number of compounds per class in the compound library, we pursued an alternative approach to partition the data into more similarly sized folds. We employed Ward-linkage hierarchical clustering on the ChemLM embeddings and partitioned the library into five sets of chemically similar compounds, resulting in a more even distribution (Supplementary Table 5). Specifically, we organized the compound library by grouping compounds into these folds based on the similarity of their ChemLM embeddings. This approach ensures that chemically similar compounds, even if they belong to different structural classes, are kept together within the same fold, as opposed to using the initial structural classes. This strategy helps prevent information leakage during model training and introduces a demanding challenge for the ChemLM model. Subsequently, we conducted the third stage of model training using the SMILES representations of compounds from four of the folds. The compounds from the remaining fold were then classified as highly potent or not. This process was repeated for each set of folds (Fig. 3b), and the same hyperparameters were used for all models (Supplementary Table 3).
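The Ward-linkage partitioning step can be sketched as a naive agglomerative procedure. The study clustered ChemLM embeddings with standard hierarchical clustering; the O(n³) toy below operates on made-up 2-D points and is meant only to show the merge criterion.

```python
def ward_clusters(points, k):
    """Agglomerative clustering with the Ward criterion: repeatedly merge
    the two clusters whose union gives the smallest increase in total
    within-cluster variance, until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def centroid(c):
        dim = len(points[0])
        return [sum(points[i][d] for i in c) / len(c) for d in range(dim)]

    def ward_cost(a, b):
        # |A||B| / (|A|+|B|) * squared centroid distance
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# six fabricated "embeddings" forming three obvious pairs
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
              [10.0, 0.0], [10.0, 0.1]]
folds = ward_clusters(embeddings, 3)
```

Cutting the hierarchy at five clusters instead of three yields the kind of similarity-based folds used for cross-validation here, keeping chemically close compounds in the same fold.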

For model assessment, we compared ChemLM to leading graph neural networks and language models. Graph models included MPNN40, GCNN41, and GAT42, run via DeepChem (v2.6.0) with default settings. Language models included MolFormer43, MolBERT44, and ChemBERTa23, using the “PubChem10M_SMILES_BPE_180k” model from Hugging Face, pretrained on 10 million SMILES from PubChem. Approaches that did not provide pretrained models or code were excluded. ChemLM’s training extends the methods used in ChemBERTa and vanilla ChemLM by adding a second training stage on task-specific data with augmented SMILES representations, as well as hyperparameter optimization.

In comparison to these models, ChemLM achieved the highest median of macro-averaged F1-scores (0.90), almost 30% more than that of the second-best model (MPNN; Fig. 3b, Supplementary Table 6). The same holds for all the evaluation metrics we examined. Moreover, its performance in identifying highly potent pathoblockers is consistently high, with the F1-score for that class ranging from above 0.83 to a maximum of 0.92 across the five folds (Supplementary Table 7). Most notably, ChemLM demonstrates consistency when compared to other models, which either fail or perform poorly on this task in certain folds. For instance, ChemBERTa, which achieved a median F1-score of 0.33 across folds (Supplementary Table 6 and Fig. 3b), faced challenges particularly with the positive class, recording an F1-score of only 0.17 for that class in the 4th fold (Supplementary Table 7). These results highlight the value of the optimized ChemLM for identifying highly potent compounds in applications where only a very limited number of compounds is available for task-specific training.

Optimizing ChemLM substantially improves performance

We evaluated ChemLM’s performance on binary classification tasks for molecular property prediction, comparing it to the same models as earlier across three benchmark datasets (Supplementary Table 8). Datasets were split using DeepChem’s splitter45 to maintain class distribution across training (70%), validation (10%), and test (20%) sets. Training parameters for graph neural networks, including epochs and learning rate, were optimized via grid search within the DeepChem framework.

First, a ChemLM vanilla model was trained without a domain adaptation phase or hyperparameter optimization. Its architecture comprised 12 hidden layers and 12 attention heads, with pooling as the embedding type (Supplementary Table 3). A second model, ChemLM domain-adapted, was then additionally trained on domain-specific data with augmented SMILES representations, without hyperparameter optimization and using the same architecture as ChemLM vanilla. Finally, for the ChemLM domain-adapted and optimized model, all hyperparameters were optimized and, in addition, we unfroze the model’s layers for fine-tuning during task-specific training.

To assess performance, we primarily report the macro-averaged F1-score, as it reflects model performance across both positive and negative classes in a balanced way, irrespective of their sizes. For further detail on the model performances, we also provide the accuracy, AUC, precision, and recall values for the individual classes. In terms of the macro-averaged F1-score, the optimized ChemLM was among the top performers in the benchmark evaluation (Fig. 4). It performed substantially better than the graph-based models, with an improvement of up to 0.25 in macro-averaged F1-score on the ClinTox dataset relative to the second-best performing model (0.91 vs. 0.66; Table 1). ChemBERTa had a macro-averaged F1-score of 0.9 and 0.87 on the ClinTox and BBBP datasets, respectively (Table 1 and Supplementary Table 9), performing slightly less well than ChemLM on these (0.92 and 0.88, respectively). On the BACE dataset, ChemBERTa’s performance was notably weaker, achieving a macro-averaged F1-score of 0.69, compared to 0.8 for ChemLM (Supplementary Table 10). Compared to MolBERT, which is also based on transformers, we observed very similar performance on two of the datasets, 0.88 versus 0.89 on BBBP and 0.80 versus 0.81 on BACE, respectively (Supplementary Tables 9 and 10). In contrast, MolFormer achieved a slightly higher F1-score on the BBBP dataset (0.92 versus 0.88; Supplementary Table 9), though performing worse on other evaluation metrics, and a notably better performance on the BACE dataset (0.9 versus 0.8; Supplementary Table 10). However, ChemLM substantially outperformed both on the ClinTox dataset, with the macro-averaged F1-score increasing by 36% from MolBERT’s 0.67 to 0.91 (Table 1). This performance improvement of the optimized ChemLM on the ClinTox dataset is primarily due to the substantially lower performance of all other models on the positive class (Supplementary Table 11), ranging from 0.22 to 0.38 versus 0.84 for the ChemLM model. Notably, the positive class of the ClinTox dataset has the lowest number of samples of all datasets and the largest degree of class imbalance. Similar to what we observed for the experimental pathoblocker dataset, all other models tended to misclassify the few positive samples in this dataset (Supplementary Tables 6 and 11).
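The macro-averaged F1-score used throughout these comparisons weighs each class equally regardless of its size, which is exactly why it penalizes models that miss a small positive class. A minimal reference implementation (toy labels, not data from the study):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with equal
    weight per class, so small classes count as much as large ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# toy imbalanced example: two positives out of five samples
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

Here the positive class scores F1 = 0.5 and the negative class 2/3, so the macro average is 7/12, even though accuracy alone would look better.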

Fig. 4. Performance of ChemLM and state-of-the-art models with the macro averaged F1-score on the test data sets of the benchmark data.


ChemLM and its variations are compared with state-of-the-art models. Red diamonds represent the mean macro-averaged F1-score for each model across the three datasets.

Table 1.

Comparison of ChemLM on ClinTox dataset with its simpler versions, and state-of-the-art models in more evaluation metrics

Model                               F1    AUC   Precision  Recall  Accuracy
MolFormer                           0.55  0.70  0.89       0.70    0.95
MolBERT                             0.67  0.67  0.66       0.67    0.92
ChemBERTa                           0.90  0.84  0.99       0.84    0.98
MPNN                                0.64  0.59  0.87       0.59    0.94
GAT                                 0.59  0.57  0.72       0.57    0.93
GCNN                                0.66  0.61  0.78       0.61    0.94
ChemLM vanilla                      0.48  0.50  0.46       0.50    0.93
ChemLM domain-adapted               0.82  0.75  0.98       0.75    0.96
ChemLM domain-adapted & optimized   0.92  0.86  0.99       0.86    0.98

The macro-averaged score is reported for each metric.

We also observed a substantial improvement between the vanilla and the domain-adapted ChemLM models, demonstrating the benefits of adding the domain adaptation stage and the data augmentation within it. The performance improvements range from 15% for the BACE dataset, from 0.51 to 0.80 for the macro-averaged F1-score, to up to 30% for the ClinTox dataset, from 0.48 to 0.92 with this metric. The complete evaluation of the models for these datasets can be found in the Supplementary material (Supplementary Tables 9 and 10).

ChemLM embeddings reflect molecular properties of chemical compounds

To evaluate whether ChemLM’s embeddings reflect molecular properties relevant to drug efficacy, we analyzed their relationship to chemical properties. Since transformer models are self-supervised, no inherent correlation is expected before fine-tuning. To this end, we calculated the median ratio of distances between property values and embedding vectors for randomly selected compounds from the BBBP dataset, comparing it to a shuffled property dataset. This analysis covered three key physicochemical properties: molecular weight, quantitative estimate of drug-likeness (QED), and polar surface area. Using 200 randomly selected compounds, we generated embeddings over 100 rounds to produce a ratio distribution (see the “Methods” section, Fig. 5a). Results showed significantly lower ratios in ChemLM’s space compared to the shuffled labels (one-sided t-test, Table 2, Supplementary Table 12), indicating that ChemLM embeddings effectively capture meaningful molecular properties.
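The distance-ratio statistic can be written compactly: for every compound pair, divide the absolute property difference by the Euclidean embedding distance and take the median; values lower than under shuffled property labels indicate that the embedding geometry tracks the property. The sketch below uses four fabricated one-dimensional "embeddings", not the BBBP compounds.

```python
import math
from itertools import combinations
from statistics import median

def median_ratio(embeddings, props):
    """Median over all pairs of |property difference| / embedding distance."""
    ratios = []
    for i, j in combinations(range(len(embeddings)), 2):
        dist = math.dist(embeddings[i], embeddings[j])
        ratios.append(abs(props[i] - props[j]) / dist)
    return median(ratios)

# two tight clusters of "compounds" sharing a property value
emb = [[0.0], [0.1], [10.0], [10.1]]
props = [1.0, 1.0, 9.0, 9.0]      # aligned with the embedding space
shuffled = [1.0, 9.0, 1.0, 9.0]   # property labels permuted
real = median_ratio(emb, props)
perm = median_ratio(emb, shuffled)
```

Repeating the shuffling many times, as in the study, yields a null distribution of ratios against which the observed value can be tested.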

Fig. 5. Intrinsic evaluation of ChemLM.


a Distribution plot showing the ratio of property differences relative to embedding distances for ChemLM and a random space across three molecular properties: the quantitative estimate of drug-likeness (QED), molecular weight, and polar surface area (PSA). b Violin plots showing the ratio for the ChemLM and MolBERT models for randomly sampled molecule pairs from the BBBP dataset. c UMAP plots showing the distribution of molecular properties; each dot represents a molecule in the dataset. The molecular weight and PSA values have been scaled using the natural logarithm, while the actual values have been used for QED and the number of aromatic rings.

Table 2.

Median ratio values and their p-value for each molecular property

Molecular property    ChemLM  Random space  Ratio  p-value
Molecular weight      4.951   5.56          0.89   7.48e−32
QED                   0.01    0.012         0.83   5.05e−60
Polar surface area    1.24    1.35          0.93   8.74e−14

Low median values were observed for the ratio of property distances to embedding distances for most properties, and the ratio of ChemLM to the permuted data remained relatively constant. P-values were calculated using a one-tailed t-test to assess whether the mean of the ChemLM ratio distribution was significantly lower than that generated by random permutation of the molecular property labels.

To explore this behavior in similar models, we analyzed the local relationship of the embeddings of MolBERT and ChemLM to these molecular properties by calculating the ratio using the embedding and property distances of drug pairs from randomly subsampled molecules of the BBBP dataset (Fig. 5b, Table 3). Violin plots of the property differences show that both models’ embeddings have a similar relationship to the properties, with comparable median and median absolute deviation values. These findings confirm that both ChemLM’s and MolBERT’s embeddings capture and reflect the molecular properties.

Table 3.

Statistical measures of property differences for ChemLM and MolBERT

                      QED               Polar surface area  Molecular weight
Statistical measure   ChemLM   MolBERT  ChemLM   MolBERT    ChemLM   MolBERT
Max value             0.23     0.07     44.59    9.45       80.34    26.38
Median value          0.008    0.01     0.89     1.02       3.94     4.51
Standard deviation    0.008    0.007    1.63     0.88       4.58     3.48
MAD                   0.005    0.006    0.52     0.58       2.28     2.51

To qualitatively assess our results, we visualized the embeddings of molecules in a two-dimensional space using UMAP. This approach allowed us to determine whether compounds are encoded in meaningful embeddings in the ChemLM model, placing chemicals with similar physicochemical properties in close proximity while maintaining the global structure of the data distribution. We applied this technique to the previously assessed molecular properties, adding the number of aromatic rings to our evaluation (Fig. 5c). For all properties, we observed a gradual change across this space, indicating that molecules with similar properties tend to have similar embeddings.

Discussion

In this study, we introduce ChemLM, a language model for molecular property prediction, incorporating key innovations in chemical language modeling. The first is an additional training stage, in which the model undergoes self-supervised training on domain-specific compound representations, extending the standard pretraining–finetuning approach23,24. Similar unsupervised domain and task adaptation strategies have proven effective in NLP across deep learning architectures29,46, benefiting chemical transformer models. This adaptation phase enhances the model’s understanding of chemical associations, improving predictive performance, especially for tasks with limited domain-specific data. The second innovation is data augmentation on domain datasets, generating alternative sequence representations to increase the number of training instances, a particularly beneficial technique for chemical tasks with small, imbalanced datasets. In this work, we demonstrate the effectiveness of this approach, combined with the domain adaptation stage, which leads to substantial performance improvements, especially for classification tasks involving small, imbalanced datasets.

We comprehensively assessed the ChemLM model on suitable benchmark datasets for molecular property prediction: the BACE (inhibition of the BACE-1 enzyme), BBBP (blood–brain barrier penetration), and ClinTox (clinical toxicity) datasets originating from MoleculeNet. Across all datasets, ChemLM demonstrated a substantial performance gain of up to 20% relative to graph neural networks. These results suggest that a transformer-based approach can surpass the performance of leading graph neural network architectures. When compared to other language processing models, such as MolBERT and MolFormer, ChemLM showed substantial performance improvements, particularly on the highly imbalanced ClinTox dataset, where it outperformed the other methods by more than 20% in F1-score. These findings highlight the effectiveness of ChemLM in accurately identifying the positive class within imbalanced datasets.

As a real-world test case, we also evaluated ChemLM on its ability to identify compounds targeting Pseudomonas aeruginosa, a hospital-acquired pathogen known for its multi-drug resistance. The model demonstrated substantial performance gains in identifying potent pathoblocker compounds effective against Pseudomonas aeruginosa from a chemical compound library, specifically targeting the transcription factor PqsR. To thoroughly evaluate the model’s ability to make predictions for structurally diverse candidate molecules, we stratified the dataset of experimental compounds based on structural similarities for cross-validation. In this evaluation, ChemLM achieved a 30% performance improvement over the second-best model for this task. The model also achieved a high F1-score for the positive class (highly potent pathoblockers) across all folds, demonstrating its ability to generalize effectively and maintain high consistency in performance. These results suggest that the performance gains offered by ChemLM could significantly aid in identifying relevant drug compounds for pharmacological applications. Further development of ChemLM could expand its capabilities from predicting active compounds to estimating activity levels and suggesting potential compound structures via generative models, thereby supporting a wide range of future experimental data analyses.

Compared to standard training, ChemLM showed significant performance gains with expanded self-supervised training across all benchmarks. Hyperparameter optimization identified embeddings as the most influential factor. During domain adaptation, incorporating multiple molecular representations further improved performance. These findings provide valuable insights for future hyperparameter tuning.

To assess the relationship between ChemLM embeddings and drug-relevant properties, we computed the ratio of property differences to embedding distances. Median ratio values were significantly lower than those in a randomly shuffled space. We also analyzed property distributions within ChemLM and MolBERT embeddings, where smaller ratios indicated more coherent mappings. Overall, the evaluation showed that transformer models encode molecular properties meaningfully, providing robust representations across multiple properties.

In summary, we describe ChemLM, an optimized chemical language encoder model designed to predict the molecular properties of chemical compounds. Our evaluation across multiple datasets demonstrated substantial performance improvements achieved through self-supervised training on domain-specific data and data augmentation, resulting in enhanced accuracy in molecular property prediction and the creation of a chemically meaningful encoding space. Additionally, hyperparameter optimization boosted the model’s performance. Together, these findings highlight the potential of transformer models in advancing chemical research. Notably, the model’s primary achievement lies in its successful application to real-world data and predictive challenges. It excelled at identifying potent pathoblockers against P. aeruginosa using a very limited amount of training data, underscoring its promise to accelerate drug discovery efforts in the future.

Methods

Data description

We used two types of datasets to train and evaluate the model’s performance. The first is the ZINC (v15) database, a public collection of millions of chemical compounds47. We retrieved the SMILES representations of the molecules and used them in the pretraining stage of the ChemLM model. The second comprised three benchmark datasets from MoleculeNet48 for predicting the physicochemical properties of molecules (Supplementary Table 8). BACE’s target class indicates binding results for a set of inhibitors of β-secretase 1. The blood–brain barrier penetration (BBBP) dataset is a collection of compounds from a study on blood–brain barrier permeability, in which class labels indicate penetration or non-penetration. ClinTox includes compounds that can be used for the tasks of FDA approval status and clinical trial toxicity; here, we evaluated the models on the second task.

Tokenization using byte pair encoding (BPE)

One critical step is the tokenization of SMILES strings, treating each string as a sentence of tokens. ChemLM uses byte-level byte pair encoding (BBPE)49 for this, as recommended for RoBERTa. Originally a data compression method, BPE50 iteratively replaces the most frequent pair of adjacent symbols with a new symbol, yielding a hybrid between word-level and character-level tokenization. A further advantage is that the vocabulary size is user-defined, which determines the number of tokens and thereby influences the tokenization. Since SMILES strings draw on a much smaller vocabulary than natural languages, we explored different vocabulary sizes trained on ZINC, from the 10,000 suggested by Wang et al.49 down to 2058 tokens (including special tokens), and used the latter setting in all further experiments. The BBPE tokenizer was trained on ZINC; learning byte sequences from such a large SMILES corpus makes it adaptable to a wide range of applications and datasets.
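To make the merge procedure concrete, the following minimal Python sketch learns BPE merges over a toy SMILES corpus. It is an illustration of the merge rule only, not the byte-level BBPE implementation used by ChemLM:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges over a list of SMILES strings.

    Each string starts as a sequence of single characters; in every
    round the most frequent adjacent token pair is merged into a new,
    longer token.
    """
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for idx, seq in enumerate(seqs):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == (a, b):
                    out.append(a + b)  # apply the new merge greedily
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[idx] = out
    return merges, seqs

# Toy corpus: frequent fragments such as aromatic "cc" and aliphatic "CC"
# are merged first, mimicking how BPE discovers chemical "words".
corpus = ["CCO", "CCN", "CCOC", "c1ccccc1"]
merges, tokenized = bpe_merges(corpus, 3)
```

Running the sketch on this corpus merges the aromatic pair "cc" first, then "CC", showing how recurring substructures become single tokens.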

Transformers

ChemLM is based on transformers, a deep learning architecture originally designed with an encoder–decoder structure22. At the core of multi-head attention lies self-attention, which generates improved representations of the sequence elements (tokens) by considering their interactions with neighboring elements. Multi-head attention applies this mechanism several times in parallel, allowing the model to attend to multiple views of the sequence interactions simultaneously and producing more expressive and informative representations. Each encoder layer therefore consists of a multi-head attention sublayer and a position-wise fully connected feed-forward network, each followed by a normalization layer. In transformer attention, each token of the sequence is associated with two real-valued vector representations: a key vector k from the input embedding space and a value vector v from the output embedding space. These vectors can be either randomly initialized or pretrained. The query vector q represents the sequence element for which a new representation is sought and must belong to the same space as the key vectors. To calculate a new representation for the entire sequence, the key, query and value vectors are obtained by multiplying the embeddings with the corresponding learned weight matrices. Matrix multiplications are used to exploit efficiency and parallelization: the embeddings and the query, key and value vectors are packed into the matrices X, Q, K and V. Attention is then calculated as in Eq. (1), where $d_{k}$ denotes the dimension of the key vectors.

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \qquad (1)$$

Instead of a single attention mechanism, the transformer uses multi-head attention, which captures information from different representation subspaces at different positions.

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h})W^{O}, \qquad (2)$$

where each $\mathrm{head}_{i}$ is computed as

$$\mathrm{head}_{i}=\mathrm{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}) \qquad (3)$$

with learned weight matrices $W_{i}^{Q}\in\mathbb{R}^{d_{model}\times d_{k}}$, $W_{i}^{K}\in\mathbb{R}^{d_{model}\times d_{k}}$, $W_{i}^{V}\in\mathbb{R}^{d_{model}\times d_{v}}$ and $W^{O}\in\mathbb{R}^{hd_{v}\times d_{model}}$, where $h$ is the number of parallel attention heads. Another key element is the addition of positional encodings, which provide information about the position of tokens in the sequence and compensate for the absence of recurrent or convolutional elements.

$$PE_{(pos,2i)}=\sin\left(\frac{pos}{10{,}000^{2i/d_{model}}}\right) \qquad (4)$$
$$PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10{,}000^{2i/d_{model}}}\right) \qquad (5)$$

In Eqs. (4) and (5), $pos$ stands for the position in the sequence, $d_{model}$ for the dimension of the output embedding space and $i$ for the embedding dimension index. This architecture offers several advantages that address the limitations of earlier sequence models. The self-attention mechanism captures long-range dependencies by modeling interactions between distant tokens. Transformers are highly scalable and handle variable-length inputs, processing each position independently. Their parallelizable design enables efficient computation, reducing training and inference time. Finally, transfer learning allows pretraining on unlabeled data before fine-tuning for specific tasks.
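Equations (1), (4) and (5) can be reproduced in a few lines of NumPy; this is an illustrative sketch of the operations, not the optimized implementation inside a transformer library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (subtracting the row maximum).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (1)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # token-token interaction scores
    return softmax(scores, axis=-1) @ V

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, Eqs. (4) and (5); d_model must be even."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / 10_000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even embedding indices, Eq. (4)
    pe[:, 1::2] = np.cos(angle)   # odd embedding indices, Eq. (5)
    return pe
```

The positional encodings are simply added to the token embeddings before the first attention layer, which is how position information enters an otherwise order-agnostic model.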

Among autoencoding models, we chose RoBERTa. Based on BERT, it uses learnable position embeddings instead of the sinusoidal encodings of Eqs. (4) and (5). With an average SMILES sequence length of 45 tokens, the inputs fit comfortably within RoBERTa's context length. Language models can be trained on various objectives, such as next-sentence prediction; RoBERTa instead masks tokens and learns to predict them from their context, a procedure called masked language modeling (MLM). This objective helps the model capture the syntax and grammar of SMILES, making RoBERTa well-suited to this application.
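The masking objective can be illustrated with a simplified sketch (full RoBERTa training additionally replaces a fraction of the selected tokens with random tokens or leaves them unchanged; here every selected token is masked):

```python
import random

MASK, MLM_PROB = "<mask>", 0.15

def mask_tokens(tokens, rng):
    """Hide ~15% of the tokens; the model must recover them from context."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < MLM_PROB:
            inputs.append(MASK)   # hidden from the model
            labels.append(tok)    # prediction target for the MLM loss
        else:
            inputs.append(tok)
            labels.append(None)   # position ignored by the loss
    return inputs, labels

rng = random.Random(0)
tokens = list("c1ccccc1N")  # aniline, character-tokenized for simplicity
inputs, labels = mask_tokens(tokens, rng)
```

Training then minimizes the cross-entropy between the model's predictions at masked positions and the original tokens stored in `labels`.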

ChemLM implementation and training

To implement ChemLM, HuggingFace51 (version 0.0.8) was used to configure and train the RoBERTa model in the first training stages. A combination of HuggingFace and PyTorch52 (v1.6) was used for the supervised fine-tuning. In addition, scikit-learn53 (v0.24.1) was used for hierarchical clustering and evaluation metrics, and RDKit54 (v2020.09.1.0) to compute the molecular properties for the intrinsic evaluation.
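A model along these lines can be configured with the HuggingFace transformers API. The values below are illustrative placeholders, except for the 2058-token vocabulary and the 768-dimensional hidden size mentioned in this work; the actual depth and head count were selected by hyperparameter optimization:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=2058,            # BBPE vocabulary size used in this work
    hidden_size=768,            # matches the 768-dim embeddings analyzed later
    num_hidden_layers=6,        # illustrative; actual depth was tuned
    num_attention_heads=12,     # illustrative; actual value was tuned
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)  # trained with the MLM objective
```

The same `config` object can be reused to reload the architecture at each training stage (pretraining, domain adaptation, fine-tuning).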

The ChemLM model was trained using MLM in the first two training stages, pretraining and domain adaptation. The domain adaptation stage used multiple SMILES representations for each molecule, generated using SMILES enumeration, a data augmentation technique for SMILES strings32. We experimented with different numbers of augmentations per molecule (Supplementary Table 2) to find the best-performing one during hyperparameter optimization. All training was performed on an NVIDIA T4 GPU.
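SMILES enumeration can be sketched with RDKit's random SMILES generation; each variant is a different valid string for the same molecule, so augmentation adds input diversity without changing the chemistry:

```python
from rdkit import Chem

def enumerate_smiles(smiles, n_variants, max_tries=100):
    """Generate alternative (non-canonical) SMILES strings for one molecule,
    in the spirit of SMILES enumeration (ref. 32)."""
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(max_tries):
        # doRandom picks a random atom ordering for the output string.
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_variants:
            break
    return sorted(variants)

augmented = enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O", 5)  # aspirin
```

All generated strings canonicalize back to the same molecule, which is the invariant that makes this augmentation label-preserving.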

Intrinsic evaluation

Quantitative evaluation

To assess the degree to which the learned compound embeddings of ChemLM reflect properties relevant for drug efficacy, such as molecular weight, polar surface area and the quantitative estimate of drug-likeness, we examined the continuity of their representation in the embedding space. To this end, we used the ratio r of the property difference to the embedding distance, and compared ChemLM against a null model in which the molecular property values were randomly shuffled across compounds. Specifically, we calculated the properties for a subset of compounds from the BBBP dataset using RDKit and examined the ratio defined in Eq. (6):

$$r=\frac{d_{f}(f(e_{1}),f(e_{2}))}{d_{e}(e_{1},e_{2})} \qquad (6)$$

where $f(e)$ is the property value of the compound with embedding $e$, $d_{f}$ the absolute difference of these values for the embeddings $e_{1}$ and $e_{2}$, $d_{e}$ the Euclidean distance between the embeddings, and $r$ the resulting ratio.

The ratio captures how property values change with variations in the molecular embeddings; a low, stable ratio indicates that small movements in embedding space correspond to predictable changes in the property. Embeddings were taken from the model's final layer. To compare ChemLM with the shuffled random space, we subsampled 200 compounds for 100 rounds from 1200 BBBP compounds. A one-tailed t-test (SciPy v1.8.0) with the 'greater' alternative assessed whether ChemLM's ratios are significantly lower than those of the random space. For the comparison with MolBERT, we calculated the ratio for all molecule pairs among the selected BBBP compounds for both models.
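The comparison against a shuffled space can be sketched as follows, using synthetic embeddings and a property correlated with them (hypothetical data for illustration; the actual analysis used RDKit-computed properties of BBBP compounds):

```python
import numpy as np
from scipy.stats import ttest_ind

def property_embedding_ratio(props, embs):
    """Pairwise ratio r = |f(e1) - f(e2)| / ||e1 - e2||, as in Eq. (6)."""
    ratios = []
    for i in range(len(props)):
        for j in range(i + 1, len(props)):
            d_e = np.linalg.norm(embs[i] - embs[j])
            if d_e > 0:
                ratios.append(abs(props[i] - props[j]) / d_e)
    return np.array(ratios)

rng = np.random.default_rng(0)
embs = rng.normal(size=(50, 8))      # stand-in for model embeddings
props = embs[:, 0]                   # a property the space actually encodes
r_model = property_embedding_ratio(props, embs)
r_random = property_embedding_ratio(rng.permutation(props), embs)

# One-tailed test: are the shuffled-space ratios greater than the model's?
t_stat, p_value = ttest_ind(r_random, r_model, alternative="greater")
```

A coherent embedding space yields systematically lower ratios than the shuffled baseline, which is exactly what the one-tailed test probes.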

Qualitative evaluation

In addition to the quantitative evaluation of the trained space, we projected the 768-dimensional vectors of molecule embeddings to 2D space using the UMAP algorithm. The molecular properties examined are molecular weight, polar surface area (PSA), quantitative estimate of drug-likeness (QED), and the number of aromatic rings. To make the distribution of the properties more evident, we scaled the values of the first two properties using the natural logarithm. In this way, we can visually inspect the distribution of the aforementioned properties.

Supplementary information

Supplementary Information (429.7KB, pdf)

Acknowledgements

G.K. gratefully acknowledges financial support from the Lower Saxony Ministry for Science and Culture for the doctoral program “Drug Discovery and Cheminformatics for New Anti-Infectives (iCA)”.

Author contributions

G.K., E.A., and A.C.M. conceived the study. G.K. implemented the software. E.A. supervised the code development. A.C.M. and E.A. supervised the work. G.K., E.A., and A.C.M. have written the article. A.H. and M.E. have shared the experimental dataset, advised and guided the work on the corresponding part. M.E. has contributed to writing, too. F.K. advised on the intrinsic evaluation and provided feedback. All authors have reviewed the article.

Peer review

Peer review information

Communications Chemistry thanks Jiashun Mao and the other, anonymous, reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Data availability

Data to reproduce the benchmark experiments and intrinsic evaluation are available at https://github.com/hzi-bifo/ChemLM/data. The generated data used for analysis and to produce the figures are located at https://github.com/hzi-bifo/ChemLM/results. Experimental data have been presented in peer-reviewed journals or patent applications and are detailed in Supplementary Table 4.

Code availability

Code is available at https://github.com/hzi-bifo/ChemLM. The code for the experimental part is not included, as the dataset is internal. Models are available at https://huggingface.co/gkallergis.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s42004-025-01484-4.

References

  • 1.Mohs, R. C. & Greig, N. H. Drug discovery and development: role of basic biological research. Alzheimers Dement.3, 651–657 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Paul, S. M. et al. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat. Rev. Drug Discov.9, 203–214 (2010). [DOI] [PubMed]
  • 3.Hingorani, A. D. et al. Improving the odds of drug development success through human genomics: modelling study. Sci. Rep.9, 18911 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Silver, L. L. Challenges of antibacterial discovery. Clin. Microbiol. Rev.24, 71–109 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Kwon, J. H. & Powderly, W. G. The post-antibiotic era is here. Science373, 471–471 (2021). [DOI] [PubMed] [Google Scholar]
  • 6.Karakonstantis, S., Kritsotakis, E. I. & Gikas, A. Pandrug-resistant gram-negative bacteria: a systematic review of current epidemiology, prognosis and treatment options. J. Antimicrob. Chemother.75, 271–282 (2020). [DOI] [PubMed] [Google Scholar]
  • 7.Chomsky, N. Syntactic Structures (De Gruyter Mouton, 2009).
  • 8.Harris, Z. S. Distributional structure. Word World10, 146–162 (1954). [Google Scholar]
  • 9.Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst.26, (2013)
  • 10.Rieger, B. B. On Distributed Representation in Word Semantics. https://www.uni-trier.de/fileadmin/fb2/LDV/Rieger/Publikationen/Aufsaetze/91/icsi91.pdf (2023).
  • 11.Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE10, 1-15, 10.1371/journal.pone.0141287 (2015) [DOI] [PMC free article] [PubMed]
  • 12.Asgari, E., McHardy, A. & Mofrad, M. R. K. Probabilistic variable-length segmentation of protein sequences for discriminative motif mining (DiMotif) and sequence embedding (ProtVecX). Sci. Rep.10.1101/345843 (2019). [DOI] [PMC free article] [PubMed]
  • 13.Asgari, E. Life Language Processing: Deep Learning-based Language-agnostic Processing of Proteomics, Genomics/metagenomics, and Human Languages. UC Berkeley Electronic Theses and Dissertations. University of California, Berkeley (2019)
  • 14.Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022). [DOI] [PubMed]
  • 15.Benegas, G., Albors, C., Aw, A. J., Ye, C. & Song, Y. S. GPN-MSA: An Alignment-based DNA Language Model for Genome-wide Variant Effect Prediction10.1101/2023.10.10.561776. https://www.biorxiv.org/content/10.1101/2023.10.10.561776v1 (2023).
  • 16.Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun.14, 114 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Weininger, D. SMILES, a chemical language and information system: 1: Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci.28, 31–36 (1988). [Google Scholar]
  • 18.Schwaller, P., Gaudin, T., Lányi, D., Bekas, C. & Laino, T. "Found in Translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci.9, 6091–6098 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Öztürk, H., Ozkirimli, E. & Özgür, A. A novel methodology on distributed representations of proteins using their interacting ligands. Bioinformatics34, 295–303 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Chakravarti, S. K. Distributed representation of chemical fragments. ACS Omega3, 2825–2836 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Skinnider, M. A., Greg Stacey, R., Wishart, D. S. & Foster, L. J. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell.3, 759–770 (2021).
  • 22.Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst.30, (2017).
  • 23.Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. Preprint at http://arxiv.org/abs/2010.09885 (2020)
  • 24.Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn.: Sci. Technol.3, 015022 (2022). [Google Scholar]
  • 25.Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (eds Burstein, J., Doran, C., & Solorio, T.) Vol. 1, Long and Short Papers 4171–4186 (Association for Computational Linguistics, Minneapolis, MN, 2019).
  • 26.Dong, L. et al. Unified Language Model Pre-Training for Natural Language Understanding and Generation 13063–13075 (Curran Associates Inc., Red Hook, NY, USA, 2019).
  • 27.Varnek, A. et al. Inductive transfer of knowledge: application of multi-task learning and Feature Net approaches to model tissue–air partition coefficients. J. Chem. Inf. Model.49, 133–144 (2009). [DOI] [PubMed] [Google Scholar]
  • 28.Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. EMNLPIJCNLP 2019—2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (eds. Inui, K., Jiang, J. Ng, V. & Wan, X.) 3615–3620 (Association for Computational Linguistics, 2019).
  • 29.Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proc. ACL 2018—56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 328–339 (Association for Computational Linguistics, 2018).
  • 30.Gururangan, S. et al. Don’t stop pretraining: adapt language models to domains and tasks. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 8342–8360 (Association for Computational Linguistics, 2020).
  • 31.Arefyev, N., Kharchev, D. & Shelmanov, A. Efficient domain adaptation of masked language models for sentiment analysis. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing (eds Moens, M.-F., Huang, X., Specia, L., Yih, S. W.-T.) 9114–9124 (Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021).
  • 32.Bjerrum, E.J. SMILES enumeration as data augmentation for neural network modeling of molecules. Preprint at http://arxiv.org/abs/1703.07076 (2017).
  • 33.Vig, J., Belinkov, Y., John, H. & Paulson, A. Analyzing the structure of attention in a transformer language model. In Proc. 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (eds Linzen, T., Chrupala, G., Belinkov, Y. & Hupkes, D.) 63–76 (Association for Computational Linguistics, 2019).
  • 34.Schütz, C. et al. A new PqsR inverse agonist potentiates tobramycin efficacy to eradicate Pseudomonas aeruginosa biofilms. Adv. Sci.8, 10.1002/ADVS.202004369 (2021). [DOI] [PMC free article] [PubMed]
  • 35.Schütz, C. et al. Divergent synthesis and biological evaluation of 2-(trifluoromethyl)pyridines as virulence-attenuating inverse agonists targeting PqsR. Eur. J. Med. Chem.226, 10.1016/J.EJMECH.2021.113797 (2021). [DOI] [PubMed]
  • 36.Hamed, M. M. et al. Towards translation of PqsR inverse agonists: from in vitro efficacy optimization to in vivo proof-of-principle. Adv. Sci. 2204443 10.1002/ADVS.202204443 (2023). [DOI] [PMC free article] [PubMed]
  • 37.Zender, M. et al. Flexible fragment growing boosts potency of quorum-sensing inhibitors against Pseudomonas aeruginosa virulence. Chemmedchem15, 188 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Abdelsamie, A. S. et al. Discovery and optimization of thiazole-based quorum sensing inhibitors as potent blockers of Pseudomonas aeruginosa pathogenicity. Eur. J. Med. Chem.276, 116685 (2024). [DOI] [PubMed] [Google Scholar]
  • 39.Schütz, C. & Empting, M. Targeting the Pseudomonas quinolone signal quorum sensing system for the discovery of novel anti-infective pathoblockers. Beilstein J. Org. Chem.14, 10.3762/bjoc.14.241 (2018). [DOI] [PMC free article] [PubMed]
  • 40.Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Proc. 34th International Conference on Machine Learning, ICML 2017 Vol. 3, 2053–2070 (2017).
  • 41.Kipf, T.N. & Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. 5th International Conference on Learning Representations, ICLR 2017—Conference Track Proceedings (2016)
  • 42.Veličković, P. et al. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018—Conference Track Proceedings (2017).
  • 43.Ross, J. et al. Large-scale chemical language representations capture molecular structure and properties. Nat. Mach. Intell.4, 1256–1264 (2022). [Google Scholar]
  • 44.Fabian, B. et al. Molecular representation learning with language models and domain-relevant auxiliary tasks. Preprint at https://arxiv.org/abs/2011.13230 [cs.LG] (2020).
  • 45.Ramsundar, B. et al. Deep Learning for the Life Sciences (O’Reilly Media, 2019).
  • 46.Chakrabarty, T., Hidey, C. & McKeown, K. IMHO fine-tuning improves claim detection. North Am. Chapter Assoc. Comput. Linguist. 558–563 10.18653/v1/N19-1054 (2019).
  • 47.Sterling, T. & Irwin, J. J. ZINC 15—ligand discovery for everyone. J. Chem. Inf. Model.55, 2324–2337 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci.9, 513–530 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Wang, C., Cho, K. & Gu, J. Neural machine translation with byte-level subwords. In AAAI 2020—34th AAAI Conference on Artificial Intelligence 9154–9160 (2019).
  • 50.Gage, P. A new algorithm for data compression. C Users J.12, 23–38 (1994). [Google Scholar]
  • 51.Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (Association for Computational Linguistics, 2020).
  • 52.Antiga, L. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd International Conference on Neural Information Processing Systems, 8026–8037 (Curran Associates Inc., 2019).
  • 53.Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res.12, 2825–2830, (JMLR.org, 2011). [Google Scholar]
  • 54.Landrum, G. RDKit: Open-source Cheminformatics. https://www.rdkit.org (2016).


