SELFprot: Effective and Efficient Multitask Finetuning Methods for Protein Parameter Prediction

Marltan Wilson; Thomas Coudrat; Andrew Warden

doi:10.1021/acs.jcim.4c02230

2025 Mar 18;65(7):3226–3238. doi: 10.1021/acs.jcim.4c02230


PMCID: PMC12004530 PMID: 40098257

Abstract


Accurately predicting protein–ligand interactions and enzymatic kinetics remains a challenge for computational biology. Here, we present SELFprot, a suite of modular transformer-based machine learning architectures that leverage the ESM2–35M model architecture for protein sequence and small molecule embeddings to improve predictions of complex biochemical interactions. SELFprot employs multitask learning and parameter-efficient finetuning through low-rank adaptation, allowing for adaptive, data-driven model refinement. Furthermore, ensemble learning techniques are used to enhance the robustness and reduce the prediction variance. Evaluated on the BindingDB and CatPred-DB data sets, SELFprot achieves competitive performance with notable improvements in parameter-efficient prediction of k_cat, K_m, K_i, K_d, IC₅₀, and EC₅₀ values as well as the classification of functional site residues. With comparable accuracy to existing models and an order of magnitude fewer parameters, SELFprot demonstrates versatility and efficiency, making it a valuable tool for protein–ligand interaction studies in bioengineering.

Introduction

The accurate prediction of protein–ligand interactions is fundamental for understanding protein function, metabolism, and the rational design of novel drugs and biocatalysts. Protein–ligand interactions also give insight into the production of secondary metabolites and toxicity; this is especially important for enzyme-substrate interactions.^1,2 Enzyme-substrate interactions play a crucial role in catalysis and metabolic pathways, where kinetic parameters, such as the turnover number (k_cat) and Michaelis constant (K_m), are essential for quantifying enzyme efficiency. Dissociation constants (K_d), inhibition constants (K_i), the half-maximal inhibitory concentration (IC₅₀), and the half-maximal effective concentration (EC₅₀) further characterize the specificity and affinity of these interactions. Computational models capable of accurately predicting these parameters can significantly accelerate bioengineering workflows, including the biocatalytic production and degradation of drugs, food, and novel materials. Other capabilities, such as the classification of a functional site residue and the generation of missing residues or molecular moieties, can assist in the development of novel enzymes and protein binders. Recent advancements in machine learning, especially transformer models, have revolutionized small molecule representation and protein structure and property predictions. Originally developed for natural language processing,³ transformers have shown remarkable success in bioinformatics by capturing long-range dependencies and patterns within biological sequences.⁴⁻⁷ State-of-the-art models such as AlphaFold⁷ and RoseTTAFold⁸ have capitalized on the success of these models in protein structure prediction.⁹ The ESM2 family of protein language models (pLMs) has been designed to interpret the “language of proteins” through pretraining on the UniRef50 data set,¹⁰ enabling improved protein sequence understanding.
ESM2 provides a diverse set of model sizes for finetuning tasks at different levels of protein representation accuracy. The resulting embeddings from pLMs have shown success in de novo protein generation and protein structure and function prediction.^11,12 In parallel, chemical language models have been adapted to tokenize molecular representations such as the simplified molecular input line entry system (SMILES) and self-referencing embedded strings (SELFIES) formats. These models provide a robust method for capturing the diverse chemical space of ligands.^13,14 The SELFIES molecular representation offers some advantages over SMILES strings, such as ensuring all SELFIES can be decoded into valid chemical structures.^13,15,16 However, the complexity of SELFIES introduces challenges in model training, scalability, and interpretation.^17,18 The adoption of SELFIES in transformers presents challenges due to its relative novelty and verbosity, resulting in poor scaling to larger complex molecules since the attention mechanism in transformer blocks scales with quadratic complexity.¹⁹ Traditionally, pretrained transformers require an extensive number of parameters to achieve state-of-the-art accuracy. However, recent studies have shown that small, efficient language models can achieve comparable performance to much larger models through improved optimization techniques and training data.²⁰ Knowledge distillation, ensemble learning, and parameter-efficient finetuning are three methods that have demonstrated efficacy in reducing the size of a model, reducing variance, and refining the model’s performance without requiring prohibitive computational resources.²⁰⁻²² Ensemble learning, where multiple models are trained on variations of the data set, helps improve predictive performance. The final predictions made by averaging or voting improve the generalizability by reducing biases from individual models.
A combination of ensemble learning and small transformer models is likely to reduce the chance of overfitting on small data sets compared to larger and more complex models.²³⁻²⁶ Additionally, low-rank adaptation (LoRA) weights reduce the number of trainable parameters of a model during the finetuning process through low-rank decomposition of the weight matrices of the pretrained model.^21,27 These techniques not only improve the model’s efficiency but also enhance its ability to learn from a limited amount of data, a common challenge in experimental biochemical data sets.²⁸ The significance of enzyme function prediction extends beyond academic interest, impacting drug development,²⁹⁻³⁴ cellular agriculture,³⁶ and bioremediation.³⁵⁻⁴¹ Enzymes are nature’s catalysts, and understanding their interaction with substrates and inhibitors is crucial for designing effective drugs and optimizing metabolic pathways.⁴² Terms such as protein–ligand interaction are usually used as umbrella terms for physically measurable or calculated parameters associated with protein–ligand complexes.^43,44 These parameters can then be condensed into a set of learnable tasks with varying degrees of relatedness for data-driven models.⁴⁵ Models trained on multiple tasks are expected to leverage information in larger data sets to enhance the model’s performance on similar tasks with smaller data sets.^46,47 This multitask approach not only mitigates the data scarcity in certain domains but also exploits the inherent relationships between different types of data sets. In the field of protein–ligand interactions, multitask learning should improve enzyme function prediction by training on data sets derived from various stages of enzyme catalysis.¹¹ These include binding site identification, prediction of binding affinities, and finally prediction of the rate of substrate conversion to product.
Building on the advancements in small pLMs, multitask learning, ensemble learning, and parameter-efficient finetuning, SELFprot employs a transformer-based architecture that leverages the capabilities of the ESM2 pLM and a chemical language model based on the ESM2 architecture to improve the efficiency of predicting enzyme kinetic parameters. Equipping large language models (LLMs) with external tools greatly enhances their capabilities.⁴⁸ However, for a tool to be effective, it must optimize the trade-off triangle of cost, speed, and accuracy; therefore, incorporating SELFprot as a tool in local LLM multiagent workflows would require satisfactory accuracy, a comparably small compute and memory footprint, and low latency.

Methods

Architecture Design

ChEBIFormer-35M, a chemical language model pretrained on the ChEBI and ChEMBL data sets using masked language modeling, is designed specifically for generating small molecules of biological interest (Figure 1). Its architecture was adapted from ESM2–35M with vocabularies for both SELFIES and SMILES tokenization. During pretraining, small molecules of biological interest are tokenized in either SMILES or SELFIES molecular representation. SELFprot integrates pretrained ESM2–35M and ChEBIFormer-35M models to form a cohesive framework capable of interpreting both protein sequences and small molecule representations, as shown in Figure 2. These two models are connected through a joint transformer layer that generates a unified protein–ligand representation. SELFprot is further pretrained on a masked language model task for the conditional generation of protein variants in the presence of known ligand binders and the conditional generation of small molecules in the presence of known protein receptors. The pretraining of the joint transformer layers allows SELFprot to learn residue-level cross-attention scores between proteins and associated small molecules instead of naively combining sequence-level embedding vectors.

Small molecules are represented as either SMILES or SELFIES strings, which are divided into tokens displayed with space separation.

SELFprot architecture showing (a) the input sequence (protein and small molecule) embedding layers. ESM2 is used for protein embedding, while ChEBIFormer is used for small molecule embedding. (b) The embeddings are combined and the joint transformer layer embeds the joint protein–ligand sequence. (c) SELFprot outputs a generated protein and small molecule, binding residue prediction, and the mean and standard deviation of the regression tasks by modeling the output as N(μ_i, σ_i) for i ∈ {k_cat, K_m, K_i, K_d, IC₅₀, EC₅₀}.

Protein and Small Molecule Encoding

The ESM2–35M model encodes protein sequences into internal representations with dimensions N × 480, where N is the sequence length. This transformer-based model is part of a series of ESM2 models that have demonstrated robust performance in interpreting the complex language of proteins, capturing essential biological and chemical nuances encoded in amino acid sequences.¹⁰ ChEBIFormer-35M, used as the small molecule encoder, produces a similar N × 480 embedding. When encoding enzyme-substrate complexes, ChEBIFormer-35M can encode single molecules or multiple substrates and could be further extended to cofactors. However, ChEBIFormer-35M’s embedding dimensions remain the same for multiple or large molecules. After pretraining, the weights of ChEBIFormer-35M and ESM2–35M were frozen, and the models were combined by adding a trainable joint transformer layer followed by output layers for downstream regression and classification tasks.

Finetuning Classification and Regression Analysis

During training, SELFprot learned to generate energetically favorable tokens in the context of past and future tokens; this includes protein amino acids and molecular moieties. The more context given in the neighborhood of the missing token, the more accurate the model will be. Due to the small size and high perplexity of the ESM2–35M model, the SELFprot generative capabilities have been limited to small changes to a predefined protein–ligand scaffold with sufficient tokens in the immediate neighborhood of the region to be generated. Three distinct finetuning strategies are used for SELFprot.

1. SELFprot-Full initially trains the joint transformer layer and the task-specific heads, followed by finetuning of all model layers, including the pretrained protein and chemical language models.
2. SELFprot-LoRA initially trains the joint layer and task-specific heads; all transformer layers are then finetuned using LoRA weights such that each adapted weight matrix takes the form W + AB, where W ∈ R^(d × d) are the pretrained weights and A ∈ R^(d × r) and B ∈ R^(r × d) are matrices with rank r ≪ d. Two variants of SELFprot-LoRA were trained with the rank of the LoRA weight matrix set to 2 and 6 to balance the trade-off between learning and forgetting.
3. SELFprot-Ensemble trains three models independently and in parallel using different initializations and sampled subsets (with replacement) of the training data sets using the bootstrap aggregating technique (bagging). Only the joint transformer layers and the task-specific output layers were finetuned, and the pretrained protein and chemical language models remained frozen.
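The low-rank update used by SELFprot-LoRA can be illustrated with a minimal sketch in plain Python. It assumes that an adapted matrix is square with d = 480 (the embedding width quoted for ESM2–35M) and that the adapted weights are W + AB; the helper names `matmul`, `lora_update`, and `lora_param_counts` are hypothetical and are not taken from the SELFprot code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_update(w, a, b):
    """Return W + AB, leaving the pretrained W itself untouched."""
    ab = matmul(a, b)
    return [[w[i][j] + ab[i][j] for j in range(len(w[0]))] for i in range(len(w))]

def lora_param_counts(d, r):
    """(values in a full d x d matrix, values in A (d x r) plus B (r x d))."""
    return d * d, 2 * d * r

# Toy example with d = 3 and r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0], [2.0], [3.0]]   # d x r
B = [[0.5, 0.0, -0.5]]      # r x d
W_adapted = lora_update(W, A, B)

# Assumed d = 480 with the two ranks mentioned in the text.
for rank in (2, 6):
    full, low_rank = lora_param_counts(480, rank)
    print(f"r = {rank}: {low_rank} trainable values instead of {full}")
```

Under these assumptions, a rank of 2 or 6 trains 1920 or 5760 values per matrix instead of 230,400, which is the source of the parameter savings.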

Multitask learning was used for each of the finetuning strategies. For each regression task, an additional layer was added that used the mean pooled output of the joint transformer layer to make a sequence-level prediction. However, the full embeddings from the joint transformer layer were used as input for the classification task since it required token-level predictions. The tokens from the joint transformer layer representation of the ligand sequence were not included in the classification task. The multitask regression head predicts the means and standard deviations for enzyme kinetic parameters, including k_cat and K_m, the protein inhibition and dissociation constants K_i and K_d, and half-maximal inhibitory and effective concentrations IC₅₀ and EC₅₀ for proteins. The enzyme kinetic parameters were obtained from the CatPred database.¹¹ The negative log likelihood loss function for a Gaussian distribution, where μ_i and σ_i are the mean and standard deviation predicted for the target value y_i, was applied during model finetuning.

Each task used for the multitask learning approach was given equal weighting for simplicity.
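As an illustration of this loss, the negative log likelihood of a Gaussian with predicted mean μ and standard deviation σ for a target y is ½ log(2πσ²) + (y − μ)²/(2σ²). The sketch below evaluates it and averages over tasks with equal weights. Skipping tasks whose label is missing for a given complex is an assumption, as are the function names and the example values.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log likelihood of y under a Gaussian N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def multitask_loss(targets, predictions):
    """Equal-weight mean of the per-task losses.

    targets: task name -> measured value, or None when no label exists (assumption).
    predictions: task name -> (mu, sigma) as produced by a regression head.
    """
    losses = [
        gaussian_nll(y, *predictions[task])
        for task, y in targets.items()
        if y is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical log10-scaled targets and predictions for one complex.
targets = {"k_cat": 1.0, "K_m": None, "K_i": -2.0}
predictions = {"k_cat": (0.5, 1.0), "K_m": (0.0, 1.0), "K_i": (-2.0, 0.5)}
loss = multitask_loss(targets, predictions)
print(round(loss, 4))
```

Because σ appears both in the logarithm and in the denominator, a model trained this way is penalized for overconfident predictions (small σ with a large error) as well as for needlessly wide ones.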

Model Evaluation

SELFprot models are evaluated using the CatPred benchmarking data set¹¹ by comparing performance metrics such as the coefficient of determination (R²) for the regression tasks and the receiver operating characteristic (ROC)-AUC and precision-recall (PR)-AUC for functional residue classification on the held-out test set. While the area under the ROC curve gives a general sense of how the model is performing, the area under the PR curve offers valuable insight into the models’ performance on predicting the positive class when dealing with imbalanced data sets such as functional site residues in a protein sequence. We used the same training, test, and validation splits as outlined in the CatPred-DB¹¹ to maintain consistency and comparability with established models (Figure 3). The distribution of k_cat and K_i values with respect to EC classes in the CatPred-DB test set can be seen in Figure 4 alongside the distribution of the k_cat and K_i values predicted by the SELFprot-base model.
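The difference between the two classification metrics can be seen on a toy imbalanced example. The sketch below computes the ROC-AUC as the fraction of positive-negative pairs ranked correctly, and uses average precision as one common estimator of the PR-AUC. The labels and scores are invented for illustration and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Ten residues, two of which are functional (positive rate 0.2).
labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.1, 0.2, 0.4, 0.7, 0.05]

print("ROC-AUC:", roc_auc(labels, scores))
print("average precision:", average_precision(labels, scores))
```

For this toy example the ROC-AUC is 0.8125 while the average precision is 0.5: two high-scoring negatives barely affect the ROC-AUC but halve the precision, which is why the PR curve is the more informative view when positives are rare.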

SELFprot is finetuned using three different methods: (a) SELFprot-Full is finetuned with fully trainable layers, (b) SELFprot-LoRA is finetuned with trainable LoRA weights along with fully trainable task-specific output layers, and (c) SELFprot-Ensemble has three separate models with trainable joint transformer and task-specific output layers and fixed protein and ligand pretrained layers. SELFprot-base is finetuned in the same way as an individual model from SELFprot-Ensemble.

Distribution of true and predicted k_cat and K_i values across EC classes. The violin plots show the distribution of log₁₀k_cat (A,B) and log₁₀K_i (C,D) for different enzyme commission (EC) classes. Panels (A,C) display the true values, while panels (B,D) show the predicted values from the SELFprot-base model.

For all finetuning tasks, the convergence of SELFprot was determined by the steady state of the validation loss. For the SELFprot-Ensemble model, the final predictions for regression tasks were determined by averaging using eqs 3 and 4, where SD is the standard deviation of the ensemble and μ_i and σ_i are the individual model means and standard deviations. The predictions from the classification tasks were determined by voting.
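A minimal sketch of one common way to aggregate such ensemble outputs is given below; it may differ from eqs 3 and 4. It assumes the ensemble mean is the average of the member means and that the ensemble SD follows from treating the members as an equally weighted Gaussian mixture. Classification uses a simple majority vote. All function names and example values are hypothetical.

```python
import math

def combine_regression(means, sigmas):
    """Ensemble mean and SD under an equal-weight Gaussian mixture (assumption)."""
    n = len(means)
    mu = sum(means) / n
    second_moment = sum(s ** 2 + m ** 2 for m, s in zip(means, sigmas)) / n
    variance = max(second_moment - mu ** 2, 0.0)  # guard against rounding below zero
    return mu, math.sqrt(variance)

def majority_vote(labels):
    """Return 1 when more than half of the binary votes are 1, else 0."""
    return 1 if 2 * sum(labels) > len(labels) else 0

# Three ensemble members predicting one log10-scaled value.
mu, sd = combine_regression([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(mu, round(sd, 3))

# Three members voting on whether a residue is a functional site.
print(majority_vote([1, 1, 0]))
```

With this convention the ensemble SD grows both when individual members are uncertain and when they disagree with each other.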

To determine the effects of different tasks on the multitask training, the tasks were divided into 3 groups: enzyme kinetics (k_cat and K_m), ligand binding (K_i, K_d, EC₅₀, and IC₅₀), and functional site prediction. The SELFprot-base model was then trained with one or none of the tasks excluded, and the resulting error in the remaining tasks is shown in Table 1. k_cat showed a 154% increase in root-mean-square deviation (RMSD) when the binding tasks were removed. K_i also showed a 106% increase in RMSD when the enzyme kinetics tasks were removed. Additionally, there was a 30% increase in K_i RMSD when the functional site classification task was removed; however, there was a slight 4% decrease in functional site classification precision when binding tasks were included during training (see Table 1).

Table 1. Impact of Excluding Tasks on Prediction Errors for Protein–Ligand Interaction Parameters in the SELFprot-Base Multitask Learning Model^a.

excluded task	k_cat error (RMSD)	K_i error (RMSD)	binding site error (precision)
k_cat/K_m		3.116	0.277
K_i/K_d/IC₅₀/EC₅₀	2.810		0.281
site classification	2.293	1.957
none	1.105	1.506	0.268


The table presents the RMSD for k_cat and K_i, where lower is better, and precision for functional site classification, where higher is better.
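The percentage changes quoted in the text can be checked directly from the Table 1 entries. The short sketch below does this arithmetic; the function and variable names are illustrative only.

```python
def percent_change(value, reference):
    """Relative change of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference

# RMSD and precision values copied from Table 1.
kcat_without_binding = percent_change(2.810, 1.105)   # binding tasks excluded
ki_without_kinetics = percent_change(3.116, 1.506)    # k_cat/K_m excluded
ki_without_site = percent_change(1.957, 1.506)        # site classification excluded
# Precision with binding tasks included (0.268) relative to excluded (0.281).
precision_with_binding = percent_change(0.268, 0.281)

for name, value in [
    ("k_cat RMSD, binding tasks excluded", kcat_without_binding),
    ("K_i RMSD, kinetics tasks excluded", ki_without_kinetics),
    ("K_i RMSD, site classification excluded", ki_without_site),
    ("precision, binding tasks included", precision_with_binding),
]:
    print(f"{name}: {value:+.1f}%")
```

The four values come out at roughly +154%, +107%, +30%, and −4.6%, in line with the figures quoted in the text to within rounding.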

Table 2. Percentage of Protein–Ligand Complexes with Predicted Values Within One Order of Magnitude of the Experimental Value (P_1mag) of k_cat and K_m Predicted by SELFprot-base, SELFprot-LoRA (r = 2), SELFprot-LoRA (r = 6), SELFprot-Full, and SELFprot-Ensemble.

model	k_cat (P_1mag) (%)	K_m (P_1mag) (%)
SELFprot-base	72.1	76.8
SELFprot-LoRA (r = 2)	72.3	77.7
SELFprot-LoRA (r = 6)	72.6	77.8
SELFprot-Full	72.7	77.2
SELFprot-Ensemble	71.1	77.0
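The P_1mag metric reported in Table 2 can be sketched as follows, assuming that both experimental and predicted values are expressed as log₁₀ values so that "within one order of magnitude" means an absolute difference of at most 1. Whether the boundary case counts as a hit is an assumption, and the function name and data are hypothetical.

```python
def percent_within_one_order(true_log10, pred_log10):
    """Percentage of predictions within one log10 unit of the experimental value."""
    hits = sum(1 for t, p in zip(true_log10, pred_log10) if abs(t - p) <= 1.0)
    return 100.0 * hits / len(true_log10)

# Invented log10 k_cat values for four complexes.
true_values = [0.0, 1.0, 2.0, -1.0]
predicted = [0.5, 2.5, 1.2, -1.9]

p_1mag = percent_within_one_order(true_values, predicted)
print(f"P_1mag = {p_1mag:.1f}%")
```

Here three of the four predictions fall within one order of magnitude, giving 75.0%.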


The ability of the SELFprot-base architecture to generalize across EC classes was evaluated by removing one EC class at a time from the training data set. The seven resulting models were evaluated on the missing EC class, and the results are shown in Figures 5 and 6. The increase in RMSD for k_cat or K_i for the excluded class was always ≤100%, suggesting that the addition of different training tasks had a greater impact on the prediction error than that of the excluded EC classes.
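The leave-one-EC-class-out protocol described here can be sketched as a simple data split. The record layout (a dictionary with an `ec` field holding the full EC number) and the function names are assumptions made for illustration; the actual data loading in SELFprot may differ.

```python
def ec_class(ec_number):
    """Top-level EC class of an EC number, e.g. '3.1.1.1' gives '3'."""
    return ec_number.split(".")[0]

def leave_one_class_out(records):
    """Yield (held_out_class, train, test) for every EC class present."""
    classes = sorted({ec_class(r["ec"]) for r in records})
    for held_out in classes:
        train = [r for r in records if ec_class(r["ec"]) != held_out]
        test = [r for r in records if ec_class(r["ec"]) == held_out]
        yield held_out, train, test

# Hypothetical records; real entries would also carry sequences and ligands.
records = [
    {"ec": "1.1.1.1", "log10_kcat": 1.2},
    {"ec": "2.7.1.1", "log10_kcat": 0.4},
    {"ec": "3.1.1.1", "log10_kcat": -0.3},
    {"ec": "3.4.21.4", "log10_kcat": 2.0},
]

splits = list(leave_one_class_out(records))
for held_out, train, test in splits:
    print(f"EC {held_out}: train on {len(train)}, evaluate on {len(test)}")
```

With all seven top-level EC classes present in the data, the same loop would produce the seven train/test pairs mentioned in the text.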

k_cat Prediction error (RMSD) for excluded enzyme classes in SELFprot models. The RMSD of k_cat predictions for enzyme classes excluded during training are compared. SELFprot-XEC (light purple) represents models trained with one enzyme class removed and evaluated only on the missing class. SELFprot-LoRA (r = 6) (brown) has been trained with data from all EC classes. Lower RMSD indicates better predictive performance.

K_i Prediction error (RMSD) for excluded enzyme classes in SELFprot models. The RMSD of K_i predictions for enzyme classes excluded during training are compared. SELFprot-XEC (light purple) represents models trained with one enzyme class removed and evaluated only on the missing class. SELFprot-LoRA (r = 6) (brown) has been trained with data from all EC classes. Lower RMSD indicates better predictive performance.

The test set data was also clustered with respect to ligand similarity, as shown in Figure 7. To obtain ligand similarity clusters, the ChEBIFormer-35M embeddings were clustered using k-nearest neighbors with a cosine similarity metric. The resulting clusters were then used to plot the distribution of the k_cat and K_i true and predicted values.
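The similarity measure underlying this grouping can be sketched in a few lines. The example below computes cosine similarity between embedding vectors and returns the indices of the k most similar neighbors of a query. It uses 2-dimensional toy vectors in place of 480-dimensional ChEBIFormer-35M embeddings, does not reproduce the full clustering procedure, and its function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k):
    """Indices of the k embeddings most similar to the query."""
    order = sorted(
        range(len(embeddings)),
        key=lambda i: cosine_similarity(query, embeddings[i]),
        reverse=True,
    )
    return order[:k]

# Toy ligand embeddings (mean-pooled vectors in practice, an assumption).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]

neighbors = nearest_neighbors(query, embeddings, k=2)
print(neighbors)
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` is treated as identical in direction to the first embedding, so the two nearest neighbors are indices 0 and 1.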

Distribution of true and predicted k_cat and K_i values across ligand similarity bins. The violin plots show the distribution of log₁₀k_cat (A,B) and log₁₀K_i (C,D) for different ligand clusters. Panels (A,C) display the true values, while panels (B,D) show the predicted values from the SELFprot-base model.

Results and Discussion

Results

The SELFprot models were finetuned using the CatPred database for k_cat and K_m predictions. The CatPred database for benchmarking enzyme kinetic parameter predictions is derived from BRENDA and SABIO-RK databases.^11,49−51 The K_i data set was derived from CatPred-DB and supplemented with some data from BindingDB. SELFprot demonstrated a robust predictive performance across multiple enzyme-ligand parameters.

A two-dimensional (2D) projection of the 480-D test set from the CatPred-DB can be seen in Figure 8. The t-SNE projection shows the ligand embedding from the ChEBIFormer-35M model along with the combined protein–ligand embedding from the SELFprot model and finally the protein embedding from the ESM2–35M model (Figure 9). The protein–ligand and protein embeddings are color coded by their EC class. The protein embeddings show some clustering among proteins in the same EC class, but globally, the EC class is not enough to explain the clustering in the entire test set. The clustering of EC classes in the protein–ligand embeddings is less pronounced due to the potential for many-to-many mapping between proteins and ligands.

t-SNE projection of ligand, protein–ligand, and protein embeddings colored by EC class. The plots show the 2D t-SNE projections of embeddings of the test set. The left plot represents ligand-only embeddings from ChEBIFormer-35M, the middle plot shows joint protein–ligand embeddings from SELFprot-base, and the right plot displays protein-only embeddings from ESM2–35M. Points are colored according to their EC class.

Distribution of UniProt proteins with catalytic activity and existence evidence at transcript or protein or predicted level with active site, visualized through dimensionality reduction. Proteins were initially embedded using the ESM2–35M protein language model. The resulting high-dimensional space was projected into 2D using Python package UMAP and displayed using matplotlib hexbin with default values. Color intensity represents the angle between each UniProt protein embedding and its nearest neighbor in the CatPred data set.

Figures 10–13 provide an overview of the SELFprot-base (Figure 10), SELFprot-LoRA (Figures 11 and 12), and SELFprot-Full (Figure 13) model performances on predicting enzyme kinetics and classification tasks. The performance metrics shown are for inhibition constants (K_i), dissociation constants (K_d), half-maximal inhibitory concentration (IC₅₀), half-maximal effective concentration (EC₅₀), and binary functional residue classification tasks, including a ROC curve and a PR curve. The top left plots show the predicted values of the inhibition constant (K_i) plotted against the true values. The coefficients of determination (R²) are found to be 0.43, 0.44, 0.45, and 0.44 for the SELFprot-base, -LoRA (r = 2), -LoRA (r = 6), and -Full models, respectively, indicating moderate predictive accuracy. The model’s predictions do not completely capture the variance in the experimental data, and the spread around the regression line suggests systematic errors or unaccounted variance. Despite a positive correlation, a significant amount of variance remains unexplained, suggesting limitations in the current data set in representing K_i across protein space. The top middle plots display the predicted dissociation constant (K_d) against the true values. The model achieves an R² of 0.96 for the -base and -LoRA models and a slightly better value of 0.97 for the SELFprot-Full model, indicating a high level of accuracy in predicting K_d. However, a few outliers indicate cases where the model fails to predict accurately, possibly due to insufficient training data for specific sequences. The top right plots illustrate the predicted IC₅₀ values versus the true values, with an R² of 0.99 for all SELFprot models. This high R² value suggests that the model is very effective at predicting the IC₅₀, with minimal deviation from the true values. This performance is likely due to highly similar protein and ligand sequences in the available experimental data sets.
The bottom left plots show the model’s prediction of EC₅₀ values against the true values, with an R² of 0.99 across all models. Similar to IC₅₀, the accuracy of the EC₅₀ predictions is likely due to low diversity in the available experimental data set that is not representative of protein space. The bottom middle plots show the ROC curve, which assesses the binary functional site residue classification capability of the model. The ROC-AUC value ranges from 0.801 for the base model to 0.810 for the SELFprot-Full model, indicating a good overall classification performance on the majority negative class. The bottom right plots present the PR curve, which is particularly useful for imbalanced data sets. The PR-AUC ranges from 0.114 to 0.136, suggesting that while the model can make some correct positive predictions, there are challenges in maintaining high precision and recall simultaneously.

Comparative assessment of predicted and experimental logarithmic K_i, K_d, IC₅₀, and EC₅₀ values from BindingDB and functional site prediction ROC and PR-AUC for the SELFprot-base model.

The performance of the SELFprot model variations was evaluated in predicting enzyme kinetic parameters (k_cat, K_m, and K_i) across different model configurations: SELFprot-base, SELFprot-LoRA (r = 2), SELFprot-LoRA (r = 6), SELFprot-Full, and SELFprot-Ensemble. The key metric used for the model evaluation was the R² value. Figure 14 presents the R² scores for predicting k_cat, K_m, and K_i values across different SELFprot configurations. For k_cat prediction, SELFprot-Ensemble demonstrated the best performance, achieving an R² of 0.540, followed closely by SELFprot-LoRA (r = 2) (R² = 0.539) and SELFprot-Full (R² = 0.538). The performances of SELFprot-LoRA (r = 6) and the SELFprot-base model were similar, with R² values of 0.535 and 0.541, respectively, indicating that while all models performed comparably, the ensemble finetuning approach slightly enhanced predictive capability. For K_m predictions, SELFprot-Ensemble outperformed other configurations with an R² of 0.522. SELFprot-Full and SELFprot-LoRA (r = 2) also exhibited competitive performances, with R² values of 0.510 and 0.503, respectively. SELFprot-LoRA (r = 6) and the SELFprot-base model achieved R² scores of 0.500 and 0.498, respectively, showing modest improvements when incorporating LoRA tuning and full finetuning. Prediction of the K_i values proved to be more challenging for all model variations. The highest R² value for K_i was obtained by SELFprot-LoRA (r = 6) (R² = 0.405), followed by SELFprot-Full (R² = 0.391). SELFprot-base and SELFprot-LoRA (r = 2) achieved R² values of 0.366 and 0.364, respectively. The ensemble SELFprot model had the lowest performance with an R² of 0.108. These results indicate that for K_i, full finetuning and LoRA tuning approaches provided significant improvements over the ensemble model. Figure 15 explores the model performance for predicting k_cat and K_m at different maximum sequence identity cutoffs (≤40%, ≤60%, ≤80%, and ≤99%).
For k_cat, SELFprot-Ensemble consistently outperformed other models at all sequence identity cutoffs, achieving the highest R² of 0.359 at the ≤99% cutoff. SELFprot-LoRA (r = 2) and SELFprot-Full also showed improvements over the baseline SELFprot model at all levels of sequence similarity, indicating the generalizability of the models when trained with LoRA finetuning or full parameter updates. Notably, the SELFprot-base model exhibited lower performance, particularly at lower sequence identity cutoffs, suggesting that incorporating additional finetuning strategies significantly enhances predictive capabilities for distantly related sequences. For K_m, SELFprot-Ensemble again showed the best performance, with an R² of 0.394 at the ≤99% cutoff. SELFprot-Full, SELFprot-LoRA (r = 6), and SELFprot-LoRA (r = 2) achieved comparable performances, particularly at the ≤80% and ≤99% cutoffs, indicating that K_m predictions are more robust to differences in training strategy. Similar to the k_cat predictions, the SELFprot-base model showed lower performance across all sequence identity cutoffs, with notable differences compared to the ensemble model. These results underscore the benefit of ensemble learning in improving the predictive accuracy across a range of sequence identities. Overall, SELFprot-base had reasonable performance on the CatPred data set, as shown in Figure 14. All of the models had similar performance on the functional residue classification task. SELFprot-LoRA (r = 6) resulted in a slightly better overall performance than the base model on regression tasks; however, SELFprot-Full had the best performance on the functional site classification task.
The SELFprot models demonstrate slightly lower accuracy but comparable performance to UniKP⁵² and CatPred¹¹ on the out-of-distribution test set, as shown in Figure 14, even though SELFprot has an order of magnitude fewer model parameters for the protein and ligand embedding as well as fewer trainable parameters from the use of LoRA finetuning. The smaller model size improves computational, memory, and training efficiency compared to other models, as shown in Figure 15.

Comparative performance of SELFprot models on biochemical task finetuning. This bar graph displays R² performance metrics for five configurations of SELFprot models: -base, -LoRA (r = 2), -LoRA (r = 6), -Full, -Ensemble. The models were evaluated across three finetuning tasks: (K_i), (K_m), and (k_cat). Each bar represents the average R² score achieved by the models on the respective tasks based on the CatPred-DB test set.

Performance of SELFprot models on predicting out-of-distribution protein sequences for (k_cat) and (K_m). Panel (a) illustrates the R² values for k_cat and panel (b) for K_m across various sequence identity thresholds (≤40%, ≤60%, ≤80%, and ≤99%) using different SELFprot configurations: -base, -LoRA (r = 2), -LoRA (r = 6), -Full, -Ensemble. The results highlight model robustness against sequence divergence, demonstrating small variations in prediction accuracy with increasing sequence similarity.

Discussion

The smaller ESM2–35M model combined with ChEBIFormer-35M was used as a pretrained model in the process of predicting enzyme kinetic parameters and protein–ligand binding interactions. The results demonstrate that increasing the number of trainable parameters, as in the SELFprot-Full model, does not always correlate with improved model performance, especially when the available data sets are small. Model complexity must be carefully balanced against the available data to avoid overfitting. In the case of enzyme kinetics, smaller models, such as SELFprot-LoRA (r = 6), benefit from a reduction in parameters that mitigates the risk of overfitting while retaining predictive accuracy. Using smaller efficient transformer models offers several computational advantages while maintaining performance comparable to that of larger, more complex models. Based on a comparison to the CatPred model, increasing the size of the pretrained transformer models would not yield any significant improvement without additional experimental data sets for k_cat and K_m. This inherent limitation posed by the size and diversity of the available data was noted in the CatPred study.¹¹ On the out-of-distribution K_m data sets, SELFprot-LoRA (r = 6), the model with fewer trainable parameters, outperformed SELFprot-Full, which had the most trainable parameters, suggesting that larger models tend to overfit on the small training data set. Similarly, SELFprot-Full was the worst overall model for the k_cat out-of-distribution data set, which was even smaller than the equivalent K_m data set. Each SELFprot model exhibited varying degrees of proficiency across different tasks and data sets, underscoring the advantages of employing consensus models^53,54 and a variety of finetuning methods. The SELFprot architecture offers significant flexibility, allowing for rapid adaptation and improvement as new data become available; LoRA weights in particular are beneficial for finetuning at low computational cost.
The competitive performance of SELFprot-LoRA models also underscores the value of parameter-efficient finetuning strategies in low-resource settings. These methods not only reduce computational costs but also help maintain robust generalization. The LoRA approach proved effective in capturing the underlying structure of enzyme kinetics, indicating that targeted modifications to a pretrained model can outperform naive scaling up of model parameters. SELFprot models are well-suited for multitask learning scenarios where data are sparse and incomplete. This highlights the importance of designing models that can efficiently transfer knowledge across related tasks, a key advantage of transformer architectures when applied to biological data. Multitask learning was used to finetune the model because enzyme data sets tend to have missing parameters for some enzyme-substrate complexes.⁵⁵ Using a multitask method was expected to increase the transfer of knowledge between the tasks. There were clear cases where the inclusion of additional tasks improved the overall model predictions, especially for k_cat and K_i. However, when multitask learning is used on noisy data sets or on data sets with imbalanced training sets for each task, the resulting model can encode this bias.⁵⁶ The addition of the functional site classification task significantly improved the K_i prediction task but also reduced the precision of the classification task. The ensemble model provides an approach to combine multiple models to reduce this bias. By leveraging consensus predictions, the SELFprot-Ensemble model achieved more stable and accurate predictions, especially in out-of-distribution scenarios. This points to the potential of ensemble learning to enhance the robustness of predictions in the face of variability and noise in biological data sets. 
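One common way to train on records where only some parameters were measured is to mask the missing labels out of the loss, so that each task learns only from the entries it actually has. The sketch below assumes this masking strategy, a mean-squared-error loss, and equal task weights; none of these choices is stated in the text above, and the function name is hypothetical.

```python
"""Masked multitask loss sketch; the masking scheme and equal task weights are assumptions."""


def masked_multitask_mse(predictions, targets):
    """Average per-task MSE, skipping entries whose target is None (not measured).

    predictions: list of rows, one float per task.
    targets:     list of rows, one float or None per task.
    """
    n_tasks = len(predictions[0])
    task_losses = []
    for t in range(n_tasks):
        # Keep only the records where this task has a measured label.
        errors = [
            (pred[t] - true[t]) ** 2
            for pred, true in zip(predictions, targets)
            if true[t] is not None
        ]
        if errors:  # a task with no labels in the batch contributes nothing
            task_losses.append(sum(errors) / len(errors))
    if not task_losses:
        raise ValueError("no labeled entries in this batch")
    return sum(task_losses) / len(task_losses)


# Columns stand for three hypothetical tasks, e.g. log k_cat, log K_m, log K_i.
preds = [[1.0, 2.0, 0.0], [2.0, 0.0, 1.0]]
labels = [[0.0, None, None], [2.0, 1.0, None]]
print(masked_multitask_mse(preds, labels))  # 0.75
```

In this toy batch the first task has two labels (MSE 0.5), the second has one (MSE 1.0), and the third has none, so the loss is the mean of the two labeled tasks. Averaging per task before averaging across tasks keeps a task with many labels from dominating one with few, which is one simple response to the task imbalance discussed above.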
Future work could explore more sophisticated ensemble techniques to further improve the reliability of enzyme kinetic predictions, such as K_i. The generative capabilities of SELFprot also present exciting opportunities for metabolic modeling and for optimizing single-point mutations in enzyme engineering.^32,57−59 By generating new small molecules or predicting beneficial mutations to protein scaffolds, SELFprot can assist in the design of novel complexes with improved kinetic parameters. Furthermore, the embeddings produced by the final transformer layer, which combines information from both protein sequences and small molecule representations, hold potential for applications beyond enzymatic predictions. These embeddings could be finetuned for tasks such as predicting toxicity, bioavailability, and even metabolic pathways, demonstrating SELFprot's versatility and extensibility to other domains in computational biology and cheminformatics.⁶⁰⁻⁶²

The SELFprot model was able to predict a diverse set of kinetic and binding parameter values independent of ligand similarity clusters, protein sequence similarity, and EC classes. The choice of molecular representation also did not have any significant impact on the accuracy of the predictions. However, the SELFIES/SMILES chemical language model allows flexibility in choosing the molecular encoding that best suits specific needs.⁶³ Additionally, the modular nature of the architecture ensures that the pLM can be upgraded to a larger ESM2 model with very little finetuning cost. The radar plot in Figure 16 shows that SELFprot excels in computational and training efficiency compared to CatPred and UniKP, which suggests that the SELFprot architecture is well-designed for scalability. However, the predictive accuracy of SELFprot, while promising, still leaves room for improvement. In the functional site classification task in particular, the high accuracy relied heavily on predicting the negative class. Future work will address the performance of the SELFprot model on the classification task through the use of a cross-docked data set. This highlights the importance of continued data set expansion and the incorporation of more diverse biochemical contexts. Expanding the training data could help the model better capture the broad spectrum of enzyme-ligand interactions, ultimately improving its predictive capabilities.
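The reliance on the negative class can be illustrated with a toy calculation: when functional residues are rare, a classifier that labels every residue as nonfunctional still reaches high accuracy while identifying no functional sites at all. The residue counts below are invented for illustration and are not taken from the SELFprot evaluation; the function name is hypothetical.

```python
"""Toy illustration of accuracy under class imbalance; the counts are invented."""


def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical protein: 5 functional-site residues out of 100.
labels = [1] * 5 + [0] * 95
always_negative = [0] * 100

accuracy, precision, recall = binary_metrics(labels, always_negative)
print(accuracy, precision, recall)  # 0.95 0.0 0.0
```

Reporting precision and recall alongside accuracy, as in this sketch, makes such behavior visible in imbalanced residue-level classification.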

Figure 16. Excellent, good, fair, and poor coordinate points indicate the general performance of different machine learning approaches used to predict enzyme catalytic parameters.

Conclusions

Our model exhibits performance comparable to current state-of-the-art models with an order of magnitude fewer parameters. The major constraint in predicting enzyme kinetic parameters remains the limited size and diversity of the current data sets; nevertheless, SELFprot's integration of techniques such as LoRA and the SELFIES representation holds promise for ongoing rapid improvements. Future work on a strategically curated computational data set of docking scores for large variant libraries may extend SELFprot's predictive capabilities to better functional residue identification and substrate specificity for some enzyme classes. Future work also includes incorporating SELFprot as a tool in LLM multiagent workflows to make efficient predictions of protein–ligand interactions at inference time.

Data Availability Statement

The trained model weights and architecture for the SELFprot model can be found at 10.5281/zenodo.14266071 and https://github.com/marltanwilson/SELFprot, respectively. The CatPred-DB data set can be found at https://github.com/maranasgroup/CatPred-DB/tree/main/datasets/splits.

Author Contributions

M.W., A.W., and T.C. contributed equally to this work.

The authors declare no competing financial interest.

References

Jacob L.; Vert J. P. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24, 2149–2156. 10.1093/bioinformatics/btn409.
Li F.; Yuan L.; Lu H.; Li G.; Chen Y.; Engqvist M. K.; Kerkhoven E. J.; Nielsen J. Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction. Nat. Catal. 2022, 5 (5), 662–672. 10.1038/s41929-022-00798-z.
Vaswani A.; Shazeer N.; Parmar N.; Uszkoreit J.; Jones L.; Gomez A. N.; Kaiser Ł.; Polosukhin I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
Bagal V.; Aggarwal R.; Vinod P. V.; Priyakumar U. D. MolGPT: molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 2021, 62, 2064. 10.1021/acs.jcim.1c00600.
Shrivastava A. D.; Swainston N.; Samanta S.; Roberts I.; Muelas M. W.; Kell D. B. MassGenie: A transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules 2021, 11, 1793. 10.3390/biom11121793.
Tysinger E. P.; Rai B. K.; Sinitskiy A. V. Can We Quickly Learn to “Translate” Bioactive Molecules with Transformer Models? J. Chem. Inf. Model. 2023, 63, 1734–1744. 10.1021/acs.jcim.2c01618.
Jumper J.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583. 10.1038/s41586-021-03819-2.
Baek M.; et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, 871–876. 10.1126/science.abj8754.
Baek M.; Mchugh R.; Anishchenko I.; Jiang H.; Baker D.; Dimaio F. Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat. Methods 2024, 21, 117–121. 10.1038/s41592-023-02086-5.
Lin Z.; Akin H.; Rao R.; Hie B.; Zhu Z.; Lu W.; Smetanin N.; Verkuil R.; Kabeli O.; Shmueli Y.; dos Santos Costa A.; Fazel-Zarandi M.; Sercu T.; Candido S.; Rives A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130. 10.1126/science.ade2574.
Maranas C.; Boorla V. S. CatPred: A Comprehensive Framework for Deep Learning in Vitro Enzyme Kinetic Parameters kcat, Km and Ki, 2024.
Hie B.; Candido S.; Lin Z.; Kabeli O.; Rao R.; Smetanin N.; Sercu T.; Rives A. A high-level programming language for generative protein design. BioRxiv 2022, 10.1101/2022.12.21.521526.
Krenn M.; Häse F.; Nigam A. K.; Friederich P.; Aspuru-Guzik A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn. Sci. Technol. 2020, 1, 045024. 10.1088/2632-2153/aba947.
Yüksel A.; Ulusoy E.; Ünlü A.; Doğan T. SELFormer: molecular representation learning via SELFIES language models. Mach. Learn. Sci. Technol. 2023, 4, 025035. 10.1088/2632-2153/acdb30.
Piao S.; Choi J.; Seo S.; Park S. SELF-EdiT: Structure-constrained molecular optimization using SELFIES editing transformer. Appl. Intell. 2023, 53, 25868–25880. 10.1007/s10489-023-04915-8.
Skinnider M. A. Invalid SMILES are beneficial rather than detrimental to chemical language models. Nat. Mach. Intell. 2024, 6 (6), 437–448. 10.1038/s42256-024-00821-x.
Cheng A. H.; Cai A.; Miret S.; Malkomes G.; Phielipp M.; Aspuru-Guzik A. Group SELFIES: a robust fragment-based molecular string representation. Digital Discovery 2023, 2, 748–758. 10.1039/D3DD00012E.
Nigam A.; Pollice R.; Krenn M.; Gomes G. D. P.; Aspuru-Guzik A. Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. Chem. Sci. 2021, 12, 7079–7090. 10.1039/D1SC00231G.
Keles F. D.; Wijewardena P. M.; Hegde C.; Agrawal S.; Orabona F. On The Computational Complexity of Self-Attention. 2023, https://proceedings.mlr.press/v201/duman-keles23a.html (accessed 2024-03-12).
Li Z.; Wallace E.; Shen S.; Lin K.; Keutzer K.; Klein D.; Gonzalez J. E. Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. 2020, https://proceedings.mlr.press/v119/li20m.html (accessed 2024-03-12).
Zeng S.; Wang D.; Jiang L.; Xu D. Parameter-Efficient Fine-Tuning on Large Protein Language Models Improves Signal Peptide Prediction. Genome Res. 2024, 34, 1445–1454. 10.1101/2023.11.04.565642.
Sledzieski S.; Kshirsagar M.; Baek M.; Dodhia R.; Lavista Ferres J.; Berger B. Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning. Proc. Natl. Acad. Sci. U. S. A. 2024, 121, e2405840121. 10.1073/pnas.2405840121.
Geffen Y.; Ofran Y.; Unger R. DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. Bioinformatics 2022, 38, ii95–ii98. 10.1093/bioinformatics/btac474.
Wang Q.; Wang B.; Xu Z.; Wu J.; Zhao P.; Li Z.; Wang S.; Huang J.; Cui S. PSSM-distil: Protein secondary structure prediction (PSSP) on low-quality PSSM by knowledge distillation with contrastive learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2021; Vol. 35, pp 617–625.
Vu M. H.; Akbar R.; Robert P. A.; Swiatczak B.; Sandve G. K.; Greiff V.; Trygve D.; Haug T. Adapt-and-distill: Developing small fast and effective pretrained language models for domains. arXiv preprint arXiv:2106.13474. https://arxiv.org/abs/2106.13474.
Du H. G.; Hu Y. SqueezeBioBERT: BioBERT Distillation for Healthcare Natural Language Processing. Lect. Notes Comput. Sci. 2020, 12575 (LNCS), 193–201. 10.1007/978-3-030-66046-8_16.
Hu E. J.; Shen Y.; Wallis P.; Allen-Zhu Z.; Li Y.; Wang S.; Wang L.; Chen W. LoRA: Low-Rank Adaptation of Large Language Models. 2022, https://www.microsoft.com/en-us/research/publication/lora-low-rank-adaptation-of-large-language-models/ (accessed 2024-04-05).
Davidi D.; Milo R. Lessons on enzyme kinetics from quantitative proteomics. Curr. Opin. Biotechnol. 2017, 46, 81–89. 10.1016/j.copbio.2017.02.007.
Kang H.; Goo S.; Lee H.; Chae J. W.; Yun H. Y.; Jung S. Fine-tuning of BERT Model to Accurately Predict Drug–Target Interactions. Pharmaceutics 2022, 14, 1710. 10.3390/pharmaceutics14081710.
Alberga D.; Lamanna G.; Graziano G.; Delre P.; Lomuscio M. C.; Corriero N.; Ligresti A.; Siliqi D.; Saviano M.; Contino M.; Stefanachi A.; Mangiatordi G. F. DeLA-DrugSelf: Empowering multi-objective de novo design through SELFIES molecular representation. Comput. Biol. Med. 2024, 175, 108486. 10.1016/j.compbiomed.2024.108486.
Kotkondawar R. R.; Sutar S. R.; Kiwelekar A. W.; Kadam V. J. Integrating Transformer-based Language Model for Drug Discovery. Proceedings of the 18th INDIAcom; 2024 11th International Conference on Computing for Sustainable Global Development, INDIACom 2024; IEEE, 2024; pp 1096–1101.
Mao J.; Wang J.; Zeb A.; Cho K. H.; Jin H.; Kim J.; Lee O.; Wang Y.; No K. T. Transformer-Based Molecular Generative Model for Antiviral Drug Design. J. Chem. Inf. Model. 2024, 64, 2733–2745. 10.1021/acs.jcim.3c00536.
Honda S.; Shi S.; Ueda H. R. SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. arXiv preprint arXiv:1911.04738, 2019. 10.48550/arXiv.1911.04738.
Francoeur P. G.; Masuda T.; Sunseri J.; Jia A.; Iovanisci R. B.; Snyder I.; Koes D. R. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J. Chem. Inf. Model. 2020, 60, 4200–4215. 10.1021/acs.jcim.0c00411.
Smith D.; Helmy M.; Lindley N.; Selvarajoo K. The transformation of our food system using cellular agriculture: What lies ahead and who will lead it? Trends Food Sci. Technol. 2022, 127, 368–376. 10.1016/j.tifs.2022.04.015.
Bhandari S.; Poudel D. K.; Marahatha R.; Dawadi S.; Khadayat K.; Phuyal S.; Shrestha S.; Gaire S.; Basnet K.; Khadka U.; Parajuli N. Microbial Enzymes Used in Bioremediation. J. Chem. 2021, 2021, 1–17. 10.1155/2021/8849512.
Peixoto R.; Vermelho A.; Rosado A. S. Petroleum-degrading enzymes: bioremediation and new prospects. Enzym. Res. 2011, 2011, 1–7. 10.4061/2011/475193.
Sutherland T. D.; Horne I.; Weir K. M.; Coppin C. W.; Williams M. R.; Selleck M.; Russell R. J.; Oakeshott J. G. Enzymatic bioremediation: from enzyme discovery to applications. Clin. Exp. Pharmacol. Physiol. 2004, 31, 817–821. 10.1111/j.1440-1681.2004.04088.x.
Mousavi S. M.; Behbudi G.; Hashemi S. A.; Babapoor A.; Chiang W. H.; Ramakrishna S.; Rahman M. M.; Lai C. W.; Gholami A.; Omidifar N.; et al. Recent Progress in Electrochemical Detection of Human Papillomavirus (HPV) via Graphene-Based Nanosensors. Biochem. Res. Int. 2021, 2021, 5599204. 10.1155/2021/6673483.
Ruggaber T. P.; Talley J. W. Enhancing bioremediation with enzymatic processes: A review. Pract. Period. Hazard. Toxic, Radioact. Waste Manag. 2006, 10, 73–85. 10.1061/(asce)1090-025x(2006)10:2(73).
Karigar C. S.; Rao S. S. Role of microbial enzymes in the bioremediation of pollutants: a review. Enzym. Res. 2011, 2011, 1–11. 10.4061/2011/805187.
Bekiaris P. S.; Klamt S. Automatic construction of metabolic models with enzyme constraints. BMC Bioinf. 2020, 21, 19. 10.1186/s12859-019-3329-9.
Zhao J.; Cao Y.; Zhang L. Exploring the computational methods for protein-ligand binding site prediction. Comput. Struct. Biotechnol. J. 2020, 18, 417–426. 10.1016/j.csbj.2020.02.008.
Du X.; Li Y.; Xia Y.-L.; Ai S.-M.; Liang J.; Sang P.; Ji X.-L.; Liu S.-Q. Insights into Protein–Ligand Interactions: Mechanisms, Models, and Methods. Int. J. Mol. Sci. 2016, 17, 144. 10.3390/ijms17020144.
Liu T.; Lin Y.; Wen X.; Jorissen R. N.; Gilson M. K. BindingDB: A web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007, 35, D198–D201. 10.1093/nar/gkl999.
Lv L.; Lin Z.; Li H.; Liu Y.; Cui J.; Chen C. Y.-C.; Yuan L.; Tian Y. ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing. arXiv e-prints, arXiv:2402.16445, 2024.
Jiang H.; Wang J.; Yang Z.; Chen C.; Yao G.; Bao S.; Wan X.; Wang L. MPEK: A Multi-Task Learning Based on Pre-trained Language Model for Predicting Enzymatic Reaction Kinetic Parameters; Research Square, 2024.
Chang Y.; Wang X.; Wang J.; Wu Y.; Yang L.; Zhu K.; Chen H.; Yi X.; Wang C.; Wang Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. 10.1145/3641289.
Wittig U.; Rey M.; Weidemann A.; Kania R.; Müller W. SABIO-RK: an updated resource for manually curated biochemical reaction kinetics. Nucleic Acids Res. 2018, 46, D656–D660. 10.1093/nar/gkx1065.
Wittig U.; Kania R.; Golebiewski M.; Rey M.; Shi L.; Jong L.; Algaa E.; Weidemann A.; Sauer-Danzwith H.; Mir S.; Krebs O.; Bittkowski M.; Wetsch E.; Rojas I.; Müller W. SABIO-RK—database for biochemical reaction kinetics. Nucleic Acids Res. 2012, 40, D790–D796. 10.1093/nar/gkr1046.
Schomburg I.; Jeske L.; Ulbrich M.; Placzek S.; Chang A.; Schomburg D. The BRENDA enzyme information system–from a database to an expert system. J. Biotechnol. 2017, 261, 194–206. 10.1016/j.jbiotec.2017.04.020.
Yu H.; Deng H.; He J.; Keasling J. D.; Luo X. UniKP: a unified framework for the prediction of enzyme kinetic parameters. Nat. Commun. 2023, 14 (1), 8211–8213. 10.1038/s41467-023-44113-1.
Lu J.; Lu D.; Fu Z.; Zheng M.; Luo X. Machine learning-based modeling of drug toxicity. Methods Mol. Biol. 2018, 1754, 247–264. 10.1007/978-1-4939-7717-8_15.
Ngoc T.; Tran T.; Felfernig A.; Viet; Le M.; Felfernig A.; Le V. M. User Modeling and User-Adapted Interaction an Overview of Consensus Models for Group Decision-Making and Group Recommender Systems; Research Square, 2024.
Kroll A.; Ranjan S.; Lercher M. J. A multimodal Transformer Network for protein-small molecule interactions enhances predictions of kinase inhibition and enzyme-substrate relationships. PLoS Comput. Biol. 2024, 20, e1012100. 10.1371/journal.pcbi.1012100.
Zhang Y.; Qiang Y. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586. 10.1109/TKDE.2021.3070203.
Barghout R.; Xu Z.; Betala S.; Mahadevan R. Advances in generative modeling methods and datasets to design novel enzymes for renewable chemicals and fuels. Curr. Opin. Biotechnol. 2023, 84, 103007. 10.1016/j.copbio.2023.103007.
Ingraham J.; Garg V. Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst. 2019, 32, 15820–15831.
Chen H.; Bajorath J. Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model. J. Cheminf. 2024, 16, 55. 10.1186/s13321-024-00852-x.
Monteiro N.; Oliveira J.; Arrais J. P. DTITR: End-to-end drug–target binding affinity prediction with transformers. Comput. Biol. Med. 2022, 147, 105772. 10.1016/j.compbiomed.2022.105772.
Ma J.; Zhao Z.; Li T.; Liu Y.; Ma J.; Zhang R. GraphsformerCPI: Graph Transformer for Compound–Protein Interaction Prediction. Interdiscipl. Sci. Comput. Life Sci. 2024, 16, 361–377. 10.1007/s12539-024-00609-y.
Zeng X.; Chen W.; Lei B. CAT-DTI: cross-attention and Transformer network with domain adaptation for drug-target interaction prediction. BMC Bioinf. 2024, 25, 141. 10.1186/s12859-024-05753-2.
Rajan K.; Steinbeck C.; Zielesny A. Performance of chemical structure string representations for chemical image recognition using transformers. Digital Discovery 2022, 1, 84–90. 10.1039/d1dd00013f.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[ref1] Jacob L.; Vert J. P. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24, 2149–2156. 10.1093/bioinformatics/btn409. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref2] Li F.; Yuan L.; Lu H.; Li G.; Chen Y.; Engqvist M. K.; Kerkhoven E. J.; Nielsen J. Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction. Nat. Catal. 2022, 5 (5), 662–672. 10.1038/s41929-022-00798-z. [DOI] [Google Scholar]

[ref3] Vaswani A.; Brain G.; Shazeer N.; Parmar N.; Uszkoreit J.; Jones L.; Gomez A. N.; Łukasz K.; Polosukhin I.. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30,6000-6010. [Google Scholar]

[ref4] Bagal V.; Aggarwal R.; Vinod P. V.; Priyakumar U. D. MolGPT: molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 2021, 62, 2064. 10.1021/acs.jcim.1c00600. [DOI] [PubMed] [Google Scholar]

[ref5] Shrivastava A. D.; Swainston N.; Samanta S.; Roberts I.; Muelas M. W.; Kell D. B. Massgenie A transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules 2021, 11, 1793. 10.3390/biom11121793. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref6] Tysinger E. P.; Rai B. K.; Sinitskiy A. V. Can We Quickly Learn to “Translate” Bioactive Molecules with Transformer Models?. J. Chem. Inf. Model. 2023, 63, 1734–1744. 10.1021/acs.jcim.2c01618. [DOI] [PubMed] [Google Scholar]

[ref7] Jumper J.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583. 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref8] Baek M.; et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, 871–876. 10.1126/science.abj8754. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref9] Baek M.; Mchugh R.; Anishchenko I.; Jiang H.; Baker D.; Dimaio F. Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat. Methods 2024, 21, 117–121. 10.1038/s41592-023-02086-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref10] Lin Z.; Akin H.; Rao R.; Hie B.; Zhu Z.; Lu W.; Smetanin N.; Verkuil R.; Kabeli O.; Shmueli Y.; dos Santos Costa A.; Fazel-Zarandi M.; Sercu T.; Candido S.; Rives A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130. 10.1126/science.ade2574. [DOI] [PubMed] [Google Scholar]

[ref11] Maranas C.; Boorla V. S.. CatPred: A Comprehensive Framework for Deep Learning in Vitro Enzyme Kinetic Parameters Kcat; Km and Ki, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref12] Hie B.; Candido S.; Lin Z.; Kabeli O.; Rao R.; Smetanin N.; Sercu T.; Rives A.; Candido S.; Lin Z.; Kabeli O.; Rao R.; Smetanin N.; Sercu T.; Rives A. A high-level programming language for generative protein design. BioRxiv 2022, 10.1101/2022.12.21.521526. [DOI] [Google Scholar]

[ref13] Krenn M.; Häse F.; Nigam A. K.; Friederich P.; Aspuru-Guzik A. Self-referencing embedded strings (SELFIES): A 100 representation. Mach. Learn. Sci. Technol. 2020, 1, 045024. 10.1088/2632-2153/aba947. [DOI] [Google Scholar]

[ref14] Yüksel A.; Ulusoy E.; Ünlü A.; Doğan T. SELFormer: molecular representation learning via SELFIES language models. Mach. Learn. Sci. Technol. 2023, 4, 025035. 10.1088/2632-2153/acdb30. [DOI] [Google Scholar]

[ref15] Piao S.; Choi J.; Seo S.; Park S. SELF-EdiT: Structure-constrained molecular optimization using SELFIES editing transformer. Appl. Intell. 2023, 53, 25868–25880. 10.1007/s10489-023-04915-8. [DOI] [Google Scholar]

[ref16] Skinnider M. A. Invalid SMILES are beneficial rather than detrimental to chemical language models. Nat. Mach. Intell. 2024, 6 (6), 437–448. 10.1038/s42256-024-00821-x. [DOI] [Google Scholar]

[ref17] Cheng A. H.; Cai A.; Miret S.; Malkomes G.; Phielipp M.; Aspuru-Guzik A. Group SELFIES: a robust fragment-based molecular string representation. Digital Discovery 2023, 2, 748–758. 10.1039/D3DD00012E. [DOI] [Google Scholar]

[ref18] Nigam A.; Pollice R.; Krenn M.; Gomes G. D. P.; Aspuru-Guzik A. Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. Chem. Sci. 2021, 12, 7079–7090. 10.1039/D1SC00231G. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref19] Keles F. D.; Wijewardena P. M.; Hegde C.; Agrawal S.; Orabona F.. On The Computational Complexity of Self-Attention. 2023, https://proceedings.mlr.press/v201/duman-keles23a.html. (accessed 2024-03-12).

[ref20] Li Z.; Wallace E.; Shen S.; Lin K.; Keutzer K.; Klein D.; Gonzalez J. E.. Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. 2020, https://proceedings.mlr.press/v119/li20m.html. (accessed 2024-03-12)

[ref21] Zeng S.; Wang D.; Jiang L.; Xu D. Parameter-Efficient Fine-Tuning on Large Protein Language Models Improves Signal Peptide Prediction. Genome Res. 2024, 34, 1445–1454. 10.1101/2023.11.04.565642. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] Sledzieski S.; Kshirsagar M.; Baek M.; Dodhia R.; Lavista Ferres J.; Berger B. Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning. Proc. Natl. Acad. Sci. U. S. A 2024, 121, e2405840121 10.1073/pnas.2405840121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref23] Geffen Y.; Ofran Y.; Unger R. DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. Bioinformatics 2022, 38, ii95–ii98. 10.1093/bioinformatics/btac474. [DOI] [PubMed] [Google Scholar]

[ref24] Wang Q.; Wang B.; Xu Z.; Wu J.; Zhao P.; Li Z.; Wang S.; Huang J.; Cui S.. PSSM-distil: Protein secondary structure prediction (PSSP) on low-quality PSSM by knowledge distillation with contrastive learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2021; Vol. 35, pp 617–625.

[ref25] Vu M. H.; Akbar R.; Robert P. A.; Swiatczak B.; Sandve G. K.; Greiff V.; Trygve D.; Haug T.. Adapt-and-distill: Developing small fast and effective pretrained language models for domains. arXiv preprint arXiv:2106.13474. https://arxiv.org/abs/2106.13474.

[ref26] Du H. G.; Hu Y. SqueezeBioBERT: BioBERT Distillation for Healthcare Natural Language Processing. Lect. Notes Comput. Sci. 2020, 12575 (LNCS), 193–201. 10.1007/978-3-030-66046-8_16. [DOI] [Google Scholar]

[ref27] Hu E. J.; Shen Y.; Wallis P.; Allen-Zhu Z.; Li Y.; Wang S.; Wang L.; Chen W.. LoRA: Low-Rank Adaptation of Large Language Models. 2022, https://www.microsoft.com/en-us/research/publication/lora-low-rank-adaptation-of-large-language-models/ .(accessed 2024-04-05)

[ref28] Davidi D.; Milo R. Lessons on enzyme kinetics from quantitative proteomics. Curr. Opin. Biotechnol. 2017, 46, 81–89. 10.1016/j.copbio.2017.02.007. [DOI] [PubMed] [Google Scholar]

[ref29] Kang H.; Goo S.; Lee H.; Chae J. W.; Yun H. Y.; Jung S. Fine-tuning of BERT Model to Accurately Predict Drug–Target Interactions. Pharmaceutics 2022, 14, 1710. 10.3390/pharmaceutics14081710. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref30] Alberga D.; Lamanna G.; Graziano G.; Delre P.; Lomuscio M. C.; Corriero N.; Ligresti A.; Siliqi D.; Saviano M.; Contino M.; Stefanachi A.; Mangiatordi G. F. DeLA-DrugSelf: Empowering multi-objective de novo design through SELFIES molecular representation. Comput. Biol. Med. 2024, 175, 108486. 10.1016/j.compbiomed.2024.108486. [DOI] [PubMed] [Google Scholar]

[ref31] Kotkondawar R. R.; Sutar S. R.; Kiwelekar A. W.; Kadam V. J.. Integrating Transformer-based Language Model for Drug Discovery. Proceedings of the 18th INDIAcom; 2024 11th International Conference on Computing for Sustainable Global Development, INDIACom 2024; IEEE, 2024; pp 1096–1101.

[ref32] Mao J.; Wang J.; Zeb A.; Cho K. H.; Jin H.; Kim J.; Lee O.; Wang Y.; No K. T. Transformer-Based Molecular Generative Model for Antiviral Drug Design. J. Chem. Inf. Model. 2024, 64, 2733–2745. 10.1021/acs.jcim.3c00536. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref33] Honda S.; Shi S.; Ueda H. R.. SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. arXiv preprint arXiv:1911.04738, 2019. 10.48550/arXiv.1911.04738 [DOI] [Google Scholar]

[ref34] Francoeur P. G.; Masuda T.; Sunseri J.; Jia A.; Iovanisci R. B.; Snyder I.; Koes D. R. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J. Chem. Inf. Model. 2020, 60, 4200–4215. 10.1021/acs.jcim.0c00411. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref35] Smith D.; Helmy M.; Lindley N.; Selvarajoo K. The transformation of our food system using cellular agriculture: What lies ahead and who will lead it?. Trends Food Sci. Technol. 2022, 127, 368–376. 10.1016/j.tifs.2022.04.015. [DOI] [Google Scholar]

[ref36] Bhandari S.; Poudel D. K.; Marahatha R.; Dawadi S.; Khadayat K.; Phuyal S.; Shrestha S.; Gaire S.; Basnet K.; Khadka U.; Parajuli N. Microbial Enzymes Used in Bioremediation. J. Chem. 2021, 2021, 1–17. 10.1155/2021/8849512. [DOI] [Google Scholar]

[ref37] Peixoto R.; Vermelho A.; Rosado A. S. Petroleum-degrading enzymes: bioremediation and new prospects. Enzym. Res. 2011, 2011, 1–7. 10.4061/2011/475193. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref38] Sutherland T. D.; Horne I.; Weir K. M.; Coppin C. W.; Williams M. R.; Selleck M.; Russell R. J.; Oakeshott J. G. Enzymatic bioremediation: from enzyme discovery to applications. Clin. Exp. Pharmacol. Physiol. 2004, 31, 817–821. 10.1111/j.1440-1681.2004.04088.x. [DOI] [PubMed] [Google Scholar]

[ref39] Mousavi S. M.; Behbudi G.; Hashemi S. A.; Babapoor A.; Chiang W. H.; Ramakrishna S.; Rahman M. M.; Lai C. W.; Gholami A.; Omidifar N.; et al. Recent Progress in Electrochemical Detection of Human Papillomavirus (HPV) via Graphene-Based Nanosensors. Biochem. Res. Int. 2021, 2021, 5599204. 10.1155/2021/6673483.34401207 [DOI] [Google Scholar]

[ref40] Ruggaber T. P.; Talley J. W. Enhancing bioremediation with enzymatic processes: A review. Pract. Period. Hazard. Toxic, Radioact. Waste Manag. 2006, 10, 73–85. 10.1061/(asce)1090-025x(2006)10:2(73). [DOI] [Google Scholar]

[ref41] Karigar C. S.; Rao S. S. Role of microbial enzymes in the bioremediation of pollutants: a review. Enzym. Res. 2011, 2011, 1–11. 10.4061/2011/805187. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref42] Bekiaris P. S.; Klamt S. Automatic construction of metabolic models with enzyme constraints. BMC Bioinf. 2020, 21, 19. 10.1186/s12859-019-3329-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref43] Zhao J.; Cao Y.; Zhang L. Exploring the computational methods for protein-ligand binding site prediction. Comput. Struct. Biotechnol. J. 2020, 18, 417–426. 10.1016/j.csbj.2020.02.008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref44] Du X.; Li Y.; Xia Y.-L.; Ai S.-M.; Liang J.; Sang P.; Ji X.-L.; Liu S.-Q. Insights into Protein–Ligand Interactions: Mechanisms, Models, and Methods. Int. J. Mol. Sci. 2016, 17, 144. 10.3390/ijms17020144.

[ref45] Liu T.; Lin Y.; Wen X.; Jorissen R. N.; Gilson M. K. BindingDB: A web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007, 35, D198–D201. 10.1093/nar/gkl999.

[ref46] Lv L.; Lin Z.; Li H.; Liu Y.; Cui J.; Chen C. Y.-C.; Yuan L.; Tian Y. ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing. arXiv e-prints, arXiv:2402.16445, 2024.

[ref47] Jiang H.; Wang J.; Yang Z.; Chen C.; Yao G.; Bao S.; Wan X.; Wang L. MPEK: A Multi-Task Learning Based on Pre-trained Language Model for Predicting Enzymatic Reaction Kinetic Parameters; Research Square, 2024.

[ref48] Chang Y.; Wang X.; Wang J.; Wu Y.; Yang L.; Zhu K.; Chen H.; Yi X.; Wang C.; Wang Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. 10.1145/3641289.

[ref49] Wittig U.; Rey M.; Weidemann A.; Kania R.; Müller W. SABIO-RK: an updated resource for manually curated biochemical reaction kinetics. Nucleic Acids Res. 2018, 46, D656–D660. 10.1093/nar/gkx1065.

[ref50] Wittig U.; Kania R.; Golebiewski M.; Rey M.; Shi L.; Jong L.; Algaa E.; Weidemann A.; Sauer-Danzwith H.; Mir S.; Krebs O.; Bittkowski M.; Wetsch E.; Rojas I.; Müller W. SABIO-RK—database for biochemical reaction kinetics. Nucleic Acids Res. 2012, 40, D790–D796. 10.1093/nar/gkr1046.

[ref51] Schomburg I.; Jeske L.; Ulbrich M.; Placzek S.; Chang A.; Schomburg D. The BRENDA enzyme information system–from a database to an expert system. J. Biotechnol. 2017, 261, 194–206. 10.1016/j.jbiotec.2017.04.020.

[ref52] Yu H.; Deng H.; He J.; Keasling J. D.; Luo X. UniKP: a unified framework for the prediction of enzyme kinetic parameters. Nat. Commun. 2023, 14 (1), 8211–8213. 10.1038/s41467-023-44113-1.

[ref53] Lu J.; Lu D.; Fu Z.; Zheng M.; Luo X. Machine learning-based modeling of drug toxicity. Methods Mol. Biol. 2018, 1754, 247–264. 10.1007/978-1-4939-7717-8_15.

[ref54] Ngoc T.; Tran T.; Felfernig A.; Viet; Le M.; Felfernig A.; Le V. M. User Modeling and User-Adapted Interaction an Overview of Consensus Models for Group Decision-Making and Group Recommender Systems; Research Square, 2024.

[ref55] Kroll A.; Ranjan S.; Lercher M. J. A multimodal Transformer Network for protein-small molecule interactions enhances predictions of kinase inhibition and enzyme-substrate relationships. PLoS Comput. Biol. 2024, 20, e1012100. 10.1371/journal.pcbi.1012100.

[ref56] Zhang Y.; Qiang Y. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586. 10.1109/TKDE.2021.3070203.

[ref57] Barghout R.; Xu Z.; Betala S.; Mahadevan R. Advances in generative modeling methods and datasets to design novel enzymes for renewable chemicals and fuels. Curr. Opin. Biotechnol. 2023, 84, 103007. 10.1016/j.copbio.2023.103007.

[ref58] Ingraham J.; Garg V. Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst. 2019, 32, 15820–15831.

[ref59] Chen H.; Bajorath J. Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model. J. Cheminf. 2024, 16, 55. 10.1186/s13321-024-00852-x.

[ref60] Monteiro N.; Oliveira J.; Arrais J. P. DTITR: End-to-end drug–target binding affinity prediction with transformers. Comput. Biol. Med. 2022, 147, 105772. 10.1016/j.compbiomed.2022.105772.

[ref61] Ma J.; Zhao Z.; Li T.; Liu Y.; Ma J.; Zhang R. GraphsformerCPI: Graph Transformer for Compound–Protein Interaction Prediction. Interdiscipl. Sci. Comput. Life Sci. 2024, 16, 361–377. 10.1007/s12539-024-00609-y.

[ref62] Zeng X.; Chen W.; Lei B. CAT-DTI: cross-attention and Transformer network with domain adaptation for drug-target interaction prediction. BMC Bioinf. 2024, 25, 141. 10.1186/s12859-024-05753-2.

[ref63] Rajan K.; Steinbeck C.; Zielesny A. Performance of chemical structure string representations for chemical image recognition using transformers. Digital Discovery 2022, 1, 84–90. 10.1039/d1dd00013f.

Table 1. Impact of Excluding Tasks on Prediction Errors for Protein–Ligand Interaction Parameters in the SELFprot-Base Multitask Learning Model^a.

Table 2. Percentage of Protein–Ligand Complexes with Predicted Values Within One Order of Magnitude of the Experimental Value of k_cat, K_m Predicted by SELFprot, SELFprot-LoRa, SELFprot-LoRa (r = 6), SELFprot-Full, and SELFprot-Ensemble.