2025 Mar 24;65(7):3761–3770. doi: 10.1021/acs.jcim.4c02291

A Deep Retrieval-Enhanced Meta-Learning Framework for Enzyme Optimum pH Prediction

Liang Zhang †, Kuan Luo, Ziyi Zhou §, Yuanxi Yu, Fan Jiang, Banghao Wu, Mingchen Li †,⊥,*, Liang Hong †,§,#,∇,‡,*
PMCID: PMC12005191  PMID: 40127128

Abstract

pH (potential of hydrogen) influences enzyme function. Measuring or predicting the optimal pH (pHopt) at which an enzyme exhibits maximal catalytic activity is crucial for enzyme design and application. The rapid development of enzyme mining and de novo design has produced a large number of new enzymes, making it impractical to measure all of their pHopt values in the wet laboratory. Consequently, in silico approaches such as machine learning and deep learning models, which predict pHopt at minimal cost, have attracted considerable interest. This work presents Venus-DREAM, an enzyme pHopt prediction model based on the kNN algorithm and few-shot learning, which achieves state-of-the-art accuracy in pHopt prediction. Venus-DREAM treats the pHopt prediction of an enzyme as a few-shot learning task: it learns from the k closest labeled enzymes to predict the pHopt of the target enzyme. The value of k is set to the optimal k of the kNN regression algorithm, and the closeness of two enzymes is measured by the cosine similarity of their mean-pooled embeddings obtained from protein language models (PLMs). The few-shot learner is based on the Reptile algorithm: it first adapts to the k nearest labeled enzymes to create a specialized model for the target enzyme and then predicts its pHopt. This efficient method enables high-throughput virtual exploration of protein space, facilitating the identification of sequences with desired pHopt ranges. Moreover, our method can be easily adapted to other protein function prediction tasks.

Introduction

Enzymes are essential biocatalysts in living systems, with their catalytic efficiency being strongly influenced by environmental pH.1 The relationship between pH and enzyme activity is particularly critical in biotechnology, where enzymes must maintain high activity under specific process conditions.2–5 The pH-dependent behavior of enzymes is governed by complex molecular mechanisms involving both catalytic residues and structural elements.6–8 At the molecular level, changes in pH affect the protonation states of amino acid residues, particularly those in the active site, which can directly impact substrate binding and catalysis.9–12 Additionally, pH changes can alter protein surface charge distribution and internal electrostatic interactions, potentially leading to conformational changes that affect enzyme stability and activity.13–16 These molecular adaptations have evolved differently across enzyme families, resulting in diverse pH optima ranging from highly acidic (pH < 3) to strongly alkaline (pH > 10) conditions.17 Understanding and predicting enzyme pH dependence is crucial for applications ranging from food processing to pharmaceutical manufacturing, where suboptimal pH conditions can significantly reduce enzyme performance.18,19 However, traditional experimental methods for determining optimal pH (pHopt) are time-consuming and resource-intensive, creating a bottleneck in enzyme characterization and engineering.20,21

Computational methods have emerged as promising alternatives for pHopt prediction. Early approaches focused on discriminating between acidic and alkaline enzymes using simple sequence-derived features such as amino acid composition and physicochemical properties.22–24 More recent methods have attempted direct pHopt prediction using traditional machine learning approaches such as support vector machines and random forests.25 The emergence of protein language models (PLMs) has revolutionized sequence-based prediction tasks by learning contextual representations from millions of protein sequences.26 By capturing both local and global sequence patterns, these models have enabled more sophisticated pH prediction approaches. For instance, EpHod leverages a semisupervised PLM pretrained on sequences labeled with their source organisms’ optimal growth pH,27 while Zaretckii et al. further enhanced PLM-based prediction through advanced machine learning techniques.28

While PLMs have improved prediction accuracy, recent advances in retrieval-augmented methods suggest potential for further enhancement.29–31 These methods dynamically leverage similar sequences from large databases to provide task-specific context. However, current pH prediction approaches have yet to combine the advantages of PLMs with retrieval-enhanced learning and meta-learning techniques. Such integration could enable rapid adaptation to new enzyme families through transfer learning and few-shot learning,32,33 while effectively leveraging evolutionary relationships between enzymes.34–36

To address these challenges, we propose Venus-Deep Retrieval-Enhanced Adaptive Meta-learning (DREAM), a novel framework that combines the k-nearest neighbors (kNN)37 algorithm with few-shot learning38 for enzyme pHopt prediction. Our approach makes several key contributions: (1) it improves upon EpHod’s pH prediction accuracy through an enhanced model architecture;27 (2) it employs a retrieval-enhanced strategy39 to identify evolutionarily related enzymes; and (3) it utilizes meta-learning32 to enable rapid adaptation to new enzymes with limited data.

Results

We present a retrieval-enhanced framework for predicting the optimal pH of enzymes, treating it as a few-shot learning task. Our approach, Venus-DREAM, is a deep retrieval-enhanced adaptive meta-learning framework designed to predict enzyme optimal pH (see Figure 1). Venus-DREAM leverages the ESM-2 protein language model40 to generate enzyme sequence embeddings, combined with a k-nearest neighbors (kNN)37 algorithm and a few-shot learning approach. The framework consists of two main components: (1) a similarity-based retrieval module that identifies evolutionarily related enzymes as the support set, and (2) a meta-learning module based on the Reptile algorithm41 that enables rapid adaptation to new enzymes.

Figure 1.

Overview of the Venus-DREAM framework. (A) Construction of support set through similarity-based retrieval. The ESM-2 model encodes both target enzyme and database sequences into 1280-dimensional vectors for similarity computation. (B) Training process of Reptile algorithm for pH prediction. The model iteratively performs inner loop adaptation on different support sets to learn task-specific parameters, followed by meta-updates that move the global parameters toward these adapted states. (C) Inference process using the meta-trained model to predict pHopt for a target enzyme. (D) Comparison between our meta-learning-based approach and traditional k-NN method for pH prediction.

The model was evaluated on the benchmark data set established by EpHod,27 containing 9,855 enzymes with experimentally validated pHopt values. To ensure a fair comparison, we used EpHod’s predefined data splits. For the train, validation, and test sequences, we first obtained 1280-dimensional embeddings by mean pooling the ESM-2 outputs, then constructed support sets through similarity-based retrieval, and finally performed meta-learning optimization. Below, we analyze our model’s performance, including comparisons with baseline methods and ablation studies examining the impact of different components. To ensure robust evaluation, we applied 5-fold cross-validation to Venus-DREAM and its variants. Performance metrics are reported as averages across the five folds, accompanied by standard errors (error bars) to reflect variability. Because EpHod’s training code is not open-sourced, the baseline could not be subjected to cross-validation; we performed a single test using its provided model, so no error bars are available for it.

Overall Performance Comparison

We first evaluated Venus-DREAM, our proposed training strategy that enhances EpHod’s performance through retrieval-enhanced meta-learning. Inspired by the effectiveness of similarity-based methods demonstrated in recent work,28 where a simple kNN approach using ESM-2 embeddings showed strong predictive power, we developed a meta-learning framework that builds upon the pretrained EpHod model. Our framework retrieves the five most similar enzyme sequences (top-5) from the training set based on ESM-2 embedding cosine similarity to construct the support set for each target enzyme. This choice of k = 5 is supported by our comprehensive analysis of different k values (Supporting Figure 1), which shows optimal performance across all metrics at k = 5.

As shown in Table 1, our proposed training strategy demonstrates remarkable effectiveness in improving EpHod’s performance across multiple evaluation metrics. We started with the baseline EpHod model (MSE = 0.801) and progressively developed more sophisticated approaches. Our initial exploration focused on a straightforward KNN-based method (KNN, MSE = 0.761 ± 0.005), which predicts an enzyme’s pHopt by computing the average pH value of its five most similar proteins, identified through cosine similarity in the ESM-2 embedding space. The notable performance improvement achieved by this simple similarity-based approach reveals a fundamental characteristic of enzyme pHopt prediction: enzymes with similar sequence features tend to exhibit similar pHopt values. This key observation provides strong empirical evidence for the potential of similarity-based methods and motivates us to develop more sophisticated approaches to leverage this similarity information. Building upon this insight, we reformulated the pH prediction task within a few-shot learning framework, where similarity-based retrieved proteins serve as support sets to guide the model’s adaptation. This meta-learning strategy enables more sophisticated optimization of EpHod’s parameters through support set learning, allowing the model to effectively capture the relationship between sequence similarity and pHopt preferences.
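The kNN baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the toy 2-dimensional vectors are hypothetical stand-ins for the 1280-dimensional ESM-2 embeddings, and k = 3 is used only to keep the toy data small.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_predict_ph(query_emb, train_embs, train_ph, k=5):
    """Average the pH optima of the k training enzymes most similar to the query."""
    ranked = sorted(
        range(len(train_embs)),
        key=lambda i: cosine_similarity(query_emb, train_embs[i]),
        reverse=True,
    )
    neighbors = ranked[:k]
    prediction = sum(train_ph[i] for i in neighbors) / len(neighbors)
    return prediction, neighbors

# Toy 2-D "embeddings" and labels (hypothetical values, for illustration only).
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_ph = [7.0, 7.5, 8.0, 4.0, 4.5, 5.0]

pred, idx = knn_predict_ph([1.0, 0.05], train_embs, train_ph, k=3)
print(idx, pred)  # the three nearest neighbors are indices 0, 1, 2; their mean pH is 7.5
```

The same retrieval step, with the averaging replaced by a few gradient steps on the retrieved enzymes, is what turns this baseline into the few-shot formulation.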

Table 1. Performance Comparison of Different Models

| model | MSE ↓ | RMSE ↓ | MAE ↓ | R2 ↑ | Spearman ↑ |
| --- | --- | --- | --- | --- | --- |
| EpHod (ref 27) | 0.801 | 0.895 | 0.657 | 0.399 | 0.548 |
| ESM2 (k-NN) | 0.761 ± 0.005 | 0.872 ± 0.003 | 0.622 ± 0.002 | 0.429 ± 0.004 | 0.559 ± 0.003 |
| Venus-DREAM (MAML) | 0.696 ± 0.030 | 0.834 ± 0.018 | 0.602 ± 0.019 | 0.478 ± 0.023 | 0.606 ± 0.018 |
| Venus-DREAM (Reptile) | 0.654 ± 0.012 | 0.809 ± 0.007 | 0.578 ± 0.008 | 0.509 ± 0.009 | 0.629 ± 0.007 |

The effectiveness of our proposed approach is demonstrated through a series of progressive performance improvements across all evaluation metrics. Our MAML-based implementation (Venus-DREAM (MAML)) achieves an MSE of 0.696 ± 0.030, while the Reptile-based version (Venus-DREAM (Reptile)) further improves the MSE to 0.654 ± 0.012, an 18.4% improvement over the original EpHod model. The training process of Venus-DREAM (Reptile) exhibits stable convergence (Supporting Figure 2), characterized by a rapid initial decrease in loss followed by consistent optimization, indicating effective meta-learning. This pattern of enhancement is reflected across multiple performance metrics: the RMSE decreases from 0.895 to 0.809 ± 0.007 (9.6% improvement), while the MAE decreases from 0.657 to 0.578 ± 0.008 (12.0% improvement). Particularly noteworthy are the improvements in correlation metrics, which indicate enhanced capability in capturing the underlying relationships between enzyme sequences and their pHopt values. Specifically, the R2 value increases from 0.399 to 0.509 ± 0.009, while the Spearman correlation coefficient rises from 0.548 to 0.629 ± 0.007. The statistical significance of these improvements is further validated through detailed analysis (Supporting Figure 3), where a paired t test on the test data set (n = 1971) reveals significantly lower prediction errors (t = −8.94, p = 8.55 × 10^−19), though the effect size (Cohen’s d = −0.20) suggests the magnitude of improvement is modest. These improvements across different evaluation metrics validate the effectiveness of our proposed meta-learning framework (Figure 2).
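The percentages quoted above follow directly from the mean values in Table 1, as the short check below shows. The last lines rest on an assumption: if the reported Cohen's d is the paired-sample effect size (often written d_z), it equals t divided by the square root of n, which is consistent with the reported values.

```python
import math

def relative_reduction(baseline, new):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - new) / baseline

# Mean values from Table 1: (EpHod, Venus-DREAM (Reptile)).
table1 = {
    "MSE": (0.801, 0.654),
    "RMSE": (0.895, 0.809),
    "MAE": (0.657, 0.578),
}
reductions = {name: round(relative_reduction(b, n), 1) for name, (b, n) in table1.items()}
print(reductions)  # {'MSE': 18.4, 'RMSE': 9.6, 'MAE': 12.0}

# Assumption: paired-design effect size d_z = t / sqrt(n), with the reported t and n.
cohens_dz = -8.94 / math.sqrt(1971)
print(round(cohens_dz, 2))  # -0.2
```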

Figure 2.

Venus-DREAM (Reptile) prediction vs true pH.

Impact of Support Set Selection Strategies

To understand the importance of proper support set selection in our meta-learning framework, we conducted control experiments with different selection strategies (Table 2). Specifically, we compared our similarity-based selection with three alternative strategies: random selection, which randomly samples support sets for each adaptation step; fixed random selection, which uses the same randomly selected support sets throughout training and testing; and a train-test mismatch scenario where the model is trained with similarity-based selection but tested with random selection. To ensure robust evaluation, all experiments were conducted using 5-fold cross-validation, with results reported as means and standard errors across the five folds.

Table 2. Performance Comparison of Different Support Set Selection Strategies^a

| strategy | MSE | RMSE | MAE | R2 | Spearman |
| --- | --- | --- | --- | --- | --- |
| similarity-based | 0.654 ± 0.012 | 0.809 ± 0.007 | 0.578 ± 0.008 | 0.509 ± 0.009 | 0.629 ± 0.007 |
| random | 0.665 ± 0.008 | 0.815 ± 0.005 | 0.585 ± 0.006 | 0.501 ± 0.006 | 0.619 ± 0.007 |
| fixed random | 0.840 ± 0.020 | 0.916 ± 0.011 | 0.681 ± 0.015 | 0.370 ± 0.015 | 0.509 ± 0.020 |
| train-test mismatch | 0.679 ± 0.020 | 0.824 ± 0.012 | 0.595 ± 0.012 | 0.491 ± 0.015 | 0.615 ± 0.010 |

^a Results demonstrate the importance of consistent similarity-based selection in both training and testing.

The results demonstrate that improper support set selection can degrade model performance, though with varying degrees of impact. When replacing similarity-based selection with random selection, the MSE increases from 0.654 ± 0.012 to 0.665 ± 0.008, representing a modest but statistically consistent 1.7% performance drop across all folds. This relatively small difference highlights the robust generalization capability of our meta-learning framework, which can effectively extract predictive patterns regardless of support set composition. Nevertheless, statistical analysis across all five cross-validation folds confirms that similarity-based selection consistently outperforms random selection across all evaluation metrics, with an average improvement of 1.16%. While individual metrics show modest improvements (ranging from 0.66% to 1.43%), the Fisher’s combined probability test demonstrates that this consistent pattern of improvement is statistically significant (p = 0.008). This substantiates our biological hypothesis that evolutionarily related enzymes provide more relevant information for pH prediction.
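For readers unfamiliar with Fisher's combined probability test, the sketch below shows how several p-values are pooled into one. The per-metric p-values are not given in the text, so the inputs here are hypothetical placeholders and the output is not meant to reproduce the reported p = 0.008. Fisher's method also assumes the pooled p-values are independent.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom under H0."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Closed-form chi-square survival function, valid for even degrees of freedom (2k).
    combined = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
    return statistic, combined

# Hypothetical per-metric p-values (NOT taken from the paper).
stat, p_combined = fisher_combined_p([0.10, 0.08, 0.12, 0.09, 0.11])
print(stat, p_combined)
```

Individually non-significant but consistently small p-values can combine into a significant result, which is the situation the paragraph above describes.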

More dramatically, using fixed random support sets leads to an MSE of 0.840 ± 0.020, which is even worse than the baseline EpHod model’s performance. This substantial performance degradation occurs because fixed random selection prevents the model from experiencing diverse support sets during training, severely limiting its ability to generalize across different enzyme families and adaptation scenarios. The train-test mismatch scenario (MSE = 0.679 ± 0.020) further highlights the importance of maintaining consistent selection strategies between training and inference phases.

Impact of Support Set Size

We conducted experiments with different support set sizes (k = 5, 10, 15, 20) to determine the optimal number of similar sequences for pH prediction. As shown in Figure 3, all evaluation metrics consistently indicate that k = 5 achieves the best performance, with MSE = 0.654 ± 0.012, R2 = 0.509 ± 0.009, and Spearman correlation = 0.629 ± 0.007. Interestingly, as k increases beyond 5, we observe a gradual but consistent performance degradation across all metrics, with both error metrics (MSE, RMSE, MAE) showing steady increases and correlation metrics (R2 and Spearman) exhibiting continuous decreases. This degradation becomes particularly pronounced at k = 20, where the MSE increases by 4.4% compared to k = 5.

Figure 3.

Analysis of model performance under different conditions. (a) Impact of support set size (k), showing optimal performance at k = 5 with gradual degradation as k increases. (b) Effect of database size reduction, demonstrating remarkable robustness with only modest degradation even at 20% of the full training set. For both plots, error metrics (MSE, RMSE, MAE) are shown on the left y-axis, while correlation metrics (R2 and Spearman) are displayed on the right y-axis.

These results reveal an important trade-off between information richness and sequence relevance in support set construction. While larger support sets theoretically provide more reference information, they also introduce noise from less similar sequences, ultimately leading to decreased prediction accuracy. Based on these comprehensive findings, we selected k = 5 for all subsequent experiments as it provides the optimal balance between prediction accuracy and computational efficiency, a choice further validated by the consistency of this pattern across all evaluation metrics.

Impact of Data Availability

To investigate how our method performs with limited data resources, we conducted experiments using different proportions (20, 60, and 100%) of the original data set. This reduction affects both the meta-training process and the size of the retrieval database used during inference. The results, visualized in Figure 3, demonstrate that Venus-DREAM maintains remarkably robust performance even with substantially reduced data availability. When using 60% of the data, we observe only a moderate increase in MSE from 0.654 ± 0.012 to 0.679 ± 0.004, representing a mere 3.7% performance degradation. More impressively, even with only 20% of the data, the model maintains strong performance with MSE reaching 0.681 ± 0.017, indicating just a 6.7% degradation from full data performance. Similar patterns are observed in correlation metrics, with R2 gradually decreasing from 0.509 ± 0.009 to 0.476 ± 0.008 as available data is reduced.

This remarkable robustness to data size reduction can be attributed to two key factors. First, our meta-learning strategy effectively extracts generalizable patterns from fewer training examples. Second, even with a smaller retrieval database during inference, the similarity-based support set selection mechanism can still identify sufficiently relevant enzyme sequences for adaptation. These results highlight Venus-DREAM’s potential for scenarios where labeled data is scarce, making it particularly valuable for exploring novel enzyme families or specialized applications with limited experimental data.

Performance Analysis across pH Ranges

To comprehensively evaluate our model’s prediction capability, we compared Venus-DREAM with EpHod across different pH ranges (Figure 4). The training data distribution shows a clear imbalance across pH ranges, with 3,558 samples (approximately 50%) concentrated in the pH 7–8 range, followed by 1,469 samples in pH 6–7. This distribution reflects the natural prevalence of enzymes operating in near-neutral conditions. Despite this imbalance, Venus-DREAM demonstrates robust performance across most pH ranges. In the well-represented pH range of 7–8, Venus-DREAM achieves the best performance with an MAE of 0.368 ± 0.009, representing an 11.5% improvement over EpHod’s MAE of 0.416. The MSE in this range (0.259 ± 0.010 vs 0.328) further confirms Venus-DREAM’s superior prediction stability. Similarly, in the pH 6–7 range, Venus-DREAM maintains its advantage with a 14.6% lower MAE (0.602 ± 0.005 vs 0.705) compared to EpHod. The consistently lower MSE values of Venus-DREAM in these ranges indicate more stable predictions with fewer outliers. For instance, in the pH 6–7 range, Venus-DREAM’s MSE (0.610 ± 0.013) is 27.7% lower than EpHod’s (0.844), demonstrating better prediction reliability in common pH ranges.

Figure 4.

Analysis of model performance across pH ranges. The distribution of training samples (top left) shows the data availability for each pH range, while the MAE, MSE, and RMSE comparisons demonstrate the relative performance of Venus-DREAM versus the baseline model.

However, the performance difference becomes more nuanced in less represented pH ranges. In acidic conditions (pH 5–6), Venus-DREAM shows substantial improvement with 27.2% lower MAE (0.744 ± 0.019 vs 1.012), despite having only 660 training samples. Yet, at extreme pH values, particularly in highly alkaline conditions (pH > 9) where only 153 training samples are available, EpHod shows slightly better performance (MAE: 1.492 vs 1.601 ± 0.022). This suggests that while Venus-DREAM excels in well-represented conditions, it may require more data to effectively capture extreme pH adaptation mechanisms.
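A per-range breakdown like the one discussed above can be produced by grouping absolute errors by the pH bin of the true value. The sketch below is illustrative only: the function name, the half-open bins [lower, lower + 1), and the example values are assumptions, since the exact binning convention is not stated in the text.

```python
import math
from collections import defaultdict

def mae_by_ph_bin(y_true, y_pred, bin_width=1.0):
    """Mean absolute error grouped by the pH bin [lower, lower + bin_width) of the true value."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        lower = math.floor(t / bin_width) * bin_width
        errors[lower].append(abs(t - p))
    return {lower: sum(v) / len(v) for lower, v in sorted(errors.items())}

# Hypothetical true and predicted pH optima, for illustration only.
y_true = [7.2, 7.8, 6.5, 9.6]
y_pred = [7.0, 8.0, 7.0, 8.6]
per_bin = mae_by_ph_bin(y_true, y_pred)
print(per_bin)
```

Reporting the number of samples per bin alongside each MAE makes it easier to judge how much weight a sparsely populated range deserves.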

Discussion

Venus-DREAM demonstrates significant advantages over existing approaches in enzyme pHopt prediction, achieving an 18.4% reduction in MSE compared to the state-of-the-art EpHod model. This improvement stems from two key innovations: a similarity-based retrieval mechanism that identifies evolutionarily related enzymes as support sets, and a Reptile-based meta-learning strategy that outperforms both traditional methods and the MAML variant (MSE 0.654 ± 0.012 vs 0.696 ± 0.030) through efficient optimization without second-order derivatives.

Our analysis reveals that model performance strongly correlates with training data distribution, which serves dual purposes: meta-learning initialization and retrieval database construction. In the pH 6–8 range where training samples are abundant (>3000 samples), the rich database enables both effective meta-learning and high-quality support set retrieval, leading to superior prediction accuracy (MAE < 0.5). Conversely, the decreased accuracy at extreme pH values (<5 and >9) primarily stems from limited training examples (493 and 153 samples respectively), affecting both meta-learning initialization and retrieval quality. Additionally, extreme pH adaptation typically involves complex molecular mechanisms: multiple coordinated mutations across both catalytic sites and structural regions work together to maintain protein stability and activity.42–45 These distributed adaptive changes challenge our similarity-based approach: when pH adaptation requires mutations at many positions, even enzymes with high overall sequence similarity are more likely to differ at some of these critical positions, leading to distinct pH preferences. This differs from cases where pH adaptation is determined by fewer mutations: with fewer pH-determining positions, sequence differences are less likely to occur at these specific sites, making overall sequence similarity a more reliable indicator of functional similarity. To mitigate these challenges, we empirically determined the optimal support set size (k = 5), balancing between incorporating sufficient similar sequences and avoiding noise from less relevant examples.

The success of Venus-DREAM extends beyond pH prediction, introducing a new paradigm for protein property prediction through integrated similarity-based retrieval and meta-learning. The framework’s robustness is evidenced by its retaining 94.4% of the full-data Spearman correlation with only 20% of the training data, making it particularly valuable for novel enzyme families with scarce experimental data.

Looking forward, several directions could address current limitations. For extreme pH ranges, targeted data collection and advanced data augmentation techniques46 could improve prediction accuracy. The framework could be enhanced by incorporating structural information47–49 and protein language model pretraining on larger unlabeled sequence data sets. Future work could explore more sophisticated retrieval strategies that consider both global sequence patterns and local pH-determining regions, potentially improving support set quality for enzymes with complex pH adaptation mechanisms. Additionally, the framework could be extended to multitask scenarios, simultaneously predicting pHopt alongside other enzymatic properties such as temperature optimum and stability, thereby accelerating the development of industrial biocatalysts with desired properties.

Methods

Data Set and Base Model Architecture

We utilized the benchmark data set established by EpHod,27 which contains 9855 enzymes with experimentally validated pHopt values. Following their protocol, we maintained their predefined splits: 7124 sequences for training, 760 for validation, and 1971 for testing. Each sequence was truncated to a maximum length of 1022 residues to accommodate model constraints.

Our approach builds upon EpHod, the first large-scale deep learning model specifically designed for enzyme pHopt prediction. EpHod’s architecture consists of two components: a sequence encoder $f_{\mathrm{enc}}$ and a prediction head $f_{\mathrm{head}}$. For an enzyme sequence $S = (s_1, \cdots, s_L)$, where $s_i$ denotes the amino acid at position $i$, the sequence encoder employs the ESM-1v (ref 50) protein language model to generate a sequence embedding through mean pooling

$$\mathbf{h} = f_{\mathrm{enc}}(S) = \frac{1}{L} \sum_{i=1}^{L} \mathrm{ESM\text{-}1v}(S)_i \tag{1}$$

The prediction head, implemented as a Residual Light Attention Top (RLAT) module, then maps these embeddings to pH values

$$\hat{y} = f_{\mathrm{head}}(\mathbf{h}) \tag{2}$$

While maintaining this original architecture, we introduce a novel few-shot learning strategy to enhance the model’s prediction capability by leveraging information from similar enzyme sequences during both training and inference.
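Mean pooling, as used for the sequence embeddings above, simply averages the per-residue vectors along the sequence axis. The sketch below uses plain Python lists and hypothetical function names; a real pipeline would operate on the language model's output tensors and would typically exclude special tokens and padding before averaging.

```python
MAX_LEN = 1022  # maximum sequence length stated in the data set description

def truncate(sequence, max_len=MAX_LEN):
    """Keep at most max_len residues of an amino acid sequence."""
    return sequence[:max_len]

def mean_pool(residue_embeddings):
    """Average an L x D list of per-residue vectors into one D-dimensional embedding."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[d] for row in residue_embeddings) / length for d in range(dim)]

# Toy per-residue embeddings for a 3-residue sequence with D = 2 (hypothetical values).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy)
print(pooled)  # [3.0, 4.0]
```

Because the pooled vector has a fixed size regardless of sequence length, enzymes of different lengths can be compared directly.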

Meta-Learning Framework

To effectively leverage the information from support sets, we explore two meta-learning algorithms: Model-Agnostic Meta-Learning (MAML)33 and Reptile.41 Both algorithms aim to learn a good initialization for fast adaptation to new tasks, but they differ in their optimization approaches.

MAML explicitly optimizes for fast adaptation through a bilevel optimization process. In the inner loop, the model adapts to each target enzyme’s support set through gradient descent

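$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta; \mathcal{S}_i) \tag{3}$$

where $\theta$ denotes the trainable model parameters, $\alpha$ is the inner learning rate, and $\mathcal{S}_i$ represents the support set for task $i$. The outer loop then updates the model parameters to minimize the loss across all adapted models

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{i} \mathcal{L}_i(\theta_i') \tag{4}$$

where $\beta$ is the meta learning rate and $\mathcal{L}_i(\theta_i')$ is the loss of task $i$ evaluated at the adapted parameters. This approach requires computing second-order derivatives as it involves gradients through the adapted parameters.

In contrast, Reptile simplifies this process by performing multiple gradient steps on each support set and then moving the initial parameters toward the adapted parameters. For each target enzyme, the model performs T steps of gradient descent

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \alpha \nabla \mathcal{L}(\theta_i^{(t)}; \mathcal{S}_i), \quad t = 0, \ldots, T-1, \quad \theta_i^{(0)} = \theta \tag{5}$$

followed by a meta-update

$$\theta \leftarrow \theta + \beta \left( \theta_i^{(T)} - \theta \right) \tag{6}$$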

This simplified approach avoids the computational complexity of second-order derivatives while maintaining strong adaptation capabilities.
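The Reptile procedure (T inner gradient steps followed by an interpolation toward the adapted parameters) can be illustrated on a toy problem. The sketch below is not the authors' implementation: a two-parameter linear head stands in for the RLAT module, the tasks are synthetic, and the learning rates and step counts are arbitrary illustrative choices. All function names are hypothetical.

```python
import random

def predict(params, x):
    """Toy linear head: weights followed by a bias term."""
    *w, b = params
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def mse_grad(params, support):
    """Gradient of the mean squared error over a support set of (x, y) pairs."""
    grad = [0.0] * len(params)
    for x, y in support:
        err = predict(params, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(support)
        grad[-1] += 2.0 * err / len(support)
    return grad

def adapt(params, support, alpha=0.05, steps=5):
    """Inner loop: T plain gradient steps on one support set, starting from a copy."""
    adapted = list(params)
    for _ in range(steps):
        g = mse_grad(adapted, support)
        adapted = [p - alpha * gi for p, gi in zip(adapted, g)]
    return adapted

def reptile_step(params, support, alpha=0.05, beta=0.5, steps=5):
    """Meta-update: move the initialization toward the adapted parameters."""
    adapted = adapt(params, support, alpha, steps)
    return [p + beta * (a - p) for p, a in zip(params, adapted)]

# Synthetic tasks: y = 2x + 7 plus a small task-specific shift (hypothetical data).
rng = random.Random(0)

def make_task():
    shift = rng.uniform(-0.5, 0.5)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    return [([x], 2.0 * x + 7.0 + shift) for x in xs]

tasks = [make_task() for _ in range(200)]
theta = [0.0, 0.0]
for support in tasks:
    theta = reptile_step(theta, support)
print(theta)  # ends up near the shared solution (weight about 2, bias about 7)
```

No gradient is ever taken through the inner loop, which is why this update needs only first-order derivatives.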

Training and Inference Strategy

Our approach leverages the biological principle that enzymes with similar sequences often exhibit similar pH preferences. To exploit this relationship, we first construct support sets for all enzymes as a preprocessing step. For each enzyme sequence, we compute its embedding using the ESM-2 encoder and calculate its similarity with all training set sequences using cosine similarity

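$$\mathrm{sim}(\mathbf{e}_a, \mathbf{e}_b) = \frac{\mathbf{e}_a \cdot \mathbf{e}_b}{\lVert \mathbf{e}_a \rVert \, \lVert \mathbf{e}_b \rVert} \tag{7}$$

where $\mathbf{e}_a$ and $\mathbf{e}_b$ are the ESM-2 embeddings of the two enzymes being compared.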

The five most similar sequences are selected to form the enzyme’s support set. This preprocessing ensures consistent support sets throughout training and evaluation while capturing evolutionarily related sequences for each target enzyme. The choice of five sequences balances between information richness and computational efficiency, providing sufficient context for adaptation while avoiding noise from distant sequences.

The training process begins by initializing the model with pretrained EpHod weights. EpHod consists of two key components: the ESM-1v encoder for sequence representation and the RLAT module for pH prediction. While we use ESM-2 for similarity-based retrieval to construct support sets, the actual pH prediction is performed using the EpHod architecture. To maintain stable sequence representations, we freeze the ESM-1v encoder parameters and focus on optimizing the RLAT module through our meta-learning framework. For each training iteration, we sample a batch of target enzymes with their precomputed support sets. The model then adapts to each support set through multiple gradient steps. The loss function guiding both adaptation and meta-updates is defined as the mean squared error between predicted and actual pH values

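$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2 \tag{8}$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted pH values of the $N$ enzymes over which the loss is computed.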

After adaptation, a meta-update moves the model parameters toward the adapted state. This meta-learning process effectively captures the shared patterns across similar enzymes while allowing for task-specific adaptation. To prevent overfitting, we monitor the model’s performance every 200 steps on the validation set and implement early stopping after five epochs without improvement.

During inference, each prediction is treated as a few-shot learning task. Given a test enzyme, we first retrieve its precomputed support set of similar sequences. The model then performs five steps of adaptation starting from the meta-trained parameters, using the support set to fine-tune the RLAT module for this specific prediction task. After generating the pH prediction with the adapted parameters, the model is reset to its meta-trained state to ensure consistent performance across different predictions. This adaptation-during-inference strategy enables the model to leverage information from evolutionarily related enzymes while maintaining robust performance across diverse enzyme families.
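The adapt-then-reset pattern described above can be expressed by adapting a copy of the meta-trained parameters, so the shared state is never modified. The sketch below is a generic illustration with hypothetical names: the toy "head" is a single constant, and the inner learning rate is an arbitrary choice; only the five adaptation steps are taken from the text.

```python
import copy

def predict_with_adaptation(meta_params, support, target_x, predict_fn, grad_fn,
                            alpha=0.05, steps=5):
    """Adapt a copy of the meta-trained parameters on the support set, then predict."""
    params = copy.deepcopy(meta_params)  # the meta-trained state is left untouched
    for _ in range(steps):
        grads = grad_fn(params, support)
        params = [p - alpha * g for p, g in zip(params, grads)]
    return predict_fn(params, target_x)

# Toy head: predicts a single constant, regardless of the input.
def constant_predict(params, x):
    return params[0]

def constant_mse_grad(params, support):
    """Gradient of the mean squared error of a constant predictor."""
    return [sum(2.0 * (params[0] - y) for _, y in support) / len(support)]

meta_params = [7.0]                 # hypothetical meta-trained value
support = [(None, 5.0)] * 5         # five neighbors, all with pHopt 5.0
pred = predict_with_adaptation(meta_params, support, None,
                               constant_predict, constant_mse_grad)
print(pred, meta_params)  # prediction has moved toward 5.0; meta_params is still [7.0]
```

Working on a copy makes the reset implicit, so predictions for different target enzymes cannot influence one another.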

Implementation and Evaluation

The model was implemented using PyTorch 2.0.0 and trained on NVIDIA RTX 3090 GPUs with CUDA 11.7. We evaluate model performance using multiple complementary metrics. The mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE) provide different perspectives on prediction accuracy, with MSE being more sensitive to large errors and MAE providing a more intuitive measure of average deviation. The coefficient of determination (R2) and Spearman correlation coefficient (ρ) assess the model’s explanatory power and ability to preserve relative rankings, respectively

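$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{9}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{10}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{11}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \tag{12}$$

$$\rho = \frac{\sum_{i=1}^{n} \left( r_{y_i} - \bar{r}_{y} \right) \left( r_{\hat{y}_i} - \bar{r}_{\hat{y}} \right)}{\sqrt{\sum_{i=1}^{n} \left( r_{y_i} - \bar{r}_{y} \right)^2 \sum_{i=1}^{n} \left( r_{\hat{y}_i} - \bar{r}_{\hat{y}} \right)^2}} \tag{13}$$

where $y_i$ and $\hat{y}_i$ are the true and predicted pHopt values respectively, $n$ is the number of evaluated enzymes, $\bar{y}$ is the mean of true values, $r_{y_i}$ and $r_{\hat{y}_i}$ represent the ranks of true and predicted values, and $\bar{r}_{y}$, $\bar{r}_{\hat{y}}$ are their respective mean ranks.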
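The five metrics can be computed with the standard library alone, which is convenient for checking reported numbers. The sketch below uses hypothetical function names; the Spearman coefficient is obtained as the Pearson correlation of average ranks, which handles ties.

```python
import math

def rank(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R2": 1.0 - n * mse / ss_tot,
        "Spearman": pearson(rank(y_true), rank(y_pred)),
    }

# Hypothetical true and predicted pH optima, for illustration only.
metrics = regression_metrics([5.0, 6.0, 7.0, 8.0], [5.5, 6.0, 7.5, 7.0])
print(metrics)  # MSE 0.375, MAE 0.5, R2 0.7, Spearman 0.8 (up to floating-point rounding)
```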

Acknowledgments

This work was supported by Science and Technology Innovation Key R&D Program of Chongqing (No.CSTB2022TIAD-STX0017), the National Science Foundation of China (Grant Number 12104295), and the Postdoctoral Fellowship Program of CPSF under Grant Number GZC20241010.

Data Availability Statement

The source code and data for Venus-DREAM are freely available at https://github.com/zhangliang-sys/Venus-DREAM. The benchmark data set used in this study is based on the EpHod data set, which is included in our repository under the data/ directory. All data preprocessing scripts and detailed documentation are provided in the repository to ensure reproducibility.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c02291.

  • Additional analyses and results; KNN model performance analysis with different k values (Figure S1); training loss curve of Venus-DREAM using Reptile algorithm (Figure S2); statistical comparison between Venus-DREAM and EpHod (Figure S3); and statistical comparison of prediction errors (Table S1) (PDF)

Author Contributions

M.L. and L.H. conceptualized and supervised this research project. L.Z. and M.L. developed the methodology and designed the benchmark. L.Z. implemented the method, performed the computational experiments, and analyzed the results. L.Z. and M.L. wrote the manuscript. K.L., Z.Z., Y.Y., F.J., and B.W. provided valuable insights from biological perspectives and assisted in the interpretation of results. All authors reviewed and accepted the manuscript.

The authors declare no competing financial interest.

Supplementary Material

ci4c02291_si_001.pdf (491.1KB, pdf)



Articles from Journal of Chemical Information and Modeling are provided here courtesy of American Chemical Society
