Abstract
Machine learning (ML) constructs predictive models by learning the relationship between protein sequences and their functions, enabling efficient identification of protein sequences with high fitness values without falling into the local optima that trap directed evolution. However, extracting the most pertinent functional information from a limited number of protein sequences is vital for optimizing the performance of ML models. Here, we propose scut_ProFP (Protein Fitness Predictor), a predictive framework that integrates feature combination and feature selection techniques. Feature combination offers comprehensive sequence information, while feature selection searches for the most beneficial features to enhance model performance, enabling accurate sequence‐to‐function mapping. Compared to similar frameworks, scut_ProFP demonstrates superior performance and is also competitive with more complex deep learning models—ECNet, EVmutation, and UniRep. In addition, scut_ProFP enables generalization from low‐order mutants to high‐order mutants. Finally, we utilized scut_ProFP to simulate the engineering of the fluorescent protein CreiLOV and highly enriched high‐fluorescence mutants based on only a small number of low‐fluorescence mutants. Overall, the developed method is advantageous for ML in protein engineering, providing an effective approach to data‐driven protein engineering. The code and datasets for scut_ProFP are available at https://github.com/Zhang66-star/scut_ProFP.
Keywords: feature engineering, machine learning, protein engineering, protein fitness landscape
1. INTRODUCTION
Protein function is determined by its amino acid sequence, and the diversity of amino acid sequences gives rise to a wide range of protein functions. The correlation between the two can be represented by the protein fitness landscape (Romero et al. 2013). Protein engineering amounts to searching for the amino acid sequence corresponding to the highest point on this fitness landscape (Freschlin et al. 2022). However, the space of possible amino acid sequences is vast and grows exponentially with the number of amino acid residues considered, so exhaustive exploration of the landscape by experiment, computation, or any other means is impossible (Yang et al. 2019). Moreover, functional proteins are exceedingly rare: as the desired functional level rises, the number of sequences exhibiting that function diminishes exponentially (Romero and Arnold 2009). Directed evolution addresses these challenges with a greedy local search and is widely recognized as an effective approach for exploring the protein fitness landscape (Wittmann et al. 2021). However, this procedure is costly and time-consuming (Wang et al. 2021), and epistatic effects can trap it in local optima (Hu et al. 2023). Recently, data‐driven ML has emerged as a novel and effective approach for exploring the protein fitness landscape, and it plays an important role in protein engineering.
Unlike directed evolution, ML utilizes sequence and screening data to construct a model that captures the mapping between the sequence and function of mutants (Xu et al. 2020). It can accurately predict the fitness of unknown protein mutants, thereby replacing expensive laboratory experiments and significantly enhancing screening capabilities. ML models can make predictions even when the biophysical characteristics of the target protein are not fully understood (Yang et al. 2019). Furthermore, ML techniques effectively utilize the low-fitness mutant data that directed evolution discards to gain a comprehensive understanding of the protein fitness landscape (Sankara Narayanan and Runthala 2022; Yang et al. 2019). This avoids entrapment in local optima and allows for the discovery of mutants with higher fitness values, or even the globally optimal mutant.
The choice of methods for encoding amino acid sequences is a crucial factor in determining the success of ML in protein engineering: it dictates the information available to the ML algorithms and greatly influences the outcomes. Notably, each encoding method is predicated on a specific property, whereas protein function is often determined by a combination of properties. A single sequence encoding method may therefore fail to capture complex biomolecular structures and properties, constraining the performance of ML models. Earlier studies have demonstrated improved ML performance by augmenting sequence encoding methods with additional techniques or by combining them with other encoding methods. For instance, Cadet et al. converted the digital vector encoded by the Amino Acid index (AAindex, referred to as AAI in this paper) into protein spectra using the Fast Fourier transform (FFT) and successfully achieved the enantioselective directed evolution of Aspergillus niger epoxide hydrolase (Cadet et al. 2018). Similarly, Fontaine et al. improved the predictive performance and accuracy of an ML model by utilizing multiple indices from the AAI database together with FFT (Fontaine et al. 2019). Mckenna et al. introduced a protein predictive framework called PySAR by combining protein spectra with multiple descriptors, achieving strong predictive capabilities on the cytochrome P450s dataset (Mckenna and Dubey 2022). However, the applicability of this method across datasets has not been validated. Furthermore, while the aforementioned studies employed FFT, it may not be appropriate for all algorithms (Siedhoff et al. 2021). Combining multiple descriptor encoding methods can enhance the predictive capability of a model, although it may also result in feature redundancy in certain scenarios.
Moreover, as the number of descriptors increases, the “curse of dimensionality” can arise, complicating model training and inference and necessitating a larger amount of data to obtain reliable results.
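To make the protein-spectrum idea mentioned above concrete, the following minimal sketch encodes a sequence with one AAindex-style property scale and takes its Fourier amplitude spectrum. The Kyte-Doolittle hydrophobicity values and the example sequence are illustrative assumptions, not the specific indices used in the cited studies.

```python
import numpy as np

# One illustrative AAindex-style property scale (Kyte-Doolittle hydropathy);
# any of the 566 AAindex entries could be substituted here.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def protein_spectrum(seq: str) -> np.ndarray:
    """Encode a sequence with one AAindex property and return the magnitude
    of its discrete Fourier transform (the 'protein spectrum')."""
    signal = np.array([KD[aa] for aa in seq], dtype=float)
    signal -= signal.mean()              # remove the DC component
    return np.abs(np.fft.rfft(signal))   # one-sided amplitude spectrum

spec = protein_spectrum("MKTAYIAKQR")    # hypothetical 10-residue sequence
print(spec.shape)  # (6,)
```

Because the spectrum has a fixed length for sequences of equal length and is insensitive to the absolute scale of the property index, it can serve directly as a feature vector for a regression model.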
To address these issues, we proposed a novel protein fitness predictive framework named scut_ProFP, as illustrated in Figure 1. Within this framework, we combined features from AAI and multiple protein descriptors to construct predictive models using five ensemble algorithms: RF, GBR, XGB, Ada, and Bag. Three feature selection methods then optimized the higher‐ranked models. The proposed method outperforms similar frameworks and is competitive with more complex deep learning models. In one instantiation, it achieved maximum performance on 6 out of 10 deep mutational scanning (DMS) datasets—more than any other method. Further experiments on combinatorial mutagenesis datasets showed that scut_ProFP generalizes from low‐order to high‐order mutants. Additionally, using data from low‐fluorescent mutants, scut_ProFP successfully simulated the directed evolution of the fluorescent protein CreiLOV, significantly enriching high‐fluorescence mutants.
FIGURE 1.

Schematic diagram of the scut_ProFP protein fitness predictive framework.
2. RESULTS
2.1. Performance comparison of models from different algorithms
To determine the best predictive model for protein fitness, we first evaluated and analyzed five ensemble algorithms (RF, GBR, XGB, Ada, and Bag) across three different datasets: GR, ANEH, and P450s. A total of 566 × 15 = 8490 models were constructed under each algorithm. Figure 2 shows the distribution of R² and RMSE values for all models under each algorithm. The results indicate that the five ensemble algorithms have similar performance distributions across the three datasets. Among them, the GBR algorithm performed best on all tasks: the best models for the GR, ANEH, and P450s datasets were all produced by GBR, with R² values of 0.895, 0.796, and 0.727, and RMSE values of 10.210, 14.439, and 2.419, respectively (Table S1, Supporting Information). The RF and Ada algorithms performed similarly, ranking second only to GBR. In contrast, model performance under the XGB and Bag algorithms varied considerably; nevertheless, their best predictive models scored higher than those of the RF and Ada algorithms. We also compared the ensemble algorithms with general regression algorithms, including PLS, LR, DTR, KNN, and SVR. Across all datasets, the ensemble algorithms consistently outperformed the general regression algorithms, particularly LR, SVR, and DTR (Figure S1). We therefore recommend GBR as the preferred algorithm for the scut_ProFP predictive framework.
FIGURE 2.

Violin plots of R² and RMSE values for each ensemble algorithm. (a, d) GR dataset; (b, e) ANEH dataset; (c, f) P450s dataset.
2.2. Performance comparison of different feature combination encoding methods
To determine the most advantageous feature combination encoding method for predicting protein fitness, we performed a statistical analysis of the R² and RMSE scores of all models under each method. Each feature combination encoding method resulted in the construction of 566 × 5 = 2830 models. As shown in Figure 3, no feature combination encoding method performs best on all tasks. However, three methods, AAI_AAC_DC, AAI_AAC_CT, and AAI_DC_CT, perform well across a range of scenarios and are considered the primary feature encoding methods for the scut_ProFP framework. To further demonstrate their effectiveness, we conducted combinatorial feature ablation experiments using the top three models from each dataset as examples (Table S1). The combined feature encoding method achieved the highest scores for all models on the GR and ANEH datasets (Table S2), outperforming all individual feature encoding methods. On the P450s dataset, it achieved the second-highest performance score, following the AAI encoding method (Table S2).
FIGURE 3.

Box plots of R² and RMSE values for different feature combination encoding methods. (a, b) GR dataset; (c, d) ANEH dataset; (e, f) P450s dataset.
2.3. Searching for the optimal feature subset of combined features
While combining features can improve model performance, it may also introduce redundant features and increase feature dimensionality, leading to a more complex model. In particular, irrelevant features can reduce predictive accuracy (Table S2). Therefore, using the top three ranked models from each dataset (Table S1), we applied three feature selection methods (section 4.3) to identify the optimal subset of combined features. The results, shown in Figure 4 and Table S3, indicate significant performance improvement with each feature selection method, followed by varying degrees of fluctuation as the number of features increased. The Shap + SFS method performed best: on the GR, ANEH, and P450s datasets it achieved the highest R² values of 0.962, 0.858, and 0.837 based on 107D, 264D, and 30D feature subsets, respectively. The SFS method performed similarly but required 10–20 times more running time (Figure S2), particularly for higher-dimensional features. The Shap method performed the worst. Furthermore, after Shap + SFS feature selection, the feature dimension was significantly reduced compared to the SFS method (Table S3). These findings suggest that Shap + SFS outperforms SFS in removing redundant information and noise from feature vectors.
FIGURE 4.

R² performance plots of the top three models with different feature selection methods. (a–c) GR dataset; (d–f) ANEH dataset; (g–i) P450s dataset.
Taken together, we offer a detailed guide for using the scut_ProFP framework:
For algorithm selection, the GBR algorithm is recommended as it performed best across all tasks.
For feature encoding, the combined methods AAI_AAC_DC, AAI_AAC_CT, and AAI_DC_CT are suggested, as they showed top performance in multiple scenarios.
For feature selection, the Shap + SFS method is advised due to its ability to significantly improve model performance in a short time.
By following these recommendations, researchers can achieve better results using this predictive framework. The subsequent experiments in this study follow these guidelines.
2.4. Comparison with existing methods
To demonstrate the superiority of the scut_ProFP framework, we first compared it with the PyPEF (Siedhoff et al. 2021) and PySAR (Mckenna and Dubey 2022) frameworks. Both frameworks, like ours, encode sequences using AAI or protein descriptors and employ traditional algorithms to build models. Our method clearly outperforms both in terms of R², MAE, and Spearman's ρ, on our datasets as well as independent ones, even surpassing neural network methods such as MLP (Table 1). PyPEF performed best on RMSE for the P450s and RmaNOD datasets, likely owing to hyperparameter tuning, which neither our method nor PySAR performs. PyPEF's feature extraction is limited to AAI, restricting the sequence information available, while PySAR, despite using multiple feature extraction techniques, suffers from feature redundancy. In contrast, our method effectively balances these two aspects, achieving superior predictive performance.
TABLE 1.
Comparison of performance scores between scut_ProFP and similar frameworks.
| Dataset | Method | Algorithm | R² | MAE | RMSE | ρ |
|---|---|---|---|---|---|---|
| GR | scut_ProFP | GBR | 0.962 | 5.149 | 7.095 | 0.946 |
| | PyPEF | RF | 0.935 | 10.282 | 7.322 | 0.936 |
| | PySAR | PLS | 0.882 | 8.060 | 9.871 | 0.903 |
| ANEH | scut_ProFP | GBR | 0.858 | 8.665 | 12.168 | 0.925 |
| | PyPEF | RF | 0.819 | 9.578 | 14.044 | 0.917 |
| | PySAR | Bag | 0.800 | 11.102 | 15.732 | 0.880 |
| P450s | scut_ProFP | GBR | 0.837 | 1.574 | 1.946 | 0.899 |
| | PyPEF | MLP | 0.803 | 2.311 | 1.775 | 0.877 |
| | PySAR | RF | 0.665 | 2.196 | 2.739 | 0.776 |
| RmaNOD | scut_ProFP | GBR | 0.755 | 0.128 | 0.184 | 0.857 |
| | PyPEF | RF | 0.629 | 0.231 | 0.170 | 0.788 |
| | PySAR | RF | 0.751 | 0.130 | 0.190 | 0.846 |
Bold indicates the best scores obtained for each evaluation metric.
In addition, we compared scut_ProFP with three deep learning methods—ECNet (Luo et al. 2021), EVmutation (Hopf et al. 2017), and UniRep (Alley et al. 2019)—across 10 DMS datasets (Chen et al. 2023a; Hsu et al. 2022) to assess their performance in predicting sequence fitness rankings. For each dataset, 20% of the data was set aside for testing, while 24, 48, 96, 192, 288, 384, and 480 samples (considered low‐N datasets) were randomly selected from the remaining 80% for training. Furthermore, to allow a more direct comparison with EVmutation and to evaluate model performance on high‐N datasets, we also trained the models using the full remaining 80% of the data. For each training set size, including the 80/20 split, we performed 5‐fold cross‐validation on the training set and evaluated performance on the test set, averaging the results across five random seed partitions. Apart from the unsupervised EVmutation, whose performance is independent of training set size, all supervised methods improve significantly with more training data. Among these, scut_ProFP exhibits the best average performance, followed by ECNet and UniRep (Figure 5a). Although scut_ProFP does not achieve performance that is independent of training set size, it performs well on both low‐N and high‐N training sets. When 24 ≤ N ≤ 96, scut_ProFP outperforms ECNet on 7 out of 10 datasets and UniRep on 8 out of 10 datasets. Once the training set size reaches 192, scut_ProFP begins to outperform EVmutation on 7 out of 10 datasets (Figure S3). Under the 80/20 data split, scut_ProFP also achieves the best average performance, with the highest performance on 6 out of 10 datasets (compared to 3 for ECNet, 1 for EVmutation, and none for UniRep) (Table S4 and Figure 5b). In conclusion, scut_ProFP remains competitive with more complex deep learning models such as ECNet, EVmutation, and UniRep.
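The low-N evaluation protocol described above can be sketched on synthetic data. The random features, the hidden linear fitness function, and the use of a default GBR as a stand-in model are all assumptions for illustration only; the real evaluation uses the encoded DMS sequences.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # stand-in for encoded mutant sequences
y = X @ rng.normal(size=10)       # hidden linear "fitness" function

N = 96                            # one of the low-N training set sizes
rhos = []
for seed in range(5):             # average over five random partitions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    # Sample N training points from the 80% split, as in the protocol.
    idx = np.random.default_rng(seed).choice(len(y_tr), size=N, replace=False)
    model = GradientBoostingRegressor(random_state=0).fit(X_tr[idx], y_tr[idx])
    rhos.append(spearmanr(y_te, model.predict(X_te))[0])

print(len(rhos))  # 5
```

Reporting the mean and standard deviation of `rhos` reproduces the shaded-band presentation used in Figure 5a.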
FIGURE 5.

Comparison with other deep learning modeling methods. (a) The mean Spearman's ρ across all 10 datasets. The horizontal axis shows the number of training data used. The performance value for each dataset represents the mean of five different splits, with the shaded area indicating the standard deviation (same for the following figures). (b) The mean Spearman's ρ for each DMS dataset when trained on 80% of the data and tested on the remaining 20%. The given performance represents the mean of five different splits.
2.5. Extrapolation from lower to higher‐order mutants
Constructing and screening higher‐order mutants usually requires extensive experimental work and time (Luo et al. 2021). Therefore, the ability to train ML models using fitness data from lower‐order mutants to accurately predict the fitness of higher‐order mutants is crucial for protein engineering. We evaluated the ability of scut_ProFP to predict the fitness ranking of double mutants using data from single mutants on the GB1 and YAP1 datasets (Chen et al. 2023a). The results showed that as the size of the training set increased, the performance of all supervised methods improved. For the GB1 dataset, when N ≥ 288, the performance of all models tended to stabilize, with scut_ProFP stabilizing at around Spearman's ρ = 0.75, outperforming all other methods (Figure 6a). In the YAP1 dataset, ECNet slightly outperformed scut_ProFP when N = 288 (Figure 6b). When trained on the full single‐mutant data, scut_ProFP achieved the highest performance in the GB1 dataset (Spearman's ρ = 0.764) and performed second only to ECNet (Spearman's ρ = 0.669) in the YAP1 dataset (Spearman's ρ = 0.632) (Figure S4). Additionally, we evaluated the ability of scut_ProFP to predict different higher‐order mutants using 1–3 order mutants in the CreiLOV dataset (Chen et al. 2023b). The results indicated that as the number of mutations increased, the performance of both scut_ProFP and ECNet gradually declined. Among them, scut_ProFP slightly outperforms ECNet in predicting 4–8 site mutants, while it performs slightly worse than ECNet in predicting mutants with 9 or more mutations. However, both methods outperform EVmutation (Figure 6c). It is worth noting that the performance of EVmutation improves in predicting higher‐order mutants, possibly because its epistatic model captures interactions between residue pairs more effectively. In summary, scut_ProFP can generalize from lower‐order to higher‐order mutants and, in many cases, is comparable to the ECNet model.
FIGURE 6.

Extrapolation from lower to higher‐order mutants. (a, b) The mean Spearman's ρ for predicting double‐mutant fitness using different amounts of single‐mutant fitness data to train the model on the GB1 and YAP1 datasets, respectively. (c) The mean Spearman's ρ for predicting the fitness of 4–14 different higher‐order mutants, respectively, when trained using 1–3 order mutants on the CreiLOV dataset. (d) The mean Spearman's ρ for predicting different epistasis on the GB1 dataset.
Research shows that epistasis exists between mutations, meaning that the fitness of combined mutants may not always equal the sum of the fitnesses of the individual single mutants (Breen et al. 2012). To analyze whether scut_ProFP can capture the epistasis between mutations, we correlated observed epistasis with predicted epistasis using the GB1 dataset. Epistasis was calculated using the method described by Luo et al. (2021). Based on the positive or negative values of epistasis, we classified double mutants into positive and negative epistasis categories. The results indicate that no single method performs best across all types of epistasis predictions. Compared to EVmutation, which explicitly models epistasis, scut_ProFP, ECNet, and UniRep exhibit better correlation between predicted and observed epistasis, and the performance of these three methods is comparable (Figure 6d). This suggests that scut_ProFP is capable of capturing residue dependencies, achieving prediction performance comparable to complex models like ECNet.
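The additive (non-epistatic) expectation underlying this analysis can be written as a one-line function. The log-scale fitness values in the example are hypothetical, and the sign convention follows the common definition of epistasis as observed minus expected double-mutant fitness.

```python
def epistasis(f_double, f_single_a, f_single_b, f_wt=0.0):
    """Deviation of the observed double-mutant fitness from the additive
    expectation (fitness values assumed to be on a log scale relative to
    the wild type, whose fitness is f_wt)."""
    expected = f_wt + (f_single_a - f_wt) + (f_single_b - f_wt)
    return f_double - expected

# Hypothetical log-fitness values: positive epistasis, since the double
# mutant is better than the sum of the two single-mutant effects.
eps = epistasis(f_double=1.2, f_single_a=0.4, f_single_b=0.5)
print(round(eps, 3))  # 0.3
```

Classifying double mutants by the sign of this quantity yields the positive/negative epistasis categories compared in Figure 6d.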
2.6. Simulating the engineering of fluorescent protein CreiLOV using scut_ProFP
In protein engineering, aside from the fact that high‐order mutants are difficult to construct and screen, mutants with high‐fitness values are even rarer and difficult to obtain directly. That is, most mutations result in reduced or complete loss of protein function (Worth et al. 2011). Therefore, predicting mutants with high fitness values from low fitness value mutants is equally important for protein engineering. We simulated the scut_ProFP‐guided protein evolution process using the fluorescent protein CreiLOV as a model protein. The fluorescent protein CreiLOV dataset comprises 165,428 data points, encompassing all possible combinations of mutants for 20 residues at 15 sites (Chen et al. 2023b). Based on the study above, we randomly selected 192 data points from mutants with low fluorescence values (fluorescence value <1.8; the maximum fluorescence value being 2.508) for modeling. This model was then used to predict the fluorescence values of the remaining 165,236 mutants. Subsequently, the top 50 predicted mutants were selected each time for validation and iterative learning.
As shown in Figure 7, during the initial round of ML, the model obtained an R² value of 0.879 and a Spearman's ρ value of 0.941 on the test set (Figure 7c). Compared to the wild type (fluorescence value of 1.216), the top 50 mutants in the predictive ranking showed significantly improved fluorescence values (Figure 7b), a success rate of 100%. The best mutant achieved a fluorescence value of 2.444, placing it fourth among all mutants, and the globally optimal mutant ranked 206th in this round of prediction (Table S5). During the subsequent iterative learning process, the model's predictive performance plateaued in the second round. The fluorescence values of the top 50 predicted mutants were concentrated at high levels (Figure 7b). In this iteration, a mutant with a fluorescence value of 2.442 was identified, ranking fifth globally, and the globally optimal mutant's ranking improved to 75th (Table S5). Although the globally optimal mutant rose to third place in the predictions of the third round of ML, the fluorescence values of the top 50 predicted mutants and the ranking of the globally optimal mutant subsequently began to decline, particularly during the fourth round of iterative learning (Table S5 and Figure 7b). This can be attributed to the growing proportion of high-fitness mutants in the training data, which causes the predictive model to overfit the high-fitness region. The effect is evident in every learning iteration: the predictive model consistently improves its performance on the training set (Figure 7a), while its predictive performance on the test set declines (Figure 7c). An effective remedy is therefore not to focus solely on incorporating high-fitness mutants during iterative learning; a wider range of data is more favorable for model learning and enables accurate assessment of unseen data.
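The iterative, model-guided enrichment loop described above can be sketched as follows. The random feature matrix and hidden linear fitness function stand in for the real CreiLOV sequence encodings and fluorescence measurements; GBR with default parameters mirrors the framework's recommended algorithm.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the CreiLOV library: random features with a hidden
# linear fitness function (the real framework uses AAI + descriptor features).
X_all = rng.normal(size=(2000, 8))
y_all = X_all @ rng.normal(size=8) + 0.1 * rng.normal(size=2000)

# Seed the model with low-fitness variants only, as in the simulation.
low_idx = np.argsort(y_all)[:192]
train_idx = list(low_idx)

for round_no in range(3):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_all[train_idx], y_all[train_idx])
    pool = np.setdiff1d(np.arange(len(y_all)), train_idx)  # unseen mutants
    preds = model.predict(X_all[pool])
    top50 = pool[np.argsort(preds)[::-1][:50]]  # "validate" the top-50 picks
    train_idx.extend(top50)                     # iterative learning

# The final picks are typically strongly enriched relative to the library mean.
print(y_all[top50].mean() > y_all.mean())
```

As the text notes, feeding only high-fitness picks back into training narrows the label distribution; mixing in a broader sample of validated mutants keeps the model calibrated across the full fitness range.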
FIGURE 7.

Directed evolution of the fluorescent protein CreiLOV, guided by the scut_ProFP framework. (a) Performance scores of each round of ML on the training set. (b) The violin plots of the fluorescence values of mutants input for each round of ML and the fluorescence values of the predicted top 50 mutants. The black dotted line represents the fluorescence value of the wild type, while the green dotted line represents the fluorescence value of the globally optimal mutant. (c) Plots for the data fitting on the test set for each round of ML.
3. DISCUSSION
In this work, we propose a universal predictive framework named scut_ProFP, which significantly improves the prediction performance of ML models by integrating feature combination and feature selection techniques. Among the five ensemble algorithms tested, the GBR algorithm demonstrated outstanding performance across multiple tasks and is considered the preferred algorithm (Figure 2). Different feature combination encoding methods showed similar performance distributions across different datasets, with AAI_AAC_DC, AAI_AAC_CT, and AAI_DC_CT performing particularly well (Figure 3). Although feature combination can enhance model performance, the improvement is limited. Feature selection techniques allow for further enhancement of model performance. Among the three feature selection methods tested, the Shap + SFS method performed best in eliminating redundant features (Figure 4), providing an effective strategy for determining the optimal feature subset and improving model performance. Based on these findings, we provide a guide for other researchers to use scut_ProFP for various protein engineering tasks (see section 2.3).
Among the existing predictive frameworks, PyPEF (Siedhoff et al. 2021) and PySAR (Mckenna and Dubey 2022) are the two models most similar to scut_ProFP, as they both rely solely on sequence features for modeling. However, scut_ProFP outperforms these two frameworks across various application scenarios (Table 1). This is mainly because PyPEF only employs a single method for sequence feature extraction, while PySAR combines multiple feature extraction methods but fails to effectively handle the resulting redundant features. In contrast, scut_ProFP achieves superior predictive performance by integrating sequence feature combinations with redundancy removal. Moreover, with the rapid development of artificial intelligence, an increasing number of deep learning models have emerged. These models no longer rely solely on sequence features but incorporate evolutionary information from homologous proteins or learn semantically rich representations from large‐scale sequence datasets. In this study, we compared scut_ProFP with three more complex deep learning models: ECNet (Luo et al. 2021), EVmutation (Hopf et al. 2017), and UniRep (Alley et al. 2019). Although scut_ProFP did not achieve high performance independent of training set size, it shows relatively good performance in both low‐N and high‐N datasets (Figures 5 and S3). Additionally, scut_ProFP is capable of capturing epistasis in sequence mutations and can generalize from low‐order mutants to high‐order mutants (Figure 6). When applied to the engineering design simulation of the fluorescent protein CreiLOV, scut_ProFP successfully enriched a large number of high‐fluorescence mutants, even with only a small amount of data from low‐fluorescence mutants (Figure 7).
Although scut_ProFP has demonstrated outstanding performance across multiple testing tasks, there are still areas for future improvement. The AAI encompasses 566 different indices (Kawashima and Kanehisa 2000). When combined with other descriptors, it results in a large number of feature combinations, leading to high computational resource demands and lengthy testing processes. Through principal component analysis (PCA) or by combining subsets of AAI, several new descriptors have been derived, such as sPairs (Tanaka and Scheraga 1976), sScales (Biou et al. 1988), Z‐scales (Sandberg et al. 1998), T‐scale (Tian et al. 2007), and ProFET (Ofer and Linial 2015). Future research will consider using these processed descriptors as substitutes for AAI encoding methods. This approach aims to reduce the number of feature combinations and enhance the scalability of scut_ProFP in terms of computational resources. Furthermore, integrating evolutionary information from homologous sequences into the scut_ProFP framework holds promise for further improving its performance in predicting epistasis and extrapolation. Given the potential complex interactions between features generated by different encoding methods, combining multiple feature encoding techniques increases the difficulty of model interpretation and understanding. Apart from using Shap + SFS feature selection to eliminate irrelevant features, techniques like PCA and t‐distributed Stochastic Neighbor Embedding (t‐SNE) can be considered to reduce feature dimensions (Anowar et al. 2021), simplify the model, and thereby improve interpretability and comprehensibility.
In conclusion, our work provides a valuable tool for predicting protein fitness in various scenarios. This tool is expected to guide experimental work in laboratories, reducing the manpower, material, and financial resources required for wet experiments, thus promoting the advancement of protein engineering.
4. MATERIALS AND METHODS
4.1. Data collection and preparation
The GR, ANEH, P450s, and RmaNOD datasets used in this study were obtained from supplementary information provided by Siedhoff et al. (Siedhoff et al. 2021). These datasets cover various scenarios in protein engineering, including protein absorption wavelength shifts, enantioselectivity, and thermal stability. All DMS datasets and combined mutation datasets were obtained from supplemental information provided by Hsu et al. and Chen et al. (Chen et al. 2023a; Hsu et al. 2022).
4.2. Feature extraction and combination
This study utilizes two primary amino acid sequence encoding methods. The first is the AAI, which reflects various physicochemical properties of amino acids and contains a total of 566 property indices, each related to one or more properties (Kawashima and Kanehisa 2000). The second is protein descriptors, which reflect amino acid composition and other sequence information, including Amino Acid Composition (AAC) (Carugo 2008), Dipeptide Composition (DC) (Raghava and Han 2005), Conjoint Triad (CT) (Shen et al. 2007), Geary Autocorrelation (GA) (Liang et al. 2017), Sequence‐Order‐Coupling Number (SOCN), and Quasi‐Sequence‐Order (QSO) (Chou 2000). Table S6 provides a concise overview of each feature encoding method. Before being used as model input, the feature vectors obtained with these encoding methods were merged in the format of one AAI index plus two protein descriptors, yielding 15 distinct feature combination encoding methods: AAI_AAC_DC, AAI_AAC_CT, AAI_AAC_GA, AAI_AAC_SOCN, AAI_AAC_QSO, AAI_DC_CT, AAI_DC_GA, AAI_DC_SOCN, AAI_DC_QSO, AAI_CT_GA, AAI_CT_SOCN, AAI_CT_QSO, AAI_GA_SOCN, AAI_GA_QSO, and AAI_SOCN_QSO. Each combination method comprises 566 distinct feature sets, one per AAI index, which are used to construct models in conjunction with various algorithms. All of the above feature encodings were calculated using the PySAR library (Mckenna and Dubey 2022).
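A minimal sketch of the descriptor-combination step, assuming a simple per-residue AAindex encoding as the first block; the real framework computes these descriptors with the PySAR library, and the zero-valued AAI vector and example sequence are placeholders.

```python
from itertools import product
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Amino Acid Composition: 20-dim vector of residue frequencies."""
    return np.array([seq.count(a) / len(seq) for a in AAS])

def dc(seq):
    """Dipeptide Composition: 400-dim vector of ordered-pair frequencies."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return np.array([pairs.count(a + b) / len(pairs)
                     for a, b in product(AAS, AAS)])

def combine(seq, aai_vec):
    """AAI_AAC_DC-style combination: concatenate a per-residue AAindex
    encoding with the AAC and DC descriptor blocks."""
    return np.concatenate([aai_vec, aac(seq), dc(seq)])

seq = "MKTAYIAKQR"                 # hypothetical 10-residue sequence
aai_vec = np.zeros(len(seq))       # placeholder AAindex encoding
print(combine(seq, aai_vec).shape)  # (430,)
```

The concatenated vector (10 AAI values + 20 AAC + 400 DC dimensions here) is what the downstream feature selection step then prunes.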
4.3. Feature selection
The features used establish the upper limit of what an ML model can achieve, and effective feature selection can bring the model close to this limit. Feature selection can also reduce the number of features to mitigate high-dimensionality issues, shorten training time, improve generalization, and minimize the risk of overfitting. This study introduces and compares three feature selection methods for identifying the optimal subset of the combined features described above: (i) Shap: the Shap algorithm was used to assess the importance of each feature in the combined features (Lundberg and Lee 2017); based on the importance ranking, the top 10 features were sequentially incorporated into the model. (ii) SFS: the sequential forward search algorithm was used to search the combined features one by one (Pudil et al. 1994). (iii) Shap + SFS: a fusion of the two methods. First, the Shap algorithm assessed the importance of each feature within the combined set; then, depending on the dimension of the combined features, the top i features (i = 300 when the dimension exceeds 300; otherwise, i equals the dimension) were selected and a sequential forward search was conducted over them.
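A sketch of the two-stage Shap + SFS idea on synthetic data. For brevity, tree feature importances replace Shap values in the ranking stage (an assumption, not the paper's method), and scikit-learn's SequentialFeatureSelector performs the forward search over the pre-ranked subset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data with a few informative features among many.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

# Stage 1 -- importance ranking. The paper ranks features with Shap values;
# tree feature importances serve as a lightweight stand-in here.
rank_model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)
order = np.argsort(rank_model.feature_importances_)[::-1]
top_i = order[:10]   # keep only the top-ranked features (i capped at 300 in the paper)

# Stage 2 -- sequential forward search restricted to the pre-ranked subset.
sfs = SequentialFeatureSelector(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    n_features_to_select=5, direction="forward", cv=5,
).fit(X[:, top_i], y)

selected = top_i[sfs.get_support()]
print(len(selected))  # 5
```

Restricting the forward search to the pre-ranked subset is what gives Shap + SFS its speed advantage over plain SFS: the search space shrinks from all features to at most the top i.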
4.4. Predictive models
This study employed five ensemble regression algorithms to construct the scut_ProFP protein fitness predictive framework: Random Forest (RF), Gradient Boosting Regression (GBR), Extreme Gradient Boosting (XGB), Adaptive Boosting (Ada), and Bagging (Bag). Five additional regression algorithms were also tested: Partial Least Squares Regression (PLS), Logistic Regression (LR), Decision Tree Regression (DTR), Support Vector Regression (SVR), and K‐Nearest Neighbors (KNN). The algorithms used in this study were sourced from the Scikit‐Learn toolkit (Pedregosa et al. 2011), and the default parameters were employed for each algorithm.
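With default parameters throughout, instantiating and scoring the ensemble models reduces to a few lines. A minimal sketch with synthetic data is shown below; XGB comes from the separate xgboost package (`xgboost.XGBRegressor`) and is omitted here to keep the example scikit-learn-only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.model_selection import cross_val_score

# Four of the five ensemble algorithms with scikit-learn defaults.
models = {
    "RF": RandomForestRegressor(),
    "GBR": GradientBoostingRegressor(),
    "Ada": AdaBoostRegressor(),
    "Bag": BaggingRegressor(),
}

# Synthetic stand-in for an encoded sequence-fitness dataset.
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=0)
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
```

The same loop extends to the five non-ensemble baselines (PLS, LR, DTR, SVR, KNN) by adding their scikit-learn estimators to the dictionary.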
4.5. Model validation and evaluation
The study conducted 10‐fold cross‐validation and evaluated the performance of all constructed models using four commonly used regression model evaluation metrics: the coefficient of determination (R²), the mean absolute error (MAE), the root mean square error (RMSE), and the Spearman correlation coefficient (ρ) (Chai and Draxler 2014; Renaud and Victoria‐Feser 2010; Sedgwick 2014). The four evaluation metrics are calculated as follows:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} (1)

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| (2)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} (3)

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} (4)

where y_i is the true value, \hat{y}_i is the predicted value, \bar{y} is the mean of all actual observations, d_i is the rank difference between the true value y_i and the predicted value \hat{y}_i, and n is the sample size.
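All four metrics are available off the shelf; the helper below (the function name `evaluate` is our own) computes them for one fold's predictions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Compute the four regression metrics used to score each model."""
    rho, _ = spearmanr(y_true, y_pred)  # rank correlation, ignores p-value
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "rho": rho,
    }
```

Under 10-fold cross-validation, each metric is computed per fold and the fold values are averaged.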
4.6. Model comparisons
We compared scut_ProFP with several existing methods, including predictive frameworks of the same type, as well as supervised and unsupervised deep learning models.
PyPEF. PyPEF is a generalized protein engineering framework for performing data‐driven protein engineering that combines signal processing and statistical physics, as proposed by Siedhoff et al. (Siedhoff et al. 2021). We downloaded the source code of PyPEF from https://github.com/Protein-Engineering-Framework/PyPEF.
PySAR. PySAR is a sequence activity relationship predictive model that combines protein spectra and protein descriptors, as proposed by Mckenna et al. (Mckenna and Dubey 2022). We downloaded the source code of PySAR from https://github.com/amckenna41/pySAR.
ECNet. ECNet is an integrated neural network model that combines local and global evolutionary information from homologous sequences with the original sequence information, as proposed by Luo et al. (Luo et al. 2021). We set up the files required to run ECNet as described in its GitHub repository at https://github.com/luoyunan/ECNet. Homologous sequences were searched against the Uniclust30 database (version uniclust30_2018_08) using the parameters described by Luo et al., and ECNet was run in ensemble mode.
EVmutation. EVmutation is an unsupervised probabilistic model based on evolutionary information proposed by Hopf et al. (Hopf et al. 2017). It explains epistasis by explicitly modeling interactions between all the pairs of residues in proteins, and then quantifies the effects of mutations, including multiple mutations, simultaneously. In this work, we cloned the source code of EVmutation from https://github.com/debbiemarkslab and followed the description therein to generate the predictions of EVmutation.
UniRep. UniRep is an unsupervised protein language model proposed by Alley et al. (Alley et al. 2019). It converts an entire protein sequence into a 1900‐dimensional feature vector that can be used to train a top model. We downloaded the source code of UniRep from https://github.com/churchlab/UniRep. For a fair comparison, we selected the GBR algorithm as the top model for UniRep.
AUTHOR CONTRIBUTIONS
Zhihui Zhang: Conceptualization; methodology; software; formal analysis; writing – original draft. Zhixuan Li: Methodology; validation; investigation. Qianyue Wang: Software; data curation. Hanlin Wu: Software. Manli Yang: Supervision; project administration. Fengguang Zhao: Supervision; project administration. Mingkui Tan: Methodology; resources. Shuangyan Han: Conceptualization; resources; writing – review and editing; funding acquisition.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
Supporting information
Data S1. Supporting Information.
Table S5. Ranking of all CreiLOV mutants in each round of machine learning prediction.
ACKNOWLEDGMENTS
This work was supported by the National Key R&D Program of China (No. 2021YFC2100400). We would like to thank EditChecks (https://editchecks.com.cn/) for providing linguistic assistance during the preparation of this manuscript.
Zhang Z, Li Z, Wang Q, Wu H, Yang M, Zhao F, et al. A protein fitness predictive framework based on feature combination and intelligent searching. Protein Science. 2024;33(12):e5211. 10.1002/pro.5211
Review Editor: Nir Ben‐Tal
REFERENCES
- Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence‐based deep representation learning. Nat Methods. 2019;16:1315–1322.
- Anowar F, Sadaoui S, Selim B. Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, Isomap, LE, ICA, t‐SNE). Comput Sci Rev. 2021;40:100378.
- Biou V, Gibrat JF, Levin JM, Robson B, Garnier J. Secondary structure prediction: combination of three different methods. Protein Eng Des Sel. 1988;2:185–191.
- Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Epistasis as the primary factor in molecular evolution. Nature. 2012;490:535–538.
- Cadet F, Fontaine N, Li G, Sanchis J, Ng Fuk Chong M, Pandjaitan R, et al. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes. Sci Rep. 2018;8:16757.
- Carugo O. Amino acid composition and protein dimension. Protein Sci. 2008;17:2187–2191.
- Chai T, Draxler RR. Root mean square error (RMSE) or mean absolute error (MAE)?—arguments against avoiding RMSE in the literature. Geosci Model Dev. 2014;7:1247–1250.
- Chen L, Zhang Z, Li Z, Li R, Huo R, Chen L, et al. Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Syst. 2023a;14:706–721.e5.
- Chen Y, Hu R, Li K, Zhang Y, Fu L, Zhang J, et al. Deep mutational scanning of an oxygen‐independent fluorescent protein CreiLOV for comprehensive profiling of mutational and epistatic effects. ACS Synth Biol. 2023b;12:1461–1473.
- Chou K‐C. Prediction of protein subcellular locations by incorporating quasi‐sequence‐order effect. Biochem Biophys Res Commun. 2000;278:477–483.
- Fontaine NT, Cadet XF, Vetrivel I. Novel descriptors and digital signal processing‐based method for protein sequence activity relationship study. Int J Mol Sci. 2019;20:5640.
- Freschlin CR, Fahlberg SA, Romero PA. Machine learning to navigate fitness landscapes for protein engineering. Curr Opin Biotechnol. 2022;75:102713.
- Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CPI, Springer M, Sander C, et al. Mutation effects predicted from sequence co‐variation. Nat Biotechnol. 2017;35:128–135.
- Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay‐labeled data. Nat Biotechnol. 2022;40:1114–1122.
- Hu R, Fu L, Chen Y, Chen J, Qiao Y, Si T. Protein engineering via Bayesian optimization‐guided evolutionary algorithm and robotic experiments. Brief Bioinform. 2023;24:bbac570.
- Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000;28:374.
- Liang Y, Liu S, Zhang S. Geary autocorrelation and DCCA coefficient: application to predict apoptosis protein subcellular localization via PSSM. Physica A. 2017;467:296–306.
- Lundberg SM, Lee S‐I. A unified approach to interpreting model predictions. In: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017); Long Beach, CA, USA; 2017.
- Luo Y, Jiang G, Yu T, Liu Y, Vo L, Ding H, et al. ECNet is an evolutionary context‐integrated deep learning framework for protein engineering. Nat Commun. 2021;12:5743.
- Mckenna A, Dubey S. Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. J Biomed Inform. 2022;128:104016.
- Ofer D, Linial M. ProFET: feature engineering captures high‐level protein functions. Bioinformatics. 2015;31:3429–3436.
- Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit‐learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
- Pudil P, Novovičová J, Kittler J. Floating search methods in feature selection. Pattern Recognit Lett. 1994;15:1119–1125.
- Raghava GP, Han JH. Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinf. 2005;6:1–14.
- Renaud O, Victoria‐Feser M‐P. A robust coefficient of determination for regression. J Stat Plan Inference. 2010;140:1852–1862.
- Romero PA, Arnold FH. Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol. 2009;10:866–876.
- Romero PA, Krause A, Arnold FH. Navigating the protein fitness landscape with Gaussian processes. Proc Natl Acad Sci U S A. 2013;110:E193–E201.
- Sandberg M, Eriksson L, Jonsson J, Sjöström M, Wold S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem. 1998;41:2481–2491.
- Sankara Narayanan P, Runthala A. Accurate computational evolution of proteins and its dependence on deep learning and machine learning strategies. Biocatal Biotransformation. 2022;40:169–181.
- Sedgwick P. Spearman's rank correlation coefficient. BMJ. 2014;349:g7327.
- Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007;104:4337–4341.
- Siedhoff NE, Illig AM, Schwaneberg U, Davari MD. PyPEF—an integrated framework for data‐driven protein engineering. J Chem Inf Model. 2021;61:3463–3476.
- Tanaka S, Scheraga HA. Medium‐ and long‐range interaction parameters between amino acids for predicting three‐dimensional structures of proteins. Macromolecules. 1976;9:945–950.
- Tian F, Zhou P, Li Z. T‐scale as a novel vector of topological descriptors for amino acids and its application in QSARs of peptides. J Mol Struct. 2007;830:106–115.
- Wang Y, Xue P, Cao M, Yu T, Lane ST, Zhao H. Directed evolution: methodologies and applications. Chem Rev. 2021;121:12384–12444.
- Wittmann BJ, Yue Y, Arnold FH. Informed training set design enables efficient machine learning‐assisted directed protein evolution. Cell Syst. 2021;12:1026–1045.e7.
- Worth CL, Preissner R, Blundell TL. SDM—a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res. 2011;39:W215–W222.
- Xu Y, Verma D, Sheridan RP, Liaw A, Ma J, Marshall NM, et al. Deep dive into machine learning models for protein engineering. J Chem Inf Model. 2020;60:2773–2790.
- Yang KK, Wu Z, Arnold FH. Machine‐learning‐guided directed evolution for protein engineering. Nat Methods. 2019;16:687–694.