2022 Oct 10;36:gzac009. doi: 10.1093/protein/gzac009

Table I.

Summary of statistical, machine learning, and deep learning models for biocatalyst engineering reported in the past 5 years [a]

| Section | Input | Predictive model | Output | Performance | Paper |
|---|---|---|---|---|---|
| 3.1 | Structural feature from trajectories of QM/MM transition interface sampling | Logistic regression | Reactive trajectory classifier | AUC = 0.89; Accuracy = 82% | 2019-Bonk (Bonk et al., 2019) |
| 3.1 | QM coordinate | Elastic net regression | Activation free energy | RMSD = 4.46 kcal/mol; R² = 0.28 | 2019-Esch (von der Esch et al., 2019) |
| 3.1 | QM coordinate; MM coordinate; MM charge | ANN | Energy | RMSD = 0.69 kcal/mol | 2021-Pan (Pan et al., 2021) |
| 3.2 | Substrate structure | Support vector machine | Activity score | Accuracy ~80% | 2017-Pertusi (Pertusi et al., 2017) |
| 3.2 | Sequence | Supervised machine learning decision tree | Substrate specificity | Accuracy = 0.94 | 2017-Chevrette (Chevrette et al., 2017) |
| 3.2 | Sequence; Enzyme structure | Phylogenetic analysis; Rosetta design calculation | Multipoint mutation | 100% active designs; 10- to 4,000-fold higher efficiencies | 2018-Khersonsky (Khersonsky et al., 2018) |
| 3.2 | Physicochemical property; Structural parameter | Decision tree | Activity classifier | Accuracy ~90% | 2018-Yang (Yang et al., 2018) |
| 3.2 | Graph kernel derived from protein coordinates | Gaussian process | Activity | Pearson r = 0.81 | 2020-Voutilainen (Voutilainen et al., 2020) |
| 3.2 | AA descriptor | CNN | AA type probability | ~70% in predicting the natural AA type | 2020-Shroff (Shroff et al., 2020) |
| 3.2 | Physicochemical feature of enzyme-substrate pairs | Classification: Random forest; Regression: Random forest | Activity classifier; Activity regressor | AUC = 0.89; R² = 0.75 | 2020-Robinson (Robinson et al., 2020) |
| 3.2 | Rosetta docking score; Electronic structure descriptor; Active-site descriptor | Logistic regression; Random forest; Gradient-boosted decision trees; Support vector machines | Activity classifier | Accuracy ~82%; ROC = 0.9 | 2021-Mou (Mou et al., 2021) |
| 3.2 | Sequence; Substrate connectivity | Classification: CNN; Regression: CNN | Activity classifier; Activity regressor | AUROC = 0.94; Spearman ρ = 0.89 | 2022-Xu (Xu et al., 2022) |
| 3.3 | Sequence | Partial least squares regression | Activation free energy | R² = 0.96 | 2018-Cadet (Cadet et al., 2018) |
| 3.3 | Sequence | Gradient boosting | Enantioselectivity | Pearson r = 0.65 | 2019-Wu (Wu et al., 2019) |
| 3.3 | Sequence; AA descriptor | CNN | Activity | AUROC = 0.88 | 2020-Xu (Xu et al., 2020) |
| 3.4 | Sequence; Reaction signature-based features | Classification: Gaussian process; Regression: Gaussian process | Reaction probability classifier; K_M regressor | AUC = 0.91; Q² = 0.78 [b] | 2016-Mellor (Mellor et al., 2016) |
| 3.4 | Structural features of enzyme mutants | Elastic net regularization | k_cat/K_M | Pearson r = 0.76; Spearman ρ = 0.55 | 2016-Carlin (Carlin et al., 2016) |
| 3.4 | Genome-scale metabolic parameter; Enzyme structure; Biochemistry property; Kinetic assay condition | Elastic net; Random forest; DNN | k_app,max | R² = 0.76 | 2018-Heckmann (Heckmann et al., 2018) |

(Continued)