. 2022 Oct 10;36:gzac009. doi: 10.1093/protein/gzac009

Table I.

Summary of statistical, machine learning, and deep learning models for biocatalyst engineering reported in the past 5 years^a

Section	Input	Predictive Model	Output	Performance	Paper
3.1	Structural feature from trajectories of QM/MM transition interface sampling	Logistic regression	Reactive trajectory classifier	AUC = 0.89; Accuracy = 82%	2019-Bonk (Bonk et al., 2019)
3.1	QM coordinate	Elastic net regression	Activation free energy	RMSD = 4.46 kcal/mol; R² = 0.28	2019-Esch (von der Esch et al., 2019)
3.1	QM coordinate; MM coordinate; MM charge	ANN	Energy	RMSD = 0.69 kcal/mol	2021-Pan (Pan et al., 2021)
3.2	Substrate structure	Support vector machine	Activity score	Accuracy ~80%	2017-Pertusi (Pertusi et al., 2017)
3.2	Sequence	Supervised machine learning decision tree	Substrate specificity	Accuracy = 0.94	2017-Chevrette (Chevrette et al., 2017)
3.2	Sequence; Enzyme structure	Phylogenetic analysis; Rosetta design calculation	Multipoint mutation	100% active designs; 10- to 4,000-fold higher efficiencies	2018-Khersonsky (Khersonsky et al., 2018)
3.2	Physicochemical property; Structural parameter	Decision tree	Activity classifier	Accuracy ~90%	2018-Yang (Yang et al., 2018)
3.2	Graph kernel derived from protein coordinates	Gaussian process	Activity	Pearson r = 0.81	2020-Voutilainen (Voutilainen et al., 2020)
3.2	AA descriptor	CNN	AA type probability	~70% in predicting the natural AA type	2020-Shroff (Shroff et al., 2020)
3.2	Physicochemical feature of enzyme-substrate pairs	Classification: Random forest; Regression: Random forest	Activity classifier; Activity regressor	AUC = 0.89; R² = 0.75	2020-Robinson (Robinson et al., 2020)
3.2	Rosetta docking score; Electronic structure descriptor; Active-site descriptor	Logistic regression; Random forest; Gradient-boosted decision trees; Support vector machines	Activity classifier	Accuracy ~82%; ROC = 0.9	2021-Mou (Mou et al., 2021)
3.2	Sequence; Substrate connectivity	Classification: CNN; Regression: CNN	Activity classifier; Activity regressor	AUROC = 0.94; Spearman ρ = 0.89	2022-Xu (Xu et al., 2022)
3.3	Sequence	Partial least squares regression	Activation free energy	R² = 0.96	2018-Cadet (Cadet et al., 2018)
3.3	Sequence	Gradient boosting	Enantioselectivity	Pearson r = 0.65	2019-Wu (Wu et al., 2019)
3.3	Sequence; AA descriptor	CNN	Activity	AUROC = 0.88	2020-Xu (Xu et al., 2020)
3.4	Sequence; Reaction signature-based features	Classification: Gaussian process; Regression: Gaussian process	Reaction probability classifier; K_M regressor	AUC = 0.91; Q² = 0.78^b	2016-Mellor (Mellor et al., 2016)
3.4	Structural features of enzyme mutants	Elastic net regularization	k_cat/K_M	Pearson r = 0.76; Spearman ρ = 0.55	2016-Carlin (Carlin et al., 2016)
3.4	Genome-scale metabolic parameter; Enzyme structure; Biochemistry property; Kinetic assay condition	Elastic net; Random forest; DNN	k_app,max	R² = 0.76	2018-Heckmann (Heckmann et al., 2018)

(continued)