Table 1. Summary of the protein sequence encodings, models, and acquisition functions tested in this work.
| Encoding | Dimension per Residue | Description |
|---|---|---|
| AAIndex | 4 | Continuous fixed amino acid descriptors |
| Georgiev [71] | 19 | Continuous fixed amino acid descriptors |
| Onehot | 20 | Categorical (which amino acid) |
| ESM2 [33] | 1280 | Learned embedding from a protein language model (ESM2 with 650 million parameters) |

| Model | Bayesian? | Deep Learning? | Description |
|---|---|---|---|
| Boosting Ensemble | N | N | An ensemble of 5 boosting models |
| Gaussian Process (GP) | Y | N | A collection of continuous functions described by a posterior |
| DNN Ensemble | N | Y | An ensemble of 5 multilayer perceptrons (deep neural networks, DNNs) |
| Deep Kernel Learning (DKL) [29] | Y | Y | A GP on the last layer of a deep neural network |

| Acquisition Function | Deterministic? | Description |
|---|---|---|
| Greedy | Y | Acquires the maximum value of the mean from the posterior |
| Upper Confidence Bound (UCB) | Y | Acquires the maximum value of a certain confidence interval from the posterior (tuned by a hyperparameter) |
| Thompson Sampling (TS) | N | Acquires the maximum value of a random function sampled from the posterior |
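
The three acquisition functions in the table can be illustrated with a minimal sketch. It assumes the posterior is available as a matrix of sampled function values over a candidate set (for example, one row per ensemble member or per GP posterior draw), and that a single candidate is acquired at a time. The function names, the `samples` array, and the `beta` hyperparameter are hypothetical choices made for this illustration and are not taken from this work's implementation, which may differ (for instance, by acquiring batches).

```python
import numpy as np


def greedy(samples):
    """Index of the candidate with the highest posterior mean."""
    return int(np.argmax(samples.mean(axis=0)))


def ucb(samples, beta=2.0):
    """Index of the candidate maximizing mean + beta * std.

    `beta` (an assumed name) sets the width of the confidence interval.
    """
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    return int(np.argmax(mean + beta * std))


def thompson(samples, rng):
    """Index of the maximum of one randomly chosen posterior draw."""
    draw = samples[rng.integers(samples.shape[0])]
    return int(np.argmax(draw))


# Toy posterior: 3 draws (rows) over 4 candidate sequences (columns).
# Candidate 2 has a low mean but high uncertainty.
samples = np.array([
    [1.0, 2.0, 0.0, 1.5],
    [1.0, 2.0, 4.0, 1.5],
    [1.0, 2.0, -1.0, 1.5],
])

rng = np.random.default_rng(0)
print("greedy:", greedy(samples))
print("ucb:", ucb(samples, beta=2.0))
print("thompson:", thompson(samples, rng))
```

In this toy example, Greedy picks the candidate with the best mean, UCB with a large enough `beta` prefers the uncertain candidate, and Thompson Sampling can return either depending on which draw is selected, which is why it is the only non-deterministic entry in the table.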