PLOS Computational Biology. 2025 Jan 7;21(1):e1012639. doi: 10.1371/journal.pcbi.1012639

Benchmarking uncertainty quantification for protein engineering

Kevin P Greenman 1,2,3, Ava P Amini 4,*, Kevin K Yang 4,*
Editor: Rachel Kolodny
PMCID: PMC11741572  PMID: 39775201

Abstract

Machine learning sequence-function models for proteins could enable significant advances in protein engineering, especially when paired with state-of-the-art methods to select new sequences for property optimization and/or model improvement. Such methods (Bayesian optimization and active learning) require calibrated estimations of model uncertainty. While studies have benchmarked a variety of deep learning uncertainty quantification (UQ) methods on standard and molecular machine-learning datasets, it is not clear if these results extend to protein datasets. In this work, we implemented a panel of deep learning UQ methods on regression tasks from the Fitness Landscape Inference for Proteins (FLIP) benchmark. We compared results across different degrees of distributional shift using metrics that assess each UQ method’s accuracy, calibration, coverage, width, and rank correlation. Additionally, we compared these metrics using one-hot encoding and pretrained language model representations, and we tested the UQ methods in retrospective active learning and Bayesian optimization settings. Our results indicate that there is no single best UQ method across all datasets, splits, and metrics, and that uncertainty-based sampling is often unable to outperform greedy sampling in Bayesian optimization. These benchmarks enable us to provide recommendations for more effective design of biological sequences using machine learning.

Author summary

Protein engineering has previously benefited from the use of machine learning models to guide the choice of new experiments. In many cases, the goal of conducting new experiments is optimizing for a property or improving the machine learning model. Many standard methods for these two tasks require good estimates of the uncertainty in the model’s predictions. Several methods for quantifying this uncertainty exist and have been benchmarked on datasets from other domains (e.g. small molecules), but it is not clear whether these results also apply for proteins. To address this, we evaluated a range of uncertainty quantification approaches on tasks derived from a protein-focused benchmark dataset. We tested performance on different degrees of distributional shift between the training and testing sets and on different representations of the sequences, and we assessed performance in terms of several standard metrics. Finally, we used the uncertainties for property optimization and model improvement. Our findings indicate that no single uncertainty estimation method excels across all scenarios. Moreover, uncertainty-based strategies for property optimization often did not outperform simpler methods that did not consider uncertainty. This research offers insights for the more efficacious application of machine learning in the realm of biological sequence design.

Introduction

Machine learning (ML) has already begun to accelerate the field of protein engineering by providing low-cost predictions of phenomena that require time- and resource-intensive labeling by experiments or physics-based simulations [1]. It is often necessary to have an estimate of model uncertainty in addition to the property prediction, as the performance of an ML model can be highly dependent on the domain shift between its training and testing data [2]. Because protein engineering data is often collected in a manner that violates the independent and identically distributed (i.i.d.) assumptions of many ML approaches [3], tailored ML methods are required to guide the selection of new experiments from a protein landscape. Uncertainty quantification (UQ) can inform the selection of experiments in order to improve an ML model or optimize protein function through active learning (AL) or Bayesian optimization (BO).

In chemistry and materials science, several studies have benchmarked common UQ methods against one another on standard datasets and have used or developed appropriate metrics to quantify the quality of these uncertainty estimates [4–9]. These works have illustrated that the best choice of UQ method can depend on the dataset and other considerations such as representation and scaling. While some protein engineering work has leveraged uncertainty estimates, these studies have been mostly limited to single UQ methods such as convolutional neural network (CNN) ensembles [10] or Gaussian processes (GPs) [11, 12].

Gruver et al. compared CNN ensembles to GPs (using traditional representations and pre-trained BERT [13] language model embeddings) in Bayesian optimization tasks [14]. They found that CNN ensembles are often more robust to distribution shift than other types of models. Additionally, they report that most model types have more poorly calibrated uncertainties on out-of-domain samples. However, a more comprehensive study of CNN UQ methods, evaluated using a variety of uncertainty quality metrics, has not been done. A comparison of uncertainty methods on different protein representations (e.g., one-hot encodings or embeddings from protein language models) in an active learning setting is also lacking.

In this work, we evaluate a panel of UQ methods for protein sequence-function prediction on a set of standardized, public protein datasets (Fig 1). Our chosen datasets included splits with varied degrees of domain extrapolation, which enabled method evaluation in a setting similar to what might be experienced while collecting new experimental data for protein engineering. We assessed each model using a variety of metrics that captured different aspects of desired performance, including accuracy, calibration, coverage, width, and rank correlation. Additionally, we compared the performance of the UQ methods on one-hot encoded sequence representations and on embeddings computed from the ESM-1b protein masked language model [15]. We find that the quality of UQ estimates is dependent on the landscape, task, and embedding, and that no single method consistently outperforms all others. We also evaluated the UQ methods in an active learning setting with several acquisition functions, and demonstrated that uncertainty-based sampling often outperforms random sampling (especially in later stages of active learning), although better-calibrated uncertainty does not necessarily equate to better active learning. Finally, we tested the UQ methods in Bayesian optimization and found that while BO typically outperformed random sampling, no uncertainty-based strategy outperformed a greedy baseline. We envision that the understanding gained from this work will enable more effective development and application of UQ techniques to machine learning in protein engineering.

Fig 1. Approach, datasets, and tasks.


(A) Schematic of the approach for benchmarking uncertainty quantification (UQ) in machine learning for protein engineering. A panel of UQ methods were evaluated on protein fitness datasets to assess the quality of the uncertainty estimates and their utility in active learning and Bayesian optimization. (B) Our study utilized three protein datasets/landscapes and different train-validation-test split tasks within each dataset. These datasets and tasks covered a range of sample diversities and domain shifts (task difficulties).

Results and discussion

Uncertainty quantification

Our first goal was to evaluate the calibration and quality of a variety of UQ methods. We implemented seven uncertainty methods for this benchmark: linear Bayesian ridge regression (BRR) [16, 17], Gaussian processes (GPs) [18], and five methods using variations on a convolutional neural network (CNN) architecture. The CNN implementation from FLIP [3] provided the core architecture used by our dropout [19], ensemble [20], evidential [21], mean-variance estimation (MVE) [22], and last-layer stochastic variational inference (SVI) [23] methods. Additional model details are provided in the Methods section.

The landscapes used in this work were taken from the Fitness Landscape Inference for Proteins (FLIP) benchmark [3]. These include the binding domain of an immunoglobulin binding protein (GB1), adeno-associated virus stability (AAV), and thermostability (Meltome) landscapes, which cover a large sequence space and a broad range of protein families. The FLIP benchmark includes several train-test splits, or tasks, for each landscape. Most of these tasks are designed to mimic common, real-world data collection scenarios and are thus a more realistic assessment of generalization than random train-test splits. However, random splits are also included as a point of reference. We chose 8 of the 15 FLIP tasks to benchmark the panel of uncertainty methods. We selected these tasks to be representative of several regimes of domain shift: random sampling with no domain shift (AAV/Random, Meltome/Random, and GB1/Random); the highest (and most relevant) domain-shift regimes (AAV/Sampled vs. Designed and GB1/1 vs. Rest); and less aggressive domain shifts (AAV/7 vs. Rest, GB1/2 vs. Rest, and GB1/3 vs. Rest). The Datasets section of the Methods provides notes on the nomenclature used for these tasks.

We trained the seven models on each of the eight tasks described above and evaluated their performance on the test set using the metrics described in the Evaluation Metrics section. We compare model calibration and accuracy in Fig 2 and the percent coverage versus average width relative to range in Fig 3. These figures illustrate the results for models trained on the embeddings from a pretrained ESM language model [15]; the corresponding results using one-hot encodings are shown in Figs A and B in S1 Appendix.

Fig 2. Miscalibration area vs. root mean square error (RMSE).


For the (A) AAV, (B) Meltome, and (C) GB1 landscapes. Miscalibration area (also called the area under the calibration error curve or AUCE) quantifies the absolute difference between the calibration plot and perfect calibration. It is desirable to have a model that is both accurate and well-calibrated, so the best performing points are those closest to the lower left corner of the plots. Each point represents an average of 5 models trained using different random seeds for initialization of the CNN parameters and batching / stochastic gradient descent. Fig A in S1 Appendix shows the corresponding results for the OHE representation. See the Uncertainty Methods section for an explanation of points for which experiments were not feasible (e.g. there is no GP Continuous model result for the AAV landscape due to memory constraints for training these models).

Fig 3. Coverage vs. average width / range.


For the (A) AAV, (B) Meltome, and (C) GB1 landscapes. Coverage is the percentage of true values that fall within the 95% confidence interval (±2σ) of each prediction, and the width is the size of the 95% confidence region (4σ) relative to the range R of the training set (4σ/R). A good model exhibits high coverage and low width, which corresponds to the upper left of each plot. The horizontal dashed line indicates 95% coverage. Each point represents an average of 5 models trained using different random seeds for initialization of the CNN parameters and batching / stochastic gradient descent. Fig B in S1 Appendix shows the corresponding results for the OHE representation. See the Uncertainty Methods section for an explanation of several points for which experiments were not feasible (e.g. there is no GP Continuous model result for the AAV landscape due to memory constraints for training these models).

As expected, the splits with the least required domain extrapolation tend to have more accurate models (lower RMSE; Fig 2). However, the relationship between miscalibration area and extrapolation is less clear; some models are highly calibrated on the most difficult (highest domain shift) splits, while others are poorly calibrated even on random splits. There is no single method that performs consistently well across splits and landscapes, but some trends can be observed. For example, ensembling is often one of the highest accuracy CNN models, but also one of the most poorly calibrated. Additionally, GP and BRR models are often better calibrated than CNN models. For the AAV and GB1 landscapes (Fig 2a and 2c), model miscalibration area usually increases slightly while RMSE increases more substantially with increasing domain shift.

In addition to accuracy and calibration, we assessed each method in terms of the coverage and width of its uncertainty estimates. A good uncertainty method results in high coverage (ideally, the true value falls within the 95% confidence region 95% of the time) while still maintaining a small average width. The latter is necessary because predicting a very large and uniform value of uncertainty for every point would result in good coverage, so coverage alone is not sufficient. Fig 3 illustrates that many methods perform relatively well in either coverage or width (corresponding to the top and left limits of the plot, respectively), but few methods perform well in both. As in Fig 2, there is an observable trend that more challenging splits are further from the optimal part (upper left) of the plot; this trend is clearer for the GB1 splits (Fig 3c) than for the AAV splits. Most models trained on the AAV landscape (Fig 3a) have a similar average width/range ratio for all splits, but for the GB1 landscape (Fig 3c), this ratio typically increases as the domain shift increases. The locations of the sets of points for each model type shared some similarities across landscapes. CNN SVI often has low coverage and low width, CNN MVE often has moderate coverage and moderate width, and CNN Evidential and BRR often have high coverage and high width. These trends across landscapes could point to a general problem of under- or over-confidence with some model types, and indicate that post-hoc calibration may be necessary. The results for all prediction and uncertainty metrics (along with their standard deviations across 5 different initialization seeds) are shown in Tables A to AR in S1 Appendix.

We next assessed how target predictions and uncertainty estimates depended on the degree of domain shift. Across datasets and splits, we compared the ranking performance of each method in terms of predictions relative to true values and uncertainty estimates relative to true errors (ESM in Fig 4 and one-hot encodings (OHE) in Fig C in S1 Appendix). The splits are ordered according to domain shift within their respective landscapes (lowest to highest shift from left to right). We observe that the rank correlation of the predictions to the true labels generally decreases moving from less to more domain shift within a landscape, consistent with expectation, with the exception of AAV/Sampled vs. Designed models performing better than AAV/7 vs. Rest models (Fig 4a). Most methods exhibit similar performance in Spearman rank correlations of predictions to targets (ρ) within the same task. For many tasks, GP and BRR models perform as well as or better than CNN models. Performance on Spearman rank correlations of uncertainties to prediction residuals (ρunc) is generally much worse than that on ρ, with some results showing negative correlation (Fig 4b). MVE and evidential uncertainty methods perform best on ρunc in most cases of low to moderate domain shift. Most methods have ρunc near zero for the most challenging splits. Despite the relatively good performance of MVE on tasks with low to moderate domain shift, it performs poorly in cases of high domain shift, which is consistent with its intended use as an estimator of aleatoric (data-dependent) uncertainty.

Fig 4. Spearman rank correlations.


Of (A) predictions (ρ) and (B) uncertainties (ρunc) vs. extrapolation. Within each landscape (AAV, Meltome, and GB1), splits are ordered by the amount of domain shift between train and test sets, with the lowest domain shift on the left and the highest domain shift on the right. Error bars on the CNN results represent the 95% confidence interval calculated from 5 different random seeds for initialization of the CNN parameters and batching / stochastic gradient descent. Fig C in S1 Appendix shows the corresponding results for the OHE representation. See the Uncertainty Methods section for an explanation of several points for which experiments were not feasible.

We find that the models trained on ESM embeddings outperform those trained on one-hot encodings in 21 out of 51 cases for rank correlation of test set predictions, and 29 out of 51 cases for rank correlation of test set uncertainties. The relative performance of the two representations on prediction and uncertainty rank correlation is shown in Fig D in S1 Appendix. In terms of predictions, ESM embeddings often yield substantially better performance for tasks with high domain shift (e.g. GB1/1 vs. Rest and Meltome/Random), while OHE performs slightly better on tasks with lower domain shift (e.g. AAV/Random and GB1/3 vs. Rest). The relative uncertainty rank correlation performance, on the other hand, does not have a clear relationship to domain shift.

Since there is no single best UQ method across datasets, splits, and metrics, it is prudent for practitioners to quantify the performance of uncertainty estimates on each new task and to prioritize metrics according to the situation (e.g. prioritize high coverage over low width in high-risk or safety-critical situations).

Active learning

In protein engineering, the purpose of uncertainty estimation is typically to intelligently prioritize sample acquisition to facilitate downstream experimentation. One such use case is active learning, where uncertainty estimates are used to inform sampling with the goal of improving model predictions overall (i.e., to achieve an accurate model with less training data; Fig 5a). Having assessed the calibration and accuracy of the panel of UQ methods above, we next evaluated whether uncertainty-based active learning could make the learning process more sample-efficient. Across all datasets and splits using the pretrained ESM embeddings, data acquisition was simulated as iterative selection from the data library according to a given sampling strategy (acquisition function; see Methods for details). The results are summarized in Fig 5 for Spearman rank correlation (ρ) on three methods and one split per landscape, and additional results are shown in Figs H to BH in S1 Appendix for other metrics, uncertainty methods, and splits. Across most models, the performance difference between the start of active learning (10% of training data) and end of active learning (100% of training data) is relatively small, and many models begin to plateau in performance before reaching 100% of training data. In addition to the active learning experiments run with 10% of the training data in the initial sample, we also show results starting with 1% and 5% of the training data in Figs F and G in S1 Appendix, respectively. These runs showed worse initial performance due to the smaller initial training sets, but otherwise similar trends to the trials starting at 10%.

Fig 5. Active learning.


(A) Schematic of active learning approach. A model is trained on an initial dataset, and is then retrained in each iteration by adding more points to the training set based on some selection criteria. (B-D) Uncertainty-guided active learning in protein sequence-function prediction. Spearman rank correlation of predictions (ρ) for the CNN ensemble, CNN evidential, and GP methods evaluated on the AAV/Random (B), Meltome/Random (C), and GB1/Random (D) splits. The “random” strategy acquired sequences with all unseen points having equal probabilities, the “explorative sample” strategy acquired sequences with random sampling weighted by uncertainty, and the “explorative greedy” strategy acquired the previously unseen sequences with the highest uncertainty. See the Uncertainty Methods section for an explanation of why GP experiments for the AAV landscape were not feasible.

The “explorative greedy” and “explorative sample” acquisition functions (which sample based on uncertainty alone or sample randomly weighted by uncertainty, respectively) sometimes outperform random sampling, but this is not true across all methods and landscapes (Fig 5b–5d). In some cases, the performance of the uncertainty-based sampling strategies also varies depending on the fraction of the total training data available to the model. For example, for the Meltome/Random split and CNN evidential model (Fig 5c), explorative greedy sampling results in a decrease in model performance after the first round of active learning, while the explorative sample strategy increases performance. By the fourth round of active learning for this task, the two explorative strategies outperform random sampling. This indicates that in the early stages of active learning, when a model’s uncertainty estimates are poorly calibrated, it may be advantageous to include at least some randomness in an uncertainty-based acquisition function. We also analyzed how the mean test set uncertainty changed as more data was acquired during active learning. Fig E in S1 Appendix illustrates that in some cases, the mean test set uncertainty decreased with increasing training data, while in other cases, it increased. While one might expect that adding more data from uncertain sequences would always decrease the mean uncertainty, this assumes that (1) the added training data comes from the same distribution as the test data, and (2) the uncertainty estimates are well-calibrated at the beginning and remain so after retraining with more data. Overall, the results indicate that uncertainty-informed active learning can outperform random sampling and thus lead to more accurate machine learning models with fewer training points needing to be measured (Fig 5b–5d).

Bayesian optimization

Uncertainty estimates can also be leveraged to identify top-performing sequences with as few samples as possible. The true objective can be approximated with a surrogate model, and we can use the predictions of this model as well as the uncertainties in these model predictions to guide a search toward higher or lower values of the true objective. This approach, referred to as Bayesian optimization, can also be represented by Fig 5a. Bayesian optimization methods use an acquisition function computed from the predicted mean and uncertainty values to trade off exploration and exploitation when choosing new sequences to sample, in contrast to the acquisition functions used in active learning that maximize exploration.

We compared two popular acquisition functions (upper-confidence bound (UCB) and Thompson sampling (TS)) against random and greedy baselines (see Methods for details). UCB and TS are intended to accelerate identification of top-performing instances over random and greedy baselines by taking into account both the predicted objective and the uncertainty in that prediction. Fig 6 shows the % of top-100 scores found versus the fraction of training data seen for three UQ methods and one split per landscape. Across these cases, the uncertainty-based methods almost always perform better than the random baseline but never outperform greedily sampling the sequences with the highest predicted values. In most cases, UCB sampling performs about the same as greedy sampling, with the notable exception of evidential uncertainty on the AAV/Random split (Fig 6a), for which it performed worse than random sampling. The performance of TS (a probabilistic method) was typically intermediate between greedy/UCB (deterministic methods) and random sampling. Overall, the uncertainty-based methods do not outperform greedy baselines, suggesting that better UQ methods are needed for protein engineering or that the landscapes studied here are simple enough that they can be optimized by pure exploitation.

Fig 6. Bayesian optimization.


(A-C) Bayesian optimization in protein sequence-function prediction. % of the top-100 scores in the training set found for the CNN ensemble, CNN evidential, and GP methods evaluated on the AAV/Random (A), Meltome/Random (B), and GB1/Random (C) splits. The “greedy” strategy acquired sequences with the best predicted property values. The “UCB” and “TS” strategies acquired sequences based on the upper confidence bound (UCB) and Thompson sampling (TS) approaches, respectively. The “random” strategy acquired sequences with all unseen points having equal probabilities. See the Uncertainty Methods section for an explanation of why GP experiments for the AAV landscape were not feasible. Note that in several plots, including the Gaussian process plots for Meltome and GB1 and the evidential plot for Meltome, the “greedy” strategy performance is nearly identical to and is covered by the “UCB” strategy.

Conclusions

Calibrated uncertainty estimations for ML predictions of biomolecular properties are necessary for effective model improvement using active learning or property optimization using Bayesian methods. In this work, we benchmarked a panel of uncertainty quantification (UQ) methods on protein datasets, including on train-test splits that are representative of real-world data collection practices. After evaluating each method based on accuracy, calibration, coverage, width, rank correlation, and performance in active learning and Bayesian optimization, we find that there is no method that performs consistently well across all metrics or all landscapes and splits.

We also examined how models trained using one-hot-encoding representations of sequences compare to those trained on more informative and generalizable representations, such as embeddings from a pretrained ESM language model. This comparison illustrated that while the pretrained embeddings do improve model accuracy and uncertainty correlation/calibration in some cases, particularly on splits with higher domain shift, this is not universally true, and in some cases the pretrained embeddings degrade performance.

While the UQ evaluation metrics used in this work provide valuable information, they are ultimately only a proxy for expected performance in Bayesian optimization and active learning. We found that UQ evaluation metrics are not well-correlated with gains in accuracy from one active learning iteration to another on these datasets. This suggests that future work in UQ should include retrospective Bayesian optimization and/or active learning studies rather than relying on UQ evaluation metrics alone. Our retrospective active learning studies using holdouts of the training sets demonstrate that many of the uncertainty methods outperform random sampling baselines. In some of our experiments, we observe that the uncertainty-based sampling strategies perform worse than random sampling during the earliest stages of active learning, then perform better as a model’s accuracy and quality of uncertainty estimates improve in later stages. Our Bayesian optimization experiments demonstrate that while uncertainty-based methods typically perform better than a random approach, including uncertainty in the acquisition function does not necessarily confer a benefit over a greedy approach that considers only property predictions. While previous work has successfully used BO to optimize proteins [24–27], it is not clear that uncertainty helped these campaigns because they do not compare directly to greedy sampling. Taken together, these results indicate that there is a need for further development of UQ methods and/or sampling strategies to improve protein engineering performance in AL and BO settings.

Future work in this area could expand on methods (e.g. Bayesian neural networks [28] and conformal prediction [29, 30]), metrics (e.g. sharpness [5], dispersion [31], and tightness [32]), and representations (e.g. ESM-2 [33] or using an attention layer rather than mean aggregation on our ESM-1b embeddings). In addition to further study of existing methods, future work should focus on designing novel UQ methods that give a clear performance benefit in AL and BO for protein engineering. Future work could also examine other sampling regimes for AL and BO, such as training an initial model on random data and sampling from designed sequences. While this work considered uncertainty predictions as directly output by the models, further study is needed to understand the effects of post-hoc calibration methods (e.g. scalar recalibration [31] or CRUDE [34]). Future work should consider additional active learning and Bayesian optimization strategies, such as those that consider batch diversity in the acquisition function [35], and methods that consider the desired domain shift. Ultimately, this work contributes to a more thorough understanding of the performance and utility of UQ for sequence-function models and provides a foundation for future work to enable more effective protein engineering.

Methods

Regression tasks

All tasks studied in this work are regression problems, in which we attempt to fit a model to a dataset of D data points (x_i, y_i), where x_i is a protein sequence representation (either a one-hot encoding or an embedding vector from an ESM language model) and y_i ∈ ℝ is a scalar-valued target property from the protein landscapes described in the Datasets section.

Datasets

The landscapes and splits in this work are taken from the FLIP benchmark [3]. GB1 is a landscape commonly used for investigating epistasis (interactions between mutations) using the binding domain of protein G, an immunoglobulin binding protein in Streptococcal bacteria. These splits are designed primarily to test generalization from few- to many-mutation sequences. The AAV landscape is based on data collected for the adeno-associated virus capsid protein, which helps the virus integrate a DNA payload into a target cell. The mutations in this landscape are restricted to a subset of positions within a much longer sequence. The Meltome landscape includes data from proteins across 13 different species for a non-protein-specific property (thermostability), so it includes both local and global variations. The total numbers of data points in the GB1, AAV, and Meltome sets are 8,733, 284,009, and 27,951, respectively. Of the AAV sequences, 82,583 are sampled (mutant) sequences and 201,426 are designed sequences. For AAV, only the 82,583 sampled sequences are used for the Random and 7 vs. Rest tasks, while all 284,009 are used for the Sampled vs. Designed task.

The names of several of the tasks were changed slightly from the original FLIP nomenclature for clarity: GB1/Random was originally called GB1/Sampled, AAV/Random was originally called AAV/Sampled, AAV/7 vs. Rest was originally called AAV/7 vs. Many, AAV/Sampled vs. Designed was originally called AAV/Mut-Des, and Meltome/Random was originally called Meltome/Mixed.

ESM embeddings

We used the pretrained, 650M-parameter ESM-1b model (esm1b_t33_650M_UR50S) from [15] to generate embeddings of the protein sequences in this study and to compare these embeddings to one-hot encoding representations. Sequence embeddings from the final representation layer (layer 33) were mean-pooled over the length of each protein sequence, which resulted in a fixed embedding size of 1280 for each sequence. In other words, the output of the ESM-1b model for a sequence of length L is a tensor of size L × 1280, and we averaged over the length dimension to obtain a representation vector of size 1280 for each sample.
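As an illustration, the sketch below shows how per-sequence embeddings of this form could be generated with the fair-esm package; the example sequence and variable names are placeholders, and the exact preprocessing used in the paper may differ.

```python
import torch
import esm

# Load the pretrained ESM-1b model and its alphabet/tokenizer
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)
token_reps = out["representations"][33]            # shape: (1, L+2, 1280), incl. BOS/EOS tokens

seq_len = len(data[0][1])
embedding = token_reps[0, 1:seq_len + 1].mean(0)   # mean-pool over residues -> vector of size 1280
```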

Base CNN model architectures

The base architecture of all CNN models in this work was taken from the CNNs in the FLIP benchmark [3], which took the architecture from previous work [36]. For the one-hot encoding inputs (with a vocabulary of 22 tokens), this consisted of a convolution with 1024 output channels and kernel width 5, a ReLU non-linear activation function, a linear mapping to 2048 dimensions, a max pool over the sequence, and a linear mapping to 1 dimension. For ESM embedding inputs (of size 1280), the architecture was the same except with 1280 input channels rather than 1024, and a linear mapping to 2560 dimensions rather than 2048.
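A minimal PyTorch sketch of this architecture (one-hot variant) is given below; the class name, argument names, and details such as padding are our assumptions rather than the exact FLIP code. For ESM inputs, one would set in_channels=1280 and hidden=2560 per the description above.

```python
import torch
import torch.nn as nn

class BaseCNN(nn.Module):
    """Sketch of the FLIP-style CNN regression head (one-hot input variant)."""
    def __init__(self, in_channels=22, conv_channels=1024, hidden=2048):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, conv_channels, kernel_size=5, padding=2)
        self.relu = nn.ReLU()
        self.linear1 = nn.Linear(conv_channels, hidden)
        self.linear2 = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, in_channels, seq_len)
        h = self.relu(self.conv(x))            # (batch, conv_channels, seq_len)
        h = self.linear1(h.transpose(1, 2))    # (batch, seq_len, hidden)
        h = h.max(dim=1).values                # max pool over the sequence
        return self.linear2(h).squeeze(-1)     # (batch,)
```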

CNN Model training procedures

To train our CNN models, we used a batch size of 256 (GB1, AAV) or 30 (Meltome). Adam [37] was used for optimization with the following learning rates: 0.001 for the convolution weights, 0.00005 for the first linear mapping, and 0.000005 for the second linear mapping. Weight decay was set to 0.05 for both the first and second linear mappings. CNNs were trained with early stopping using a patience of 20 epochs. Each model was trained on an NVIDIA Volta V100 GPU. Reported metrics in Figs 2–4 are the average of training 5 models per split with different seeds for initialization of the CNN parameters and batching / stochastic gradient descent. Code, data, and instructions needed to reproduce results can be found at https://github.com/microsoft/protein-uq.
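For concreteness, these per-layer learning rates and weight decays could be expressed as Adam parameter groups as in the sketch below, using the hypothetical BaseCNN attribute names from the previous sketch (not the exact training code).

```python
import torch

model = BaseCNN()  # hypothetical module from the sketch above

optimizer = torch.optim.Adam([
    {"params": model.conv.parameters(),    "lr": 1e-3},                        # convolution weights
    {"params": model.linear1.parameters(), "lr": 5e-5, "weight_decay": 0.05},  # first linear mapping
    {"params": model.linear2.parameters(), "lr": 5e-6, "weight_decay": 0.05},  # second linear mapping
])
```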

Uncertainty methods

For all models and landscapes, the sequences were featurized using either one-hot encodings or embeddings from a pretrained language model (see the ESM Embeddings section).

We used the scikit-learn [38] implementation of Bayesian ridge regression (BRR) with default hyperparameters. BRR for one-hot encodings of the Meltome/Random split was not feasible because the required work array was too large to perform the computation with standard 32-bit LAPACK in scipy.
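As a usage sketch, scikit-learn's BayesianRidge returns both a predictive mean and a standard deviation; the toy data below stands in for the real feature matrices and fitness labels.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Toy stand-in data; in practice X would be one-hot or ESM features and y the fitness labels
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 16)), rng.normal(size=100)
X_test = rng.normal(size=(10, 16))

brr = BayesianRidge()                                   # default hyperparameters
brr.fit(X_train, y_train)
y_mean, y_std = brr.predict(X_test, return_std=True)    # predictive mean and standard deviation
```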

For Gaussian processes (GPs), we used the GPyTorch [39] implementation with the constant mean module, scaled rational quadratic (RQ) kernel covariance module, and Gaussian likelihood. Some GP models (for AAV one-hot encodings and ESM embeddings, and Meltome one-hot encodings) were not feasible to train due to GPU-memory requirements for exact GP models, so these are omitted from the results.
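A minimal GPyTorch configuration matching this description (constant mean, scaled rational quadratic kernel, Gaussian likelihood) might look like the following sketch; the class name is ours, and the training loop and data tensors are omitted.

```python
import gpytorch

class ExactGPRegressor(gpytorch.models.ExactGP):
    """Sketch: exact GP with a constant mean and a scaled rational quadratic kernel."""
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RQKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
# model = ExactGPRegressor(train_x, train_y, likelihood)  # train_x, train_y: torch tensors (assumed)
```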

For our uncertainty methods that rely on sampling (dropout, ensemble, and SVI), the final model prediction is defined as the mean of the set of inference samples, and the uncertainty is the standard deviation of these samples. In other words, for a set of predictions E = {G_1(x), G_2(x), …, G_n(x)} (each coming from an individual model G_i), the final prediction is defined as

\hat{G}(x) = \frac{1}{n} \sum_{G \in E} G(x) \quad (1)

and the uncertainty U(x) is defined as

U(x) = \sqrt{\frac{1}{n} \sum_{G \in E} \left( \hat{G}(x) - G(x) \right)^2} \quad (2)

The uncertainty is sometimes defined as the variance U²(x), but using the standard deviation puts the uncertainty in the same units as the predictions.

For dropout uncertainty [19], a single model G was trained normally. At inference time, we applied n = 10 random dropout masks with dropout probability p to obtain the set of predictions E for each input xi. We tested dropout rates of p ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and reported the model with the lowest miscalibration area.
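A sketch of this Monte Carlo dropout procedure, together with the mean/standard-deviation aggregation of Eqs 1–2, is shown below; the helper name and the assumption that dropout is implemented with torch.nn.Dropout modules are ours.

```python
import torch

def mc_dropout_predict(model, x, n_samples=10):
    """Monte Carlo dropout: keep dropout layers stochastic at inference time."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()                                   # re-enable dropout masks at inference
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    prediction = samples.mean(dim=0)                         # Eq 1
    uncertainty = samples.std(dim=0, unbiased=False)         # Eq 2 (population standard deviation)
    return prediction, uncertainty
```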

Similarly, for last-layer stochastic variational inference (SVI) [23], we obtained E using n = 10 samples from a set of models in which the weight and bias terms of each G_i’s last layer are themselves sampled from a distribution q(θ) trained to approximate the true posterior p(θ|D).

Traditional model ensembling calculated E using n = 5 models trained using different random seeds for initialization of the CNN parameters and batching / stochastic gradient descent. The computational cost of this approach is 5 times that of a standard CNN model since the cost scales linearly with the size of the ensemble.

In mean-variance estimation (MVE) models, we adapt the base CNN architecture to produce 2 outputs (θ = {μ, σ²}) for each data point (x_i, y_i) in the last layer rather than 1, and we train using the negative log-likelihood loss:

\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{(y_i - \mu(x_i))^2}{2\sigma^2(x_i)} + \frac{1}{2} \log\!\left( 2\pi\sigma^2(x_i) \right) \right] \quad (3)

In practice, the variance (σ²) is clamped to a minimum value of 10⁻⁶ to prevent division by 0.
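A compact PyTorch sketch of this loss (Eq 3), including the variance clamp, is shown below; it assumes the model outputs mean and variance tensors of the same shape as the targets.

```python
import math
import torch

def mve_nll_loss(mu, var, y, min_var=1e-6):
    """Gaussian negative log-likelihood for mean-variance estimation (Eq 3)."""
    var = var.clamp(min=min_var)                                      # prevent division by zero
    nll = (y - mu) ** 2 / (2 * var) + 0.5 * torch.log(2 * math.pi * var)
    return nll.mean()
```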

Evidential deep learning modifies the loss function of the traditional CNN to jointly maximize the model’s fit to data while also minimizing its evidence on errors (increasing uncertainty on unreliable predictions) [21]:

\mathcal{L}(x) = \mathcal{L}_{\mathrm{NLL}}(x) + \lambda \mathcal{L}_{\mathrm{R}}(x) \quad (4)

where \mathcal{L}_{\mathrm{NLL}}(x) is the negative log-likelihood loss defined above, \mathcal{L}_{\mathrm{R}}(x) is the evidence regularizer as defined in Amini et al. [21], and λ controls the trade-off between these two terms. In this study, we use λ = 1 for all evidential models. In these models, the last layer of the model produces 4 outputs m = {γ, ν, α, β} that parameterize the Normal-Inverse-Gamma distribution. This distribution assumes that targets y_i are drawn i.i.d. from a Gaussian distribution with unknown mean and variance θ = {μ, σ²}, where the mean is drawn from a Gaussian and the variance is drawn from an Inverse-Gamma distribution. The output of the evidential model can be divided into the prediction and the epistemic (model) and aleatoric (data) uncertainty components following the analysis of Amini et al. [21]:

\underbrace{\mathbb{E}[\mu] = \gamma}_{\text{prediction}}, \qquad \underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}}_{\text{aleatoric}}, \qquad \underbrace{\mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}}_{\text{epistemic}} \quad (5)

We report the sum of the aleatoric and epistemic uncertainties as the total uncertainty.
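For illustration, the decomposition in Eq 5 and the total uncertainty reported here can be written as the following sketch (the inputs are the four evidential outputs, as tensors or arrays; the function name is ours).

```python
def evidential_uncertainty(gamma, nu, alpha, beta):
    """Split Normal-Inverse-Gamma outputs into prediction and uncertainties (Eq 5)."""
    prediction = gamma                           # E[mu]
    aleatoric = beta / (alpha - 1.0)             # E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))      # Var[mu]
    total = aleatoric + epistemic                # sum reported as the total uncertainty
    return prediction, total
```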

Evaluation metrics

To give a comprehensive report of model accuracy, we computed the following metrics on the test sets: root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R²), and Spearman rank correlation (ρ). RMSE is more sensitive to outliers than MAE, so while both are informative independently, the combination of the two gives additional information about the distribution of errors. R² and ρ are both unitless and are thus more easily interpreted and compared across datasets.

We evaluated the quality of the uncertainty estimates using four metrics. First, ρunc is the Spearman rank correlation between uncertainty and absolute prediction error. This metric may be particularly relevant in an active learning context, where one wants to acquire labels for the most uncertain points hoping that these are also the highest-error points. This application does not require the uncertainties to be well-calibrated.

Second, the miscalibration area (also called the area under the calibration error curve or AUCE) quantifies the absolute difference between the calibration plot and perfect calibration in a single number [40]. Good calibration may be more important in safety-critical applications.

Following Kompa et al. [41], we measured the coverage as the percentage of true values that fall within the 95% confidence interval (±2σ) of each prediction. This is an indication of the reliability of the uncertainty estimates. A model with high coverage is appropriately cautious in its predictions, which may be most important in applications where safety is a major consideration.

Kompa et al. [41] also define another metric, the width, as the size of the 95% confidence region (4σ). We normalized this width relative to the range (R) of the training set as 4σ/R to make these values unitless and thus more interpretable across datasets. The width is a measure of precision in the uncertainty. In practical applications, narrower intervals (lower width) can help in making more precise and cost-effective decisions. Ideal uncertainties have high coverage and low width, but in some cases, there may be trade-offs between the two. For example, wider widths can help to detect distribution shift, but these wider intervals may not be reliable if coverage is low. The coverage and width metrics may also be more easily interpretable than other calibration metrics [41].
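The following sketch illustrates how the coverage, normalized width, and uncertainty rank-correlation metrics described above could be computed; the function name and the train_range argument are ours.

```python
import numpy as np
from scipy.stats import spearmanr

def uq_metrics(y_true, y_pred, y_std, train_range):
    """Coverage, normalized width, and rank correlation of uncertainty vs. absolute error."""
    rho_unc = spearmanr(y_std, np.abs(y_true - y_pred)).correlation
    lower, upper = y_pred - 2 * y_std, y_pred + 2 * y_std
    coverage = np.mean((y_true >= lower) & (y_true <= upper))   # fraction inside the ~95% interval
    width = np.mean(4 * y_std) / train_range                    # average width relative to range R
    return {"rho_unc": rho_unc, "coverage": coverage, "width": width}
```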

Active learning

Each active learning run began with a random sample of 1%, 5%, or 10% of the full training data, which was taken from the random splits of the three landscapes. We evaluated several alternatives for adding to this initial dataset using different sampling strategies (acquisition functions): explorative greedy, explorative sample, and random. “Explorative greedy” sampled the sequences with the highest uncertainty; “explorative sample” sampled sequences with probability proportional to their uncertainty (i.e., each sequence’s sampling probability is the ratio of its uncertainty to the sum of all uncertainties in the dataset); and “random” sampled uniformly from all unobserved sequences. We employed these sampling strategies 5 times in each active learning run, with the 5 training set sizes equally spaced on a log scale. We repeated this process using 3 folds (different random seeds for sampling the initial dataset and the “explorative sample” probabilities) and calculated the mean and standard deviation across these folds.
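As an illustration (function and variable names are ours, not the repository's), the three acquisition strategies could be implemented as follows, where uncertainties are aligned with the indices of the unobserved sequences.

```python
import numpy as np

def acquire(unseen_idx, uncertainties, batch_size, strategy, rng=np.random.default_rng(0)):
    """Sketch of the active-learning acquisition strategies described above."""
    unseen_idx = np.asarray(unseen_idx)
    if strategy == "random":
        return rng.choice(unseen_idx, size=batch_size, replace=False)
    if strategy == "explorative_greedy":
        order = np.argsort(uncertainties)[::-1]          # most uncertain first
        return unseen_idx[order[:batch_size]]
    if strategy == "explorative_sample":
        p = uncertainties / uncertainties.sum()          # random sampling weighted by uncertainty
        return rng.choice(unseen_idx, size=batch_size, replace=False, p=p)
    raise ValueError(f"unknown strategy: {strategy}")
```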

Bayesian optimization

Similarly to active learning, each Bayesian optimization run began with a random sample of 10% of the full training data, which was taken from the random splits of the three landscapes. We used the following acquisition functions: greedy, upper-confidence bound (UCB) [42], Thompson sampling (TS) [43], and random. In greedy sampling, the sequence with the best predicted value was selected. The UCB strategy added the predicted uncertainties to the predicted values and selected the sequence with the largest sum. For TS, we added each predicted value to a number sampled randomly from a Gaussian distribution with a mean of 0 and a standard deviation of the corresponding predicted uncertainty, and again selected the largest sum. The “random” strategy used in Bayesian optimization was the same as that used in active learning (new points were sampled with uniform probability). As with active learning, we used these strategies 5 times in each run, with the 5 training set sizes equally spaced on a log scale. We report the mean and standard deviation across 3 folds (different random seeds for sampling the initial dataset).
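A corresponding sketch of the Bayesian optimization acquisition functions is given below (assuming higher property values are better; names are ours).

```python
import numpy as np

def select_next(unseen_idx, y_mean, y_std, strategy, rng=np.random.default_rng(0)):
    """Sketch of the greedy, UCB, Thompson sampling, and random acquisition strategies."""
    unseen_idx = np.asarray(unseen_idx)
    if strategy == "random":
        return rng.choice(unseen_idx)
    if strategy == "greedy":
        score = y_mean                                   # predicted value only
    elif strategy == "ucb":
        score = y_mean + y_std                           # predicted value plus uncertainty
    elif strategy == "ts":
        score = y_mean + rng.normal(0.0, y_std)          # Gaussian sample per candidate
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return unseen_idx[np.argmax(score)]
```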

Supporting information

S1 Appendix. Supporting information.

Code availability, OHE results, OHE vs. ESM comparison, additional prediction and uncertainty evaluation metrics, and additional active learning results.

(PDF)


Acknowledgments

The authors thank the MIT Lincoln Laboratory Supercloud cluster [44] at the Massachusetts Green High Performance Computing Center (MGHPCC) for providing high-performance computing resources to train our machine learning models.

Data Availability

The code for the models, uncertainty methods, and evaluation metrics in this work is available at https://github.com/microsoft/protein-uq and archived at https://zenodo.org/doi/10.5281/zenodo.7839141.

Funding Statement

K.P.G. was supported by a Microsoft Research (https://www.microsoft.com/en-us/research/) micro-internship and by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1745302. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Yang KK, Wu Z, Arnold FH. Machine-learning-guided directed evolution for protein engineering. Nature Methods. 2019;16(8):687–694. doi: 10.1038/s41592-019-0496-6 [DOI] [PubMed] [Google Scholar]
  • 2.Kendall A, Gal Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc.; 2017. Available from: https://proceedings.neurips.cc/paper_files/paper/2017/file/2650d6089a6d640c5e85b2b88265dc2b-Paper.pdf.
  • 3.Dallago C, Mou J, Johnston KE, Wittmann BJ, Bhattacharya N, Goldman S, et al. FLIP: Benchmark tasks in fitness landscape inference for proteins; bioRxiv [Preprint]. 2021. Available from: https://www.biorxiv.org/content/10.1101/2021.11.09.467890v2.
  • 4. Scalia G, Grambow CA, Pernici B, Li YP, Green WH. Evaluating scalable uncertainty estimation methods for deep learning-based molecular property prediction. Journal of Chemical Information and Modeling. 2020;60(6):2697–2717. doi: 10.1021/acs.jcim.9b00975 [DOI] [PubMed] [Google Scholar]
  • 5. Tran K, Neiswanger W, Yoon J, Zhang Q, Xing E, Ulissi ZW. Methods for comparing uncertainty quantifications for material property predictions. Machine Learning: Science and Technology. 2020;1(2):025006. doi: 10.1088/2632-2153/ab7e1a [DOI] [Google Scholar]
  • 6. Hirschfeld L, Swanson K, Yang K, Barzilay R, Coley CW. Uncertainty quantification using neural networks for molecular property prediction. Journal of Chemical Information and Modeling. 2020;60(8):3770–3780. doi: 10.1021/acs.jcim.0c00502 [DOI] [PubMed] [Google Scholar]
  • 7. Nigam A, Pollice R, Hurley MF, Hickman RJ, Aldeghi M, Yoshikawa N, et al. Assigning confidence to molecular property prediction. Expert Opinion on Drug Discovery. 2021;16(9):1009–1023. doi: 10.1080/17460441.2021.1925247 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Soleimany AP, Amini A, Goldman S, Rus D, Bhatia SN, Coley CW. Evidential deep learning for guided molecular property prediction and discovery. ACS Central Science. 2021;7(8):1356–1367. doi: 10.1021/acscentsci.1c00546 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Gruich CJ, Madhavan V, Wang Y, Goldsmith BR. Clarifying trust of materials property predictions using neural networks with distribution-specific uncertainty quantification. Machine Learning: Science and Technology. 2023;4(2):025019. [Google Scholar]
  • 10.Mariet Z, Jerfel G, Wang Z, Angermüller C, Belanger D, Vora S, et al. Deep Uncertainty and the Search for Proteins. In: NeurIPS Workshop: Machine Learning for Molecules; 2020. Available from: https://ml4molecules.github.io/papers2020/ML4Molecules_2020_paper_23.pdf.
  • 11. Hie B, Bryson BD, Berger B. Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design. Cell Systems. 2020;11(5):461–477.e9. doi: 10.1016/j.cels.2020.09.007 [DOI] [PubMed] [Google Scholar]
  • 12. Parkinson J, Wang W. Linear-Scaling kernels for protein sequences and small molecules outperform deep learning while providing uncertainty quantitation and improved interpretability. Journal of Chemical Information and Modeling. 2023;63(15):4589–4601. doi: 10.1021/acs.jcim.3c00601 [DOI] [PubMed] [Google Scholar]
  • 13.Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; arXiv:1810.04805 [Preprint]. 2019. Available from: https://arxiv.org/abs/1810.04805.
  • 14.Gruver N, Stanton S, Kirichenko P, Finzi M, Maffettone P, Myers V, et al. Effective surrogate models for protein design with bayesian optimization. In: ICML Workshop on Computational Biology; 2021. Available from: https://icml-compbio.github.io/2021/papers/WCBICML2021_paper_61.pdf.
  • 15. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences. 2021;118(15):e2016239118. doi: 10.1073/pnas.2016239118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. MacKay DJ. Bayesian interpolation. Neural Computation. 1992;4(3):415–447. doi: 10.1162/neco.1992.4.3.415 [DOI] [Google Scholar]
  • 17. Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1(Jun):211–244. [Google Scholar]
  • 18.Williams CK, Rasmussen CE. Gaussian Processes for Machine Learning. vol. 2. MIT Press Cambridge, MA; 2006.
  • 19.Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: Balcan MF, Weinberger KQ, editors. Proceedings of The 33rd International Conference on Machine Learning. vol. 48 of Proceedings of Machine Learning Research. New York, New York, USA: PMLR; 2016. p. 1050–1059. Available from: https://proceedings.mlr.press/v48/gal16.html.
  • 20.Lakshminarayanan B, Pritzel A, Blundell C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc.; 2017. Available from: https://proceedings.neurips.cc/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf.
  • 21.Amini A, Schwarting W, Soleimany A, Rus D. Deep Evidential Regression. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in Neural Information Processing Systems. vol. 33. Curran Associates, Inc.; 2020. p. 14927–14937. Available from: https://proceedings.neurips.cc/paper/2020/file/aab085461de182608ee9f607f3f7d18f-Paper.pdf.
  • 22.Nix DA, Weigend AS. Estimating the mean and variance of the target probability distribution. In: Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94). vol. 1. IEEE; 1994. p. 55–60.
  • 23. Hoffman MD, Blei DM, Wang C, Paisley J. Stochastic variational inference. Journal of Machine Learning Research. 2013. [Google Scholar]
  • 24. Romero PA, Krause A, Arnold FH. Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences. 2013;110(3):E193–E201. doi: 10.1073/pnas.1215251110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Bedbrook CN, Yang KK, Rice AJ, Gradinaru V, Arnold FH. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLOS Computational Biology. 2017;13(10):e1005786. doi: 10.1371/journal.pcbi.1005786 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Bedbrook CN, Yang KK, Robinson JE, Mackey ED, Gradinaru V, Arnold FH. Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics. Nature Methods. 2019;16(11):1176–1184. doi: 10.1038/s41592-019-0583-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Greenhalgh JC, Fahlberg SA, Pfleger BF, Romero PA. Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production. Nature Communications. 2021;12(1):5825. doi: 10.1038/s41467-021-25831-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Neal RM. Bayesian learning for neural networks. vol. 118. Springer Science & Business Media; 2012. [Google Scholar]
  • 29. Norinder U, Carlsson L, Boyer S, Eklund M. Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. Journal of Chemical Information and Modeling. 2014;54(6):1596–1603. doi: 10.1021/ci5001168 [DOI] [PubMed] [Google Scholar]
  • 30. Fannjiang C, Bates S, Angelopoulos AN, Listgarten J, Jordan MI. Conformal prediction under feedback covariate shift for biomolecular design. Proceedings of the National Academy of Sciences. 2022;119(43):e2204569119. doi: 10.1073/pnas.2204569119 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Levi D, Gispan L, Giladi N, Fetaya E. Evaluating and calibrating uncertainty prediction in regression tasks. Sensors. 2022;22(15):5540. doi: 10.3390/s22155540 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 2007;102(477):359–378. doi: 10.1198/016214506000001437 [DOI] [Google Scholar]
  • 33. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–1130. doi: 10.1126/science.ade2574 [DOI] [PubMed] [Google Scholar]
  • 34.Zelikman E, Healy C, Zhou S, Avati A. CRUDE: Calibrating Regression Uncertainty Distributions Empirically; arXiv:2005.12496 [Preprint]. 2021. Available from: https://arxiv.org/abs/2005.12496.
  • 35.Kirsch A, van Amersfoort J, Gal Y. BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems. vol. 32. Curran Associates, Inc.; 2019. Available from: https://proceedings.neurips.cc/paper_files/paper/2019/file/95323660ed2124450caaac2c46b5ed90-Paper.pdf.
  • 36.Shanehsazzadeh A, Belanger D, Dohan D. Is Transfer Learning Necessary for Protein Landscape Prediction?; arXiv:2011.03443 [Preprint]. 2020. Available from: https://arxiv.org/abs/2011.03443.
  • 37.Kingma DP, Ba J. Adam: A Method for Stochastic Optimization; arXiv:1412.6980 [Preprint]. 2017. Available from: https://arxiv.org/abs/1412.6980.
  • 38. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830. [Google Scholar]
  • 39.Gardner J, Pleiss G, Weinberger KQ, Bindel D, Wilson AG. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems. vol. 31. Curran Associates, Inc.; 2018. Available from: https://proceedings.neurips.cc/paper_files/paper/2018/file/27e8e17134dd7083b050476733207ea1-Paper.pdf.
  • 40.Gustafsson FK, Danelljan M, Schon TB. Evaluating scalable bayesian deep learning methods for robust computer vision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops; 2020. p. 318–319.
  • 41. Kompa B, Snoek J, Beam AL. Empirical Frequentist Coverage of Deep Learning Uncertainty Quantification Procedures. Entropy. 2021;23(12). doi: 10.3390/e23121608 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Srinivas N, Krause A, Kakade SM, Seeger MW. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting. IEEE Transactions on Information Theory. 2012;58(5):3250–3265. doi: 10.1109/TIT.2011.2182033 [DOI] [Google Scholar]
  • 43.Chapelle O, Li L. An Empirical Evaluation of Thompson Sampling. In: Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, editors. Advances in Neural Information Processing Systems. vol. 24. Curran Associates, Inc.; 2011. Available from: https://proceedings.neurips.cc/paper_files/paper/2011/file/e53a0a2978c28872a4505bdb51db06dc-Paper.pdf.
  • 44.Reuther A, Kepner J, Byun C, Samsi S, Arcand W, Bestor D, et al. Interactive supercomputing on 40,000 cores for machine learning and data analysis. In: 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE; 2018. p. 1–6.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012639.r001

Decision Letter 0

Nir Ben-Tal, Rachel Kolodny

11 Feb 2024

Dear Yang,

Thank you very much for submitting your manuscript "Benchmarking uncertainty quantification for protein engineering" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, we apologize for the long time it took, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Rachel Kolodny

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper is about quantifying the ability of different protein sequence-based models to accurately predict uncertainty in different protein design data regimes and modalities. The authors tried different sequence representations, model architectures, and uncertainty quantification methods across FLIP benchmark tasks. Although the amount of experiments involved in this work is impressive, the paper is unfortunately written in a way that mostly lists those dense results, making it difficult to get any insight or to grasp the relationships between them. Moreover, some of the paper's claims could benefit from better support from statistical analysis.

**Major comments**

The paper is results-dense and would benefit from some curation of metrics and/or models that would make it easier to ingest. For example, I suspect that RMSE and correlation are highly correlated as well as calibration plot and uncertainty correlation, but they both have their own distinct figures. Some of these results could be moved to supplementary.

The figures are hard to read or interpret: they tend to contain too much information and do not use the space efficiently, with the same axes and legends repeated until they take up most of the panel area. For example, the bar plots in Figure 4 are hard to read, especially in panel B, where bars for models with close to zero correlation appear to be missing.

Can the authors comment on the apparent strong relationship between the coverage and width metrics for the same task across different models? It suggests that the models are not quantifying uncertainty better on a per-point basis, but are simply more or less confident overall. It also seems that a simple re-scaling of the uncertainties on a calibration set held out from the training data would make those trends and claims vanish.
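
For concreteness, the re-scaling alluded to above could be a post-hoc standard-deviation scaling fit on a held-out calibration set, roughly as in the sketch below; the function names, the single scalar scale factor, and the array variables are illustrative assumptions rather than anything described in the manuscript.

```python
import numpy as np

def fit_std_scale(y_cal, mu_cal, sigma_cal):
    """Fit a single scalar s on a held-out calibration set so that the
    rescaled uncertainties s * sigma match the observed errors on average
    (i.e., the mean squared z-score becomes 1)."""
    z_squared = ((y_cal - mu_cal) / sigma_cal) ** 2
    return float(np.sqrt(np.mean(z_squared)))

def apply_std_scale(sigma, scale):
    """Rescale predictive standard deviations with the fitted factor."""
    return scale * sigma

# Hypothetical usage, assuming arrays of calibration labels (y_cal),
# predictive means (mu_cal), and predictive stds (sigma_cal):
#   scale = fit_std_scale(y_cal, mu_cal, sigma_cal)
#   sigma_test_recalibrated = apply_std_scale(sigma_test, scale)
```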

Claims that some models are better than others on a task are hard to support without any statistical testing. The fact that different starting seeds for training the CNN model give similar performance across Figure 4 suggests that some of the trends found in previous figures could also vanish. Also, the active learning experiments with the random acquisition function seem to indicate that a simple re-sampling of the training data could be used to better estimate confidence intervals on the metrics and to support claims about relative model performance. Can the authors re-sample their training data to generate different training sets and obtain better estimates of their metrics? Can they also choose at least 3-4 different starting points in BO and active learning so that the starting 10% does not dictate the conclusions?
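
A minimal sketch of the resampling suggested above, assuming NumPy arrays and a hypothetical `train_and_score` callback that retrains a model on the resampled data and returns a scalar test-set metric:

```python
import numpy as np

def bootstrap_metric_ci(X_train, y_train, train_and_score, n_boot=20,
                        alpha=0.05, seed=0):
    """Resample the training set with replacement, retrain each time, and
    report the mean metric with an empirical (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # bootstrap resample of training indices
        scores.append(train_and_score(X_train[idx], y_train[idx]))
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(scores)), (float(lower), float(upper))
```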

Can the authors comment on the expected relationship between the different metrics? It is not clear why the miscalibration curve should be plotted against the RMSE.

**Minor comments**

The authors do cite their uncertainty metrics, but given how central these metrics are to the paper, can they describe them in more detail?

I might have missed these details, but what data are used for active learning and BO? Are they the designed AAV sequences or just the random ones? Would it make more sense to start from the sampled sequences and then sample from the designed ones? (Or perhaps that is already the case.)

In the active learning experiments, do the models become more confident as they acquire more data?

A summary table of the different rankings of selected models across tasks would be helpful since the paper has so many results that it is hard to keep track of performance across different regimes and modalities. Ideally, these rankings would be derived from robust statistical testing.

Some curves are not visible in Figure 6 since they are hidden by others.

Reviewer #2: This paper considers the problem of uncertainty prediction for protein engineering. The paper reports extensive experimentation with the goal of evaluating different types of predictors and, specifically, their uncertainty predictions. The experiments are conducted on three different types of tasks with varying degrees of domain shift between the training and test data. The uncertainty predictions are evaluated using different metrics and through two downstream applications, namely active learning and Bayesian optimization. The results of the experiments show that:

1. There is no single method that performs better across all tasks.

2. Current uncertainty predictors are not always useful for downstream tasks like active learning (where they are useful in some cases), and Bayesian optimization (where a greedy, uncertainty-agnostic approach performs better).

3. Representing proteins using ESM language model embeddings rather than one-hot encoding improves results in some cases but not all.

The main conclusion is that uncertainty predictors cannot be assumed to work out of the box for protein engineering tasks and need to be further developed, and/or carefully evaluated per task.

Strengths:

1. The paper performs an extensive and thorough evaluation of the methods under varying conditions, and gives both a detailed description of each setup and a big-picture view of the status of current methods.

2. The conclusion is a useful and practical contribution to the community, which often considers relying on such predictions of uncertainty.

Weaknesses:

1. The paper does not present a novel model or evaluation methodology and is a relatively straightforward implementation of various experiments. In my view, this should not prevent the paper from being accepted, as the experimentation is thorough and therefore valuable to the community.

2. The results do not show a clear winning method that can directly inform practitioners on ways to improve their research on downstream tasks. However, as I mentioned above, there is value in empirically demonstrating the limitations of current models, which should serve as a warning for practitioners of downstream tasks and an invitation to researchers to perform more research on uncertainty prediction.

3. There are a few issues that were not clear to me; see the questions below. I believe that fixing those issues would make the paper publishable.

Questions:

1. The representation and CNN architecture are not clear to me. What dimension exactly is being averaged in the ESM embeddings? In the end, what is the dimension of the protein representation, both for ESM and for one-hot encoding? Is it constant or does it vary with sequence length? Why do you need a CNN to process this representation, as opposed to a fully connected MLP? Is there some local invariance or smoothness property that should be captured by convolutions? If so, over what dimension? (A small sketch illustrating the two representation choices appears after this list of questions.)

2. It is not clear from the description in Section 4.6 whether all methods use the same number of samples in evaluation. If this is the case, it should be stated; if not, it should be discussed and justified.

3. Some methods were not fully described; for example, it was mentioned that the evidential CNN uses a loss termed L^R, without elaborating.

4. Section 4.7 is confusing in listing the different metrics. First, it describes MAE and R^2 metrics, which I failed to see in any of the experiment results. Second, it states that there are four uncertainty metrics, but I could only count three (ρ_unc, coverage, AUCE). Third, the description of the coverage states that the range is 4σ/R rather than 4σ; however, in Figure 3 these are shown on different axes (coverage vs. width/R), so I am not sure what is going on there.
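
As a point of reference for Question 1 above, the sketch below illustrates the two kinds of fixed-length representations under discussion: a zero-padded one-hot encoding flattened to a fixed-size vector, and a per-residue embedding matrix mean-pooled over the length dimension. The alphabet, maximum length, and embedding dimension (1280) are generic illustrations, not a statement of the authors' exact ESM pipeline.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_flat(seq, max_len):
    """Zero-padded one-hot encoding flattened to a fixed-length vector of
    size max_len * 20, independent of the actual sequence length."""
    x = np.zeros((max_len, len(AMINO_ACIDS)))
    for i, aa in enumerate(seq[:max_len]):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.reshape(-1)

def mean_pool(token_embeddings):
    """Average a per-residue embedding matrix of shape (L, d) over the
    length axis L, giving a fixed-size vector of dimension d for any L."""
    return token_embeddings.mean(axis=0)

# Example: a length-7 peptide with a hypothetical d=1280 per-residue embedding.
seq = "MKTAYIA"
ohe = one_hot_flat(seq, max_len=10)               # shape (200,)
emb = mean_pool(np.random.randn(len(seq), 1280))  # shape (1280,)
```

With max_len = 10 and a 20-letter alphabet, the one-hot vector has length 200 regardless of the peptide's actual length, while the pooled embedding always has length 1280.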

Some general remarks:

1. The fact that the linear model and GP perform better than the CNN suggests that maybe there is not enough training data for a deep learning method to work. Are there larger datasets that can be tested?

2. I am not sure there is a way to do this better than the thorough experimental setup already presented, but in the evaluation through downstream tasks (active learning and Bayesian optimization), there is still some conflation between the quality of the predictions and the quality of the uncertainty estimates. It might be beneficial to compare against a predictor whose uncertainty is computed exactly from ground-truth data; this could serve as a kind of upper bound on performance.
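
One rough way to prototype the upper bound suggested above is to construct an "oracle" uncertainty directly from the held-out absolute errors and pass it through the same interval metrics as the learned uncertainties; the construction below is an illustrative sketch under that assumption, not a method evaluated in the manuscript.

```python
import numpy as np

def interval_metrics(y_true, y_pred, sigma):
    """Empirical coverage and mean width of +/- 2*sigma prediction intervals."""
    half_width = 2.0 * sigma
    covered = np.abs(y_true - y_pred) <= half_width
    return float(covered.mean()), float((2.0 * half_width).mean())

def oracle_sigma(y_true, y_pred):
    """Oracle uncertainty: the absolute error itself, so the ranking of
    uncertainties matches the ranking of errors exactly."""
    return np.abs(y_true - y_pred)

# Hypothetical usage: compare a model's (coverage, width) against the oracle's.
#   cov_model, width_model = interval_metrics(y_test, mu_test, sigma_test)
#   cov_oracle, width_oracle = interval_metrics(
#       y_test, mu_test, oracle_sigma(y_test, mu_test))
```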

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012639.r003

Decision Letter 1

Nir Ben-Tal, Rachel Kolodny

29 Aug 2024

Dear Dr Yang,

Thank you very much for submitting your manuscript "Benchmarking uncertainty quantification for protein engineering" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we would like to accept this manuscript for publication, provided you make the small changes the reviewer requested.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Rachel Kolodny

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The author has addressed most of my comments.

Regarding the statistical evidence discussion, I do understand that creating new splits is its own endeavor, and this paper is not about FLIP but about uncertainty quantification. However, since the authors already did the hard work of training 5 models per split with different seeds, can they also report the standard deviation for all metrics in the supplementary table? This would help the reader gauge how different models perform relative to each other in a given context.
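
For concreteness, the requested per-seed aggregation could be produced with a simple group-by over a tidy results table; the file and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical tidy table with one row per (dataset, split, model, metric, seed).
results = pd.read_csv("metrics_per_seed.csv")

# Mean and standard deviation across seeds for every (dataset, split, model, metric).
summary = (
    results
    .groupby(["dataset", "split", "model", "metric"])["value"]
    .agg(["mean", "std"])
    .reset_index()
)
summary.to_csv("metrics_mean_std.csv", index=False)
```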

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012639.r005

Decision Letter 2

Nir Ben-Tal, Rachel Kolodny

14 Nov 2024

Dear Yang,

We are pleased to inform you that your manuscript 'Benchmarking uncertainty quantification for protein engineering' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted, you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Rachel Kolodny

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012639.r006

Acceptance letter

Nir Ben-Tal, Rachel Kolodny

11 Dec 2024

PCOMPBIOL-D-23-01757R2

Benchmarking uncertainty quantification for protein engineering

Dear Dr Yang,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Dorothy Lannert

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Supporting information.

    Code availability, OHE results, OHE vs. ESM comparison, additional prediction and uncertainty evaluation metrics, and additional active learning results.

    (PDF)

    pcbi.1012639.s001.pdf (4.6MB, pdf)
    Attachment

    Submitted filename: protein_uq_plos_compbio-reviewer_responses.pdf

    pcbi.1012639.s002.pdf (117.2KB, pdf)
    Attachment

    Submitted filename: protein_uq_plos_compbio-reviewer_responses2.pdf

    pcbi.1012639.s003.pdf (127.9KB, pdf)

    Data Availability Statement

    The code for the models, uncertainty methods, and evaluation metrics in this work is available at https://github.com/microsoft/protein-uq and archived at https://zenodo.org/doi/10.5281/zenodo.7839141.


