Bioinformatics. 2025 Sep 24;41(10):btaf544. doi: 10.1093/bioinformatics/btaf544

Accessible, uniform protein property prediction with a scikit-learn based toolset AIDE

Evan Komp 1,2, Kristoffer E Johansson 3, Nicholas P Gauthier 4,5, Japheth E Gado 6, Kresten Lindorff-Larsen 7, Gregg T Beckham 8,9,
Editor: Xin Gao
PMCID: PMC12553329  PMID: 40991335

Abstract

Summary

Protein property prediction via machine learning, with and without labeled data, is becoming increasingly powerful, yet methods are disparate and capabilities vary widely across applications. The software presented here, “Artificial Intelligence Driven protein Estimation (AIDE)”, enables instantiating, optimizing, and testing many zero-shot and supervised property prediction methods for variants and variable-length homologs in a single, reproducible notebook or script by defining a modular, standardized application programming interface (API) that is drop-in compatible with scikit-learn transformers and pipelines.

Availability and implementation

AIDE is an installable, importable Python package that inherits from scikit-learn classes and their API, and it runs on Windows, Mac, and Linux. Many of the models wrapped within AIDE will be effectively inaccessible without a GPU, and some assume CUDA. The newest stable, tested version can be found at https://github.com/beckham-lab/aide_predict and a full user guide and API reference can be found at https://beckham-lab.github.io/aide_predict/. Static versions of both at the time of writing can be found on Zenodo.

1 Introduction

Proteins achieve diverse functions across binding, transport, and catalysis, and they are being leveraged by humans for health and industrial applications (Ellis et al. 2021, Victorino de Silva Amatto et al. 2022, Ebrahimi and Samanta 2023). The protein design space is nearly infinite and mostly non-functional, making it difficult to navigate (Bank 2022, Ding et al. 2024, Johnston et al. 2024, Park et al. 2024). Researchers inform protein design with a variety of methods, from first principles to data-driven approaches, and introduce variation that targets favorable properties with higher likelihood than random mutation (Madhavan et al. 2021). For the purposes of the software presented here, we reduce scope to methods whose parameters are determined by fitting to data, e.g. machine learning (ML).

ML has found use in protein property prediction (PPP) in many applications (Ferruz et al. 2023, Notin et al. 2024), broadly categorized as supervised ML (SML), where parameters are determined to minimize the error of model outputs on a set of examples with known experimental labels, and “zero-shot” (ZS) learning, where the model parameters are determined for a related task that does not require pre-determined labels and whose outputs may correlate with the property of interest (Hopf et al. 2019, Frazer et al. 2021, Notin et al. 2022a, 2022b, 2023b, Ding et al. 2024). Initially, proteins were represented to ML algorithms via fixed-length vectors such as amino acid distributions, K-mer counts, structural surface properties, or binary and independent “one-hot” encodings of amino acids (Chen et al. 2022). These features can be given to traditional SML algorithms such as linear models, tree methods, and Gaussian processes (Block et al. 2006, Fox et al. 2007, Romero et al. 2013, Wu et al. 2019). The introduction of self-supervised training in protein language and structure models (PLMs) has expanded the number and diversity of available methods. Such models can operate directly on potentially multiple intrinsic data modes of the protein, and the “learned” embeddings can be used to represent or predict properties of the system (Meier et al. 2021, Elnaggar et al. 2022, Hayes et al. 2025). Many global predictors of, e.g. stability have been trained on this principle (Gado et al. 2025, Komp et al. 2023, Pudžiuvelytė et al. 2024). The reconstruction likelihoods these models were initially trained for can also sometimes correlate with a property of interest, expanding available ZS methods beyond alignment scores (Eddy 2011, Meier et al. 2021, Marquet et al. 2022).

While the increase in quantity and capability of PPP methods is encouraging, there are nuances among available tools such as the required inputs, whether they be purely sequence-based, evolutionary homologs, or other data modes (Hopf et al. 2019, Frazer et al. 2021, Dauparas et al. 2022, Su et al. 2024). Given that there is not one method that performs best for every task (e.g. SaProt, which performs best on average in the ProteinGym deep mutational scan ZS benchmarks as of August 2024, is not in the top 5 scorers for 141/217 benchmarks (Notin et al. 2023a, Su et al. 2023)), researchers must attempt to install many different tools and run separate scripts to determine the right tool for their application.

To increase accessibility and reproducibility, and to enable comparisons in PPP testing and utilization, here we present the software package Artificial Intelligence Driven protein Estimation (AIDE). The framework defines a scikit-learn-derived application programming interface (API) for models in both the SML and ZS categories (Fig. 1) (Pedregosa et al. 2011). We provide a modular, generalized base class for protein models, along with protocols, implemented as Mixins, that accommodate the various model nuances for new models (Bourque et al. 2002). This means that wrapping a given PPP method is rigorously defined and, once done, it can be imported within a single script or notebook along with other methods and executed using the same scikit-learn API. Furthermore, disparate models with various dependencies are kept modular, calling subprocesses when necessary instead of running in a single “mega function”, meaning the library can continue to expand without increasingly difficult user installation or whole-codebase overhauls. We provide, in a single notebook each, exemplary applications of the library to (1) benchmark ZS models, embedders, and nonlinear models against an epistatic combinatorial dataset, and (2) optimize a supervised prediction pipeline of polyethylene terephthalate hydrolase (PETase) homolog activity at low pH that can be loaded and executed in two lines of code like any scikit-learn pipeline. The software is extensively unit tested with continuous integration at >80% coverage. This software provides researchers a “virtual toolbox” for PPP, and we hope it increases the visibility of any models added by the community. The full API documentation, roadmap, and user guide are in the Supplementary Data User Guide (UG), available as supplementary data at Bioinformatics online, and at https://beckham-lab.github.io/aide_predict/ with code at: https://github.com/beckham-lab/aide_predict.
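To illustrate the two-line reuse of a fitted pipeline, the minimal sketch below loads a persisted pipeline and scores new homologs exactly as one would with any pickled scikit-learn estimator. The file names and the `from_fasta` constructor are illustrative assumptions, not AIDE's documented API.

```python
import joblib
from aide_predict import ProteinSequences  # data structure from Section 3.1; import path assumed

# File names and the `from_fasta` constructor are illustrative assumptions.
candidates = ProteinSequences.from_fasta("candidate_homologs.fasta")
pipeline = joblib.load("petase_activity_pipeline.joblib")  # previously fitted and saved
predictions = pipeline.predict(candidates)
```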

Figure 1. Overview of AIDE. The package supplies a scikit-learn model subclass that operates on a dataclass representing proteins, as opposed to numpy matrices, and a number of Mixin protocols with which protein models adopt common behavior and compatibility checks for user data. These models are drop-in compatible with existing scikit-learn classes and pipelines, allowing multistep processes to be defined within a single script in a reproducible way, as one would for a traditional scikit-learn application.

2 Challenges with the current ecosystem

We suggest that the following issues face today's ecosystem of PPP tools, whether for the use of individual methods or for tool aggregation:

  1. There is no localized repository of methods that a researcher can use to sandbox their task.

  2. Embedding and ZS methods in the literature use disparate, bespoke interfaces.

  3. Existing tools do not typically adhere to common software engineering principles associated with long-lasting, evolving toolsets, such as unit tests.

These challenges mean that when trying multiple methods on a new application, the use of each must be tracked in natural language, without the reproducibility of a repeatable script. A researcher must learn the specific format that each tool requires and is less likely to find bugs, which makes community development harder: a contributor must first learn a previous researcher’s coding style while simultaneously trying to read the algorithm. Bespoke code also makes extending the toolset difficult from a dependency-management perspective; adding a new method likely involves whole-codebase updates and additional installation dependencies in the top-level module.

There have already been efforts to alleviate parts of these three challenges, but to our knowledge, no single tool addresses them all (Dallago et al. 2021, Siedhoff et al. 2021, Wittmann et al. 2021, Chen et al. 2022, Sequeira et al. 2022, Notin et al. 2023a, Funk et al. 2024, Su et al. 2024). While these toolsets are all useful, very few exhibit high test coverage, if any tests are present at all. Most importantly, each defines a unique API and environment that must be reworked to extend the toolset.

3 Software description

Notably, the disparity in code architecture and model internals is the same challenge overcome by scikit-learn, a unifying ML package where a user can, e.g. test a ridge regressor and a random forest regressor (two very different algorithms) on their data in the same way (Pedregosa et al. 2011). Scikit-learn leverages an intuitive init-fit-transform-score API as a gateway to ML and is used by nearly all ML practitioners. The presented software operates on protein data structures and protein models just as scikit-learn does on fixed-length vectors (Fig. 1). Given this, AIDE is built on, and drop-in compatible with, scikit-learn pipelines.
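As a minimal sketch of this compatibility, a hypothetical script might place an AIDE embedder from Table 1 directly into a scikit-learn pipeline. The `aide_predict` import path, the constructors, and the toy sequences and labels below are assumptions for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Import path and constructors are assumptions for illustration.
from aide_predict import OneHotProteinEmbedding, ProteinSequences

# Toy fixed-length sequences with made-up labels.
train_sequences = ProteinSequences(["MKAILV", "MKAILA", "MKAGLV", "MKVILV"])
y = [0.1, 0.4, 0.2, 0.9]

model = Pipeline([
    ("embed", OneHotProteinEmbedding()),  # ProteinSequences in, feature matrix out
    ("scale", StandardScaler()),
    ("head", Ridge(alpha=1.0)),
])
model.fit(train_sequences, y)
print(model.predict(ProteinSequences(["MKVILA"])))
```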

3.1 Core classes

The software centers on two types of classes: data structures (ProteinSequence, ProteinSequences, and ProteinStructure) and model classes deriving from ProteinModelWrapper. The data structures allow for quick input/output of protein structures, sequences, and multiple sequence alignments. They also expose helpful methods, such as inducing mutations, checking for differences between two sequences, and generating mutagenesis libraries to pass to models, along with attributes such as whether a sequence contains gaps. These data structures are analogous to the X array in standard scikit-learn applications. The ProteinModelWrapper accepts these ProteinSequences as inputs. It can act as a transformer or a regressor depending on how it is inherited, and it follows the expected API: parameters at initialization, then fit-transform-score. Wrapping a new method into AIDE as a ProteinModelWrapper requires defining only the core behavior of the model, while existing Mixins can be added to handle commonalities exhibited in the literature; see the UG Section 2.5, available as supplementary data at Bioinformatics online. The Mixins check and handle the inputs and outputs, converting the data to a format compatible with the tool. A wrapper must be defined only once; the model can then be used in scripts and notebooks. Example 6 in the UG Section 2.2, available as supplementary data at Bioinformatics online, depicts a wrapped model, and UG Section 2.12, available as supplementary data at Bioinformatics online, provides a guide to doing so. Some common Mixins that a particular model may use are listed below (a minimal wrapping sketch follows the list):

  • RequiresMSAForFitMixin: when mixed in for a bespoke model, AIDE will check that ProteinSequences coming into the fit method are aligned, and if not, it will attempt to align them.

  • RequiresStructureMixin: AIDE will ensure that incoming ProteinSequences have associated structures, falling back to a wild-type structure, if present, with a warning.

  • RequiresWTToFunctionMixin: AIDE will require that a wild-type sequence be available within the model.

A full list of current Mixins is given in the UG Section 2.5.7, available as supplementary data at Bioinformatics online. Undoubtedly, more Mixins will be required as the PPP field progresses, and we focused on modularity to accommodate this.
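To make the wrapping protocol concrete, the sketch below defines a toy zero-shot scorer using the base class and one of the Mixins named above. This is a minimal sketch: the `_transform` hook and the `wt` attribute for the wild-type sequence are assumptions, and UG Section 2.12 documents the actual contribution protocol.

```python
import numpy as np
from aide_predict import ProteinModelWrapper, RequiresWTToFunctionMixin

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

class HydropathyDeltaModel(RequiresWTToFunctionMixin, ProteinModelWrapper):
    """Toy zero-shot scorer: change in mean hydropathy relative to wild type."""

    @staticmethod
    def _mean_hydropathy(seq):
        return float(np.mean([KD[aa] for aa in str(seq)]))

    def _transform(self, X):
        # The Mixin guarantees a wild-type sequence is available (the attribute
        # name `wt` is assumed here) before this core behavior is executed.
        wt_score = self._mean_hydropathy(self.wt)
        return np.array([[self._mean_hydropathy(s) - wt_score] for s in X])
```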

3.2 Supporting functions

Various utilities are provided to both expand functionality and reduce user friction. For example, functions are provided to check which ProteinModels are installed and which are compatible with the user’s available data. Others sample multiple sequence alignments (MSAs), align sequences in place, and map structures. We also provide tools for common tasks on either side of PPP, like predicting structures with SoloSeq, creating MSAs with MMseqs2, or sampling designed sequences to maximize predictions with BADASS (Steinegger and Söding 2017, Ahdritz et al. 2024, Gomez-Uribe et al. 2024).
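For a flavor of these helpers, a hypothetical session might look like the following. Every method name here is an assumption chosen for illustration; the UG documents the real utilities.

```python
from aide_predict import ProteinSequence, ProteinSequences

# All method names below are illustrative assumptions; see the UG for the real API.
wt = ProteinSequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
library = wt.saturation_mutagenesis()   # single-substitution library to feed a model
variant = wt.mutate("A4V")              # induce a point mutation

homologs = ProteinSequences.from_fasta("homologs.fasta")
msa = homologs.align()                  # align sequences before fitting MSA-based models
```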

3.3 Currently supported tools and installation

In AIDE, we separate methods into their own modules and implement checks for whether they are available based on their dependencies. The PPP methods currently wrapped in the API fall into three categories: those that are functional with only the lightweight dependency list of the base package, those that require additional pip-installable dependencies, and those that require more setup, such as cloning a repository and following installation instructions. This avoids requiring users to integrate environments to use multiple toolsets simultaneously; instead, they install just the modules they would like, and no single package environment becomes too heavy to solve. For example, hidden Markov models (HMMs) and one-hot encodings are accessible with the base module, while ESM and a few others require the addition of PyTorch (Eddy 2011, Paszke et al. 2019, Meier et al. 2021). To access them, the user installs the additional dependencies in “requirements-transformers.txt”. For EVE, the user is instructed to set up the original work’s EVE environment and set an environment variable (Frazer et al. 2021). To use EVE, AIDE then calls it as a subprocess, such that compatibility between other AIDE models and EVE is not an issue.

The methods currently available to the user, given their environment and data, can be checked from within AIDE via a function call. The full list of currently wrapped tools is given in Table 1 (Eddy 2011, Hopf et al. 2017, Frazer et al. 2021, Meier et al. 2021, Rao et al. 2021, Marquet et al. 2022, Su et al. 2023, Blaabjerg et al. 2024). Note that this is an initial list to serve as a proof of concept, and we plan, along with any community contributors, to add additional models and data modes.
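A call of roughly the following shape reports what can run in the current environment. The function name and import path are placeholders chosen for illustration; consult the API reference for the actual function.

```python
# `get_supported_tools` is a placeholder name for the availability check
# described above; the real function is documented in the API reference.
from aide_predict.utils import get_supported_tools  # assumed import path

for tool, available in get_supported_tools().items():
    status = "ready" if available else "missing dependencies"
    print(f"{tool}: {status}")
```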

Table 1.

Currently wrapped tools in AIDE.

Class | Type | Dependencies | Description
OneHotProteinEmbedding | Emb | (base) | 20 AA binary indicators for a fixed sequence length
OneHotAlignedEmbedding | Emb | (base) | 20 AA + gap indicators of an MSA; new sequences are aligned to the MSA
KMerEmbedding | Emb | (base) | K-mer counts of specified order
HMMWrapper | ZS | (base) | Likelihood scores against a set of related sequences, assuming independent column-wise frequencies
EVMutationWrapper | ZS | Evcouplings | Hamiltonian scores of pairwise evolutionary couplings relative to a wild-type sequence
ESM2Embeddings | Emb | Transformers | Position-specific embeddings from the ESM2 PLM
ESM2LikelihoodWrapper | ZS | Transformers | Masked, wild-type, or mutant marginal likelihood from ESM2
SaProtEmbedding | Emb | Transformers, foldseek | Position-specific embeddings from sequence and structure tokens from the SaProt PLM
SaProtLikelihoodWrapper | ZS | Transformers, foldseek | Masked, wild-type, or mutant marginal likelihood of sequence tokens given sequence and structure
MSATransformerEmbedding | Emb | Transformers, fair-esm | Position-specific embeddings conditioned on a whole or sampled MSA
MSATransformerLikelihoodWrapper | ZS | Transformers, fair-esm | Masked, wild-type, or mutant marginal likelihood of sequence tokens given MSA tokens
VESPA | ZS | Vespa-effect | Single-substitution mutation scores given conservation predictions of a model trained on PLM embeddings and MSAs
EVE | ZS | (a) | Variant effect prediction via a VAE trained on an MSA
SSEmbedder | Emb | (a) | Final-layer embeddings of a structure-constrained MSA language model
SSEmbPredictor | ZS | (a) | Additive log-likelihood of a structure-constrained MSA language model

(a) The tool requires building the original independent environment.

3.4 Unit tests

All base classes and modules are extensively unit tested. We cannot ensure that the wrapped methods themselves are properly tested upstream; however, we ensure that while the interpreter is within AIDE, it is executing tested code, and that the wrapped methods evaluated on the ProteinGym benchmark produce outputs matching the reported score for the ENV_ECOLI dataset (Notin et al. 2023a). As of September 2025, AIDE has >80% code test coverage.

4 Showcases

We provide two illustrative cases of using AIDE to efficiently compare ML strategies, with all of the typical tasks such as cross-validation and hyperparameter optimization. We do not intend to highlight a particular scientific finding for these systems or claim that the findings extend to other systems, but rather to highlight the work that can be accomplished with relatively little coding in a reproducible manner. See Appendices A and B, available as supplementary data at Bioinformatics online, for notebooks with the complete details and execution of each task.

4.1 Showcase 1: benchmarking on an epistatic combinatorial dataset

We optimized and tested several prediction strategies on a recent, highly epistatic, 4-site combinatorial enzyme activity dataset by Johnston et al. (2024). Similar to work by Hsu et al., we tested “augmenting” ZS scorers with supervised predictors, and we included a few ZS scorers not explored in the original work, namely MSATransformer and ESM2 (Hsu et al. 2022, Meier et al. 2021, Hayes et al. 2025). We also tested embedding strategies beyond one-hot encoding: ESM embeddings mean-pooled over the whole sequence or over just the four variable sites. Hyperparameter optimization was conducted for each strategy with five-fold cross-validation. Embedders were defined and integrated into scikit-learn pipelines including feature and target scaling, principal component analysis, and predictor heads. We did this in a single Jupyter notebook executed on a laptop with less than an 8-h runtime, which can be found in the repository or in PDF form in Appendix A, available as supplementary data at Bioinformatics online. The results of this comparison are shown in Fig. 2.
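A condensed sketch of one such strategy follows: an AIDE embedder from Table 1 feeding scaling, PCA, and a predictor head, tuned with five-fold cross-validation. The `aide_predict` import path and embedder arguments are assumptions, and `train_sequences` and `activities` are placeholders standing in for the Johnston et al. variants and measurements.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

from aide_predict import ESM2Embeddings  # wrapped embedder from Table 1; path assumed

pipe = Pipeline([
    ("embed", ESM2Embeddings()),          # mean-pooled PLM embeddings (arguments assumed)
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("head", RandomForestRegressor()),
])

search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [10, 50, 100],
                "head__n_estimators": [100, 500]},
    cv=5,                                 # five-fold cross-validation, as in the text
)
search.fit(train_sequences, activities)   # ProteinSequences and measured labels
```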

Figure 2. Performance of various models on the four-site epistatic combinatorial dataset by Johnston et al. as a function of dataset size (Johnston et al. 2024). Embeddings for the supervised model include one-hot encoding (ohe), ESM2 mean pooling over the entire protein sequence (full), and over only the four variable positions (four sites) (Meier et al. 2021). (A) Pure supervised (“Sup.”, dashed line with color) versus zero-shot (ZS, grey) and zero-shot augmented (“Aug.”, solid line with color) models with EVCouplings scores (Hopf et al. 2019). Augmentation improves the performance of one-hot encoding at low training data but does not affect ESM-based embedding, and the effect is negligible for >500 training points. (B) Top-10 recovery, defined as the fraction of the 10 true best variants recovered in a final set of 96, chosen using linear versus nonlinear pure supervised models; at about 1000 training examples, nonlinear models begin to significantly outperform linear ones.

4.2 Showcase 2: PETase activity predictor for scraping natural sequences

We recently conducted a large search for remote polyethylene terephthalate hydrolase (PETase) homologs to identify novel enzymes (Norton-Baker et al. 2025). Here, we use this dataset to produce an activity predictor for future searches. We tested several embedding strategies: one-hot encoding of PETase alignments, ESM2 pooled embeddings, SaProt (structure-aware) embeddings using AF2-predicted structures, and MSATransformer embeddings using an MSA of 61 known PETases (Meier et al. 2021, Rao et al. 2021, Su et al. 2023). We tested these with a linear model head and a nonlinear random forest head. For each combination, we conducted five-fold cross-validation and random hyperparameter optimization. The results are compared to an HMM score built from known PETases (Fig. 3). This was conducted in a single Jupyter notebook on a laptop in less than a 4-h runtime, which can be found in the code repository. A PDF of the notebook is also provided in Appendix B, available as supplementary data at Bioinformatics online.
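The random hyperparameter optimization for one such combination might look like the sketch below, scored with Spearman correlation as in Fig. 3. Again, the `aide_predict` import path and embedder arguments are assumptions, and `petase_sequences` and `activities` are placeholders for the homolog sequences and measured activities.

```python
from scipy.stats import randint, spearmanr
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

from aide_predict import SaProtEmbedding  # structure-aware embedder from Table 1; path assumed

pipe = Pipeline([
    ("embed", SaProtEmbedding()),          # requires structures (arguments assumed)
    ("head", RandomForestRegressor()),
])

spearman = make_scorer(lambda y_true, y_pred: spearmanr(y_true, y_pred).correlation)
search = RandomizedSearchCV(
    pipe,
    param_distributions={"head__n_estimators": randint(50, 500),
                         "head__max_depth": randint(2, 20)},
    n_iter=20, cv=5, scoring=spearman,     # random search with five-fold CV
)
search.fit(petase_sequences, activities)   # homologs with activity measured at pH 5.5
```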

Figure 3. Performance of various combinations of embedding strategy and predictor head for PETase homolog activity prediction. PET hydrolysis activity was measured at pH = 5.5, 40°C (Norton-Baker et al. 2025). Horizontal lines show the “null” model of using only an HMM score (cyan: Spearman correlation to measured activity; salmon: area under the receiver operating characteristic curve, “roc_auc”, for nonzero measured activity). The null model is incapable of predicting active versus inactive PETase homologs. Bars are five-fold CV scores after hyperparameter optimization of linear and nonlinear random forest models against four embedding strategies for PETase activity prediction at pH = 5.5, 40°C: ESM mean pooling, SaProt mean pooling, MSATransformer flattened embedding of the query sequence, and one-hot encoding of a held-out alignment (Meier et al. 2021, Rao et al. 2021, Su et al. 2024).

4.3 Examples

In addition to the showcases, we provide several executable code examples to highlight the conciseness and readability of the API, given in the “demo” folder of the code repository and in the UG Section 2.2, available as supplementary data at Bioinformatics online. These include checking which models are compatible with user data, in silico mutagenesis, training a global supervised WT predictor, combining ZS and supervised methods into a single scikit-learn pipeline, and wrapping a new method into the API.

5 Conclusion

We have written a set of scikit-learn-derived Python classes that unify the use of protein models for both embedding and ZS prediction. Models are loaded and called with the same init-fit-transform-score API that users are familiar with in scikit-learn (Pedregosa et al. 2011). Common workflows associated with input and output requirements were codified, making the addition of new models straightforward and well defined. This enables rapid comparison against user data to determine which methods are compatible, and execution of those methods in reproducible scripts. The open-source package is unit tested, and we anticipate it will make creating and testing reproducible protein property prediction pipelines more accessible.

Supplementary Material

btaf544_Supplementary_Data

Contributor Information

Evan Komp, Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO 80401, United States; Agile BioFoundry, Emeryville, CA 94608, United States.

Kristoffer E Johansson, Linderstrøm-Lang Centre for Protein Science, Section for Biomolecular Sciences, Department of Biology, University of Copenhagen, Copenhagen, Denmark.

Nicholas P Gauthier, Department of Systems Biology, Harvard Medical School, Boston, MA 02115, United States; Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02115, United States.

Japheth E Gado, Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO 80401, United States.

Kresten Lindorff-Larsen, Linderstrøm-Lang Centre for Protein Science, Section for Biomolecular Sciences, Department of Biology, University of Copenhagen, Copenhagen, Denmark.

Gregg T Beckham, Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden Colorado, CO 80401, United States; Agile BioFoundry, Emeryville, CA 94608, United States.

Author contributions

Evan Komp (Conceptualization [equal], Methodology [equal], Software [equal], Writing—original draft [equal], Writing—review & editing [equal]), Kristoffer E. Johansson (Methodology [equal], Software [equal], Writing—original draft [equal], Writing—review & editing [equal]), Nicholas P. Gauthier (Methodology [equal], Writing—original draft [equal], Writing—review & editing [equal]), Japheth E. Gado (Methodology [equal], Writing—original draft [equal], Writing—review & editing [equal]), Kresten Lindorff-Larsen (Methodology [equal], Writing—original draft [equal], Writing—review & editing [equal]), and Gregg T. Beckham (Conceptualization [equal], Writing—original draft [equal], Writing—review & editing [equal])

Supplementary data

Supplementary data is available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was authored in part by the National Renewable Energy Laboratory for the US Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding to E.K. and G.T.B. was provided by the US Department of Energy Office of Science Biological and Environmental Research via DE-SC0023278. Partial funding to E.K. and G.T.B. was also provided by the US Department of Energy Office of Energy Efficiency and Renewable Energy Bioenergy Technologies Office (BETO) for the Agile BioFoundry. This material is also based upon work supported by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomic Science Program under Award Number DE-SC0022024 to N.P.G. The views expressed herein do not necessarily represent the views of the DOE or the US Government. The US Government retains, and the publisher, by accepting the article for publication, acknowledges that the US Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for US Government purposes. K.E.J. and K.L.L. acknowledge support by the Novo Nordisk Foundation centre PRISM (NNF18OC0033950).

Data availability

See GitHub (https://github.com/beckham-lab/aide_predict/) to access the latest version of the software, or use Google Colab (https://colab.research.google.com/drive/1baz4DdYkxaw6pPRTDscwh2o-Xqum5Krp). The state of the software at the time of writing is given in version v1.1.03 on Zenodo (Komp and Beckham 2025). A full user guide and API reference can be found at https://beckham-lab.github.io/aide_predict/. Data files for Figs 2 and 3 are available in the Supplementary Data, available as supplementary data at Bioinformatics online.

References

  1. Ahdritz G, Bouatta N, Floristean C  et al.  OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods  2024;21:1514–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bank C.  Epistasis and adaptation on fitness landscapes. Annu Rev Ecol Evol Syst  2022;53:457–79. [Google Scholar]
  3. Blaabjerg LM, Jonsson N, Boomsma W  et al.  SSEmb: a joint embedding of protein sequence and structure enables robust variant effect predictions. Nat Commun  2024;15:9646. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Block P, Paern J, Hüllermeier E  et al.  Physicochemical descriptors to discriminate protein–protein interactions in permanent and transient complexes selected by means of machine learning algorithms. Proteins Struct Funct Bioinf  2006;65:607–22. [DOI] [PubMed] [Google Scholar]
  5. Bourque P, Dupuis R, Abran A  et al.  Fundamental principles of software engineering—a journey. J Syst Softw  2002;62:59–70. [Google Scholar]
  6. Chen Z, Liu X, Zhao P  et al.  iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Research  2022;50:W434–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Dallago C, Schütze K, Heinzinger M  et al.  Learned embeddings from deep learning to visualize and predict protein sets. Curr Protoc  2021;1:e113. [DOI] [PubMed] [Google Scholar]
  8. Dauparas J, Anishchenko I, Bennett N  et al.  Robust deep learning-based protein sequence design using ProteinMPNN. Science  2022;378:49–56. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Ding D, Shaw AY, Sinai S  et al.  Protein design using structure-based residue preferences. Nat Commun  2024;15:1639. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Ebrahimi SB, Samanta D.  Engineering protein-based therapeutics through structural and chemical design. Nat Commun  2023;14:2411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Eddy SR.  Accelerated profile HMM searches. PLOS Comput Biol  2011;7:e1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Ellis LD, Rorrer NA, Sullivan KP  et al.  Chemical and biological catalysis for plastics recycling and upcycling. Nat Catal  2021;4:539–56. [Google Scholar]
  13. Elnaggar A, Heinzinger M, Dallago C  et al.  ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell  2022;44:7112–27. [DOI] [PubMed] [Google Scholar]
  14. Ferruz N, Heinzinger M, Akdel M  et al.  From sequence to function through structure: deep learning for protein design. Comput Struct Biotechnol J  2023;21:238–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Fox RJ, Davis SC, Mundorff EC  et al.  Improving catalytic function by ProSAR-driven enzyme evolution. Nat Biotechnol  2007;25:338–44. [DOI] [PubMed] [Google Scholar]
  16. Frazer J, Notin P, Dias M  et al.  Disease variant prediction with deep generative models of evolutionary data. Nature  2021;599:91–5. [DOI] [PubMed] [Google Scholar]
  17. Funk J, Machado L, Bradley SA  et al. ProteusAI: an open-source and user-friendly platform for machine learning-guided protein design and engineering. bioRxiv, 2024.10.01.616114, 2024, preprint: not peer reviewed.
  18. Gado JE, Knotts M, Shaw AY  et al.  Machine learning prediction of enzyme optimum pH. Nat Mach Intell  2025;7:716–29. 10.1038/s42256-025-01026-6 [DOI] [Google Scholar]
  19. Gomez-Uribe C, Gado J, Islamov M. Designing diverse and high-performance proteins with a large language model in the loop. bioRxiv, 2024.10.25.620340, 2024, preprint: not peer reviewed. [DOI] [PMC free article] [PubMed]
  20. Hayes T, Rao R, Akin H  et al.  Simulating 500 million years of evolution with a language model. Science  2025;387:850–8. 10.1126/science.ads0018 [DOI] [PubMed] [Google Scholar]
  21. Hopf TA, Green AG, Schubert B  et al.  The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics  2019;35:1582–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Hopf TA, Ingraham JB, Poelwijk FJ  et al.  Mutation effects predicted from sequence co-variation. Nat Biotechnol  2017;35:128–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Hsu C, Nisonoff H, Fannjiang C  et al.  Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol  2022;40:1114–22. [DOI] [PubMed] [Google Scholar]
  24. Johnston KE, Almhjell PJ, Watkins-Dulaney EJ  et al.  A combinatorially complete epistatic fitness landscape in an enzyme active site. Proc Natl Acad Sci  2024;121:e2400439121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Komp E, Alanzi HN, Francis R  et al.  Homologous pairs of low and high temperature originating proteins spanning the known prokaryotic universe. Sci Data  2023;10:682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Komp E, Beckham GT.  beckham-lab/aide_predict.  2025. 10.5281/zenodo.16986183 [DOI]
  27. Madhavan A, Arun KB, Binod P  et al.  Design of novel enzyme biocatalysts for industrial bioprocess: harnessing the power of protein engineering, high throughput screening and synthetic biology. Bioresour Technol  2021;325:124617. [DOI] [PubMed] [Google Scholar]
  28. Marquet C, Heinzinger M, Olenyi T  et al.  Embeddings from protein language models predict conservation and variant effects. Hum Genet  2022;141:1629–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Meier J, Rao R, Verkuil R  et al. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv, 10.1101/2021.07.09.450648, 2021, preprint: not peer reviewed. [DOI]
  30. Norton-Baker B, Komp E, Gado J  et al.  Activity across temperature and pH of PET hydrolase candidates. [Data set]. Zenodo. 2025. 10.5281/zenodo.15417757 [DOI]
  31. Norton-Baker B, Komp E, Gado JE  et al.  Machine learning-guided identification of PET hydrolases from natural diversity. ACS Catal  2025;15:16070–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Notin P, Dias M, Frazer J  et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. arXiv, 10.48550/arXiv.2205.13760, 2022a, preprint: not peer reviewed. [DOI]
  33. Notin P, Kollasch AW, Ritter D  et al. ProteinGym: large-scale benchmarks for protein fitness prediction and design. bioRxiv, 2023.12.07.570727, 2023a, preprint: not peer reviewed.
  34. Notin P, Niekerk LV, Kollasch AW  et al. TranceptEVE: combining family-specific and family-agnostic models of protein sequences for improved fitness prediction. bioRxiv, 2022.12.07.519495, 2022b, preprint: not peer reviewed.
  35. Notin P, Rollins N, Gal Y  et al.  Machine learning for functional protein design. Nat Biotechnol  2024;42:216–28. [DOI] [PubMed] [Google Scholar]
  36. Notin P, Weitzman R, Marks DS  et al. ProteinNPT: improving protein property prediction and design with non-parametric transformers. bioRxiv, 2023.12.06.570473, 2023b, preprint: not peer reviewed.
  37. Park Y, Metzger BPH, Thornton JW. The simplicity of protein sequence–function relationships. bioRxiv, 2023.09.02.556057, 2024, preprint: not peer reviewed. [DOI] [PMC free article] [PubMed]
  38. Paszke A, Gross S, Massa F  et al. PyTorch: an imperative style, high-performance deep learning library, arXiv, 10.48550/arXiv.1912.01703, 2019, preprint: not peer reviewed. [DOI]
  39. Pedregosa F, Varoquaux G, Gramfort A  et al. Scikit-learn: machine learning in Python. arXiv, 1201.0490, 2011, preprint: not peer reviewed.
  40. Pudžiuvelytė I, Olechnovič K, Godliauskaite E  et al.  TemStaPro: protein thermostability prediction using sequence representations from protein language models. Bioinformatics  2024;40:btae157. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Rao R, Liu J, Verkuil R  et al. MSA transformer. bioRxiv, 2021.02.12.430858, 2021, preprint: not peer reviewed.
  42. Romero PA, Krause A, Arnold FH.  Navigating the protein fitness landscape with Gaussian processes. Proc Natl Acad Sci  2013;110:E193–201. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Sequeira AM, Lousa D, Rocha M.  ProPythia: a python package for protein classification based on machine and deep learning. Neurocomputing (Amst)  2022;484:172–82. [Google Scholar]
  44. Siedhoff NE, Illig A-M, Schwaneberg U  et al.  PyPEF—an integrated framework for data-driven protein engineering. J Chem Inf Model  2021;61:3463–76. [DOI] [PubMed] [Google Scholar]
  45. Steinegger M, Söding J.  MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol  2017;35:1026–8. [DOI] [PubMed] [Google Scholar]
  46. Su J, Han C, Zhou Y  et al. SaProt: protein language modeling with structure-aware vocabulary. bioRxiv, 2023.10.01.560349, 2023, preprint: not peer reviewed.
  47. Su J, Li Z, Han C  et al. SaprotHub: making protein modeling accessible to all biologists. bioRxiv, 2024.05.24.595648, 2024, preprint: not peer reviewed.
  48. Victorino de Silva Amatto I, Gonsales da Rosa-Garzon N, Antônio de Oliveira Simões F  et al.  Enzyme engineering and its industrial applications. Biotechnol Appl Biochem  2022;69:389–409. [DOI] [PubMed] [Google Scholar]
  49. Wittmann BJ, Yue Y, Arnold FH.  Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst  2021;12:1026–45.e7. [DOI] [PubMed] [Google Scholar]
  50. Wu Z, Kan SBJ, Lewis RD  et al.  Machine learning-assisted directed protein evolution with combinatorial libraries. Proc Natl Acad Sci U S A  2019;116:8852–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
