ACS Catal. 2023 Oct 13;13(21):13863–13895. doi: 10.1021/acscatal.3c02743

Table 1. Glossary of Terms Used Frequently in the Context of Machine Learning for Enzyme Engineering.^a

accuracy: Metric primarily used for classification tasks, measuring the ratio of correct predictions to all predictions produced by a machine learning model.
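As a minimal illustration, accuracy can be computed in a few lines of Python (the function name `accuracy` is our own, not taken from any particular library):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# three of four predictions are correct
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # → 0.75
```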
active learning: A type of machine learning in which the learning algorithm iteratively queries a person to provide labels for particular data points during training. The first iteration usually starts with many unlabeled and few labeled data points, e.g., protein sequences. After training on this data set, the algorithm proposes the next set of data points for labeling to the experimenter, e.g., more sequences to be characterized in the lab. Their labels are then provided to the algorithm for the next iteration, and the cycle repeats several times.
artificial intelligence (AI): Artificial intelligence, as defined by McCarthy, is “the science and engineering of making intelligent machines, especially intelligent computer programs”.
continual learning: A concept in which a model can train on new data while maintaining the abilities acquired from earlier training on old data.
cross-validation (K-fold): An approach to evaluating the performance of a model whereby a data set is split into K parts, the model is retrained K times on all but one part, and the performance is evaluated on the excluded part. This way, each data point is used once for validation, and K different evaluations are produced to provide a distribution of the values.
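The splitting step of K-fold cross-validation can be sketched in plain Python (the helper name `k_fold_splits` is hypothetical; libraries such as scikit-learn provide equivalent utilities):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, val_indices) for each of the K folds."""
    indices = list(range(n_samples))
    # distribute samples as evenly as possible across the K folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]          # held-out part
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

# each sample appears in exactly one validation fold
for train, val in k_fold_splits(6, 3):
    print(train, val)
```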
deep learning (DL): A branch of machine learning that uses multiple-layer neural network architectures. Deep networks generally include many more parameters (sometimes billions) and hyperparameters than traditional machine learning models. This gives deep neural networks tremendous expressive power and design flexibility, which has made them a major driver of modern technology, with applications ranging from on-the-fly text generation to protein structure prediction.
diffusion: Deep learning paradigm based on denoising diffusion probabilistic modeling. Diffusion models learn to generate novel objects, e.g., images or proteins, by reconstructing artificially corrupted training examples.
embedding: Representation of high-dimensional data, e.g., text, images, or proteins, in a lower-dimensional vector space while preserving important information.
end-to-end learning: A type of ML that requires minimal to no data transformation (e.g., just one-hot encoding of the input) to train a predictor. This is often the case in deep learning, when abundant data are available to establish a direct input-to-output correspondence, in contrast to classical ML approaches using small data sets, which typically require feature and label engineering before the data can be used for training.
equivariance: An ML model is said to be equivariant with respect to a particular transformation if the order of applying the transformation and the model to an input does not change the outcome. For example, if we pass a rotated input to a model that is equivariant to rotation, the result will be the same as if the model were applied to the original input and the output were then rotated.
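A toy Python illustration of this property: an elementwise model commutes with reversing its input sequence, so it is equivariant to reversal (reversal stands in for rotation here purely for simplicity):

```python
def model(xs):
    """A toy 'model' that doubles each element."""
    return [2 * x for x in xs]

def reverse(xs):
    """The transformation we test equivariance against."""
    return xs[::-1]

x = [1, 2, 3]
# transforming then applying the model equals
# applying the model then transforming
assert model(reverse(x)) == reverse(model(x))
```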
explainable artificial intelligence (XAI): AI- or ML-based algorithms designed such that humans can understand the reasons for their predictions. Its core principles are transparency, interpretability, and explainability.
findable, accessible, interoperable, reusable (FAIR principles): Principles for the management and stewardship of scientific data to ensure findability, accessibility, interoperability, and reusability.
fine-tuning: An approach to transfer learning (see “transfer learning” below) in which all or part of the weights of an artificial neural network pretrained on another task are further adjusted (“fine-tuned”) during training on a new task.
generalizability: In the context of ML models, generalizability refers to a model’s ability to perform well on new data not used during the training process.
generative models: Algorithms aiming to capture the data distribution of the training samples in order to generate novel samples resembling the data that they were trained on.
inductive bias: The set of assumptions of a model used for predictions over unknown inputs. For example, the model can be built such that it can only predict values within a certain range consistent with the expected range of values for a particular problem.
learning rate: A parameter influencing the speed of training of an ML model. A higher learning rate increases the effect of a single pass of the training data on the model’s parameters.
loss function/cost function: A function used to evaluate a model during training. By iteratively minimizing this function, the model updates the values of its parameters. A typical example is the mean-squared error of prediction.
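The interplay of the loss function and the learning rate (see the two entries above) can be sketched with gradient descent on a one-parameter mean-squared-error problem (toy data; `lr` is the learning rate):

```python
# Fit y = w * x by gradient descent on the mean-squared-error loss.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # toy data generated with the true w = 2

w, lr = 0.0, 0.05
for _ in range(200):
    # gradient of MSE with respect to w: mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad      # one update step, scaled by the learning rate

print(round(w, 3))      # converges close to the true value 2.0
```

A larger `lr` makes each update step bigger; too large a value makes the updates overshoot and the training diverge.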
machine learning (ML): Machine learning, according to Mitchell, is the science that is “concerned with the question of how to construct computer programs that automatically improve with experience”. The terms ML and AI are often used interchangeably, but such usage is an oversimplification, as ML involves learning from data, whereas AI can be more general.
masking: Deep learning paradigm for self-supervised learning. Neural networks trained in a masked-modeling regime acquire a powerful understanding of data by learning to reconstruct masked parts of inputs, e.g., masked words in a sentence, numerical features artificially set to zeros, or hidden side chains in a protein structure.
multilayer perceptron (MLP): A basic architecture of artificial neural networks. It consists of multiple fully connected layers of neurons. Each neuron calculates a weighted sum of inputs from the previous layer, applies a nonlinear activation function, and sends the result to the neurons in the next layer.
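A forward pass through such fully connected layers can be sketched in plain Python (the weights below are arbitrary toy values, and `tanh` stands in for any nonlinear activation):

```python
import math

def mlp_forward(x, layers):
    """Forward pass through fully connected layers.

    layers: list of (weights, biases) per layer, where weights[j]
    is the weight vector of neuron j and biases[j] its bias.
    """
    for weights, biases in layers:
        # each neuron: weighted sum of inputs + bias, then tanh
        x = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
             for ws, b in zip(weights, biases)]
    return x

# 2 inputs -> 2 hidden neurons -> 1 output (toy hand-set weights)
layers = [
    ([[0.5, -0.5], [0.3, 0.8]], [0.0, 0.1]),   # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
print(mlp_forward([1.0, 2.0], layers))
```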
multiple sequence alignment (MSA): A collection of protein or nucleic acid sequences aligned based on specific criteria, e.g., allowing the introduction of gaps with a given penalty or providing substitution values for pairs of residue types, to maximize similarity at aligned sequence positions. Sequence alignments provide useful insights into the evolutionary conservation of sequences.
one-hot encoding: For a protein sequence, each amino acid residue is represented by a 20-dimensional vector with a value of one at the position of the corresponding amino acid in the 20-letter alphabet and zeros elsewhere.
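A minimal Python sketch of one-hot encoding a protein sequence (using the 20 standard amino acids in alphabetical one-letter-code order):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet

def one_hot(sequence):
    """Encode each residue as a 20-dimensional 0/1 vector."""
    return [[1 if aa == ref else 0 for ref in AMINO_ACIDS]
            for aa in sequence]

encoded = one_hot("MKV")
print(len(encoded), len(encoded[0]))  # → 3 20
print(encoded[0].index(1))            # position of 'M' in the alphabet
```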
overfitting: A case of inappropriate training of an ML model in which the model has too many degrees of freedom and uses them to fit the noise in the training data. As a result, the model can reach seemingly excellent performance on the training data but will fail to generalize to new data.
regularization: In ML, regularization is a process used to prevent model overfitting. Regularization techniques typically include adding a penalty on the magnitude of model parameters to the loss function, favoring low parameter magnitudes and thereby compressing the parameter space. Another popular regularization method for DL is to use a dropout layer, which randomly switches network neurons on and off during training.
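The penalty-based variant (L2 regularization) can be sketched by adding the squared parameter magnitudes, scaled by a hypothetical strength `lam`, to a mean-squared-error loss:

```python
def mse(y_true, y_pred):
    """Mean-squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def regularized_loss(y_true, y_pred, weights, lam=0.01):
    """MSE plus an L2 penalty: large weights increase the loss."""
    return mse(y_true, y_pred) + lam * sum(w ** 2 for w in weights)

# a perfect prediction still pays a price for large weights
print(regularized_loss([1.0], [1.0], [2.0], lam=0.5))  # → 2.0
```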
reinforcement learning (RL): A machine learning paradigm in which the problem is defined as finding an optimal sequence of actions in an environment where the value of each action is quantified using a system of rewards. For example, if RL were used to learn to play a board game, the model would function as a player, the actions would represent the moves available to the player, and the game would constitute the environment. Using game simulations, the model learns to perform stronger actions based on the “rewards” (feedback from the environment, e.g., winning/losing the game) it receives for its past actions.
self-supervised learning: A machine learning paradigm for utilizing unlabeled data in a supervised learning setting. The key is to define a proxy task for which labels can be synthetically generated from the unlabeled data, so that the task can be learned in a supervised manner. For example, in natural language processing, a popular task is to take sentences written in natural language, mask (see “masking” above) words in those sentences, and learn to predict the missing words. Such a proxy task can help the model learn the distribution of the data it is supposed to work with.
semisupervised learning: A machine learning paradigm for tasks where the amount of labeled data is limited but there is an abundance of unlabeled data available. The unlabeled data are used to learn the general distribution of the data, aiding the learning of a supervised model. For example, all the data can be clustered by an unsupervised algorithm, and the unlabeled samples can be automatically labeled based on the labels present in each cluster, enlarging the data set for supervised learning, which can benefit its performance despite the lower quality of the labeling.
supervised learning: A machine learning paradigm in which the goal is to predict a particular property, known as a label, for each data point. For example, the data points can be protein sequences, and the property to be predicted would be solubility (soluble/insoluble). Training a model in a supervised way requires training data equipped with labels.
transfer learning: A machine learning technique in which a model is first trained for a particular task and then used (“transferred”) as a starting point for a different task. Some of the learned weights can be further tuned to the new task (see “fine-tuning” above), or the transferred model can be used as part of a new model that includes, for example, additional layers trained for the new task.
transformer: A neural network architecture that learns to perform complex tasks by deducing how all parts of input objects, e.g., words in a sentence or amino acids in a protein sequence, are related to each other, using a mechanism called “attention”. The transformer is currently one of the most prominent neural network architectures.
underfitting: The case of insufficient training of an ML model, where the model cannot capture the patterns in the available data well and exhibits a high training error. It can be caused, for example, by a wrongly chosen model class, too strong regularization, an inappropriate learning rate, or too short a training time.
unsupervised learning: A machine learning paradigm in which the goal is to identify patterns in unlabeled data and the data distribution. Typical examples of unsupervised learning techniques include clustering algorithms and data compression or projection methods such as principal component analysis. The advantage of these methods is the capability of handling unlabeled data, often at the expense of predictive power.
^a Focusing on their meaning in the context of this Perspective. The terms from this table are highlighted in bold upon their first usage in the text.