Table 1.
Glossary of terms.
| Term | Description | Further reading / reference material | 
|---|---|---|
| Chem- / bioinformatics related terms | ||
| Chemical Entities of Biological Interest (ChEBI) ontology | Controlled vocabulary used to classify small molecules used to intervene in the processes of living organisms, based on, e.g., their biological role or chemical properties. | Degtyarenko et al. [67] | 
| Computer-aided synthesis planning (CASP) | Computational planning of steps to synthesize a target chemical compound from available starting materials. | Engkvist et al. [68], Ravitz [69], Warr [70] | 
| Enzyme Commission (EC) numbers | Numerical classification system for enzymes based on the reaction they catalyze. The first three numbers (levels) describe the type of catalytic activity, while the fourth level specifies the substrate. | McDonald and Tipton [71] | 
| Gene Ontology (GO) | Controlled vocabulary that can be used to classify the function of gene products (e.g. proteins) based on biological process, molecular function and site of cellular localization. | Gene Ontology Consortium [72] | 
| InChIKeys | Widely used, practically unique identifiers for chemical compounds, derived by hashing InChI (IUPAC International Chemical Identifier) strings into a fixed-length key (see the RDKit sketch after this table). | Goodman et al. [73] | 
| Orphan enzymatic reactions | Enzyme-catalyzed reactions for which the responsible enzyme (protein sequence or encoding gene) has not yet been identified. | - | 
| Reaction rules | A scheme that describes how reactants are converted into products (see the example after this table). Useful for cheminformatic tasks such as retrosynthetic route planning. | Plehiers et al. [74] | 
| Retrosynthesis | An approach to synthesis route planning that begins with the target chemical product and searches backwards for the best possible synthetic route, arriving at reactants that are inexpensive and easily obtainable. | Klucznik et al. [75]; https://www.elsevier.com/solutions/reaxys/predictive-retrosynthesis | 
| Retrobiosynthesis | An approach to route planning for biochemical synthesis. Biosynthetic routes that arrive at abundant reactants (cellular metabolites) are prioritized in order to maximize biosynthetic yield. | de Souza et al. [28], Mohammadi Peyhani et al. [76], Probst et al. [29] | 
| Structure-activity relationship (SAR) | Relationship that describes how structural properties of molecules relate to their (bio)activities. | Guha [77] | 
| Machine-learning related terms | ||
| Data labels | The target outputs used to train a supervised machine learning model. | - | 
| Data labeling / annotation | Generation of data labels for sample data that are otherwise unlabeled. Labeling / annotation can be performed manually, semi-automatically or automatically. | - | 
| Dimensionality (of features) | The number of features. | - | 
| Features | Input preprocessed from sample data (often into numerical or binary values) to be fed directly into a machine learning model in order to generate an output value. A feature (singular) is a single numerical / binary value from this input set. | - | 
| Machine learning | The use of data and algorithms that learn iteratively to improve the accuracy of predictions or to make decisions that give the best outcome. | Greener et al. [78] | 
| Model | An algorithm that can recognize patterns, make predictions or make decisions based on given input. | - | 
| Model validation | Process of using the model to predict the output of samples outside of the training dataset to evaluate the predictive performance of a model. | - | 
| Neural network | A type of supervised machine learning algorithm that comprises an input layer, one or more hidden layers and an output layer. Each layer consists of nodes that are connected to every node in adjacent layers via edges. Data are fed into the network via the input layer and processed as they propagate through the hidden layer(s) towards the output layer to give an output (often a prediction). Each edge is associated with a weight (parameter) that transforms the data passing from one node to another and can be adjusted during learning to improve the accuracy of predictions (see the forward-pass sketch after this table). Neural networks are also known as artificial neural networks (ANNs). | Greener et al. [78]; https://playground.tensorflow.org | 
| Reinforcement learning | A machine learning method in which a model improves iteratively by taking actions that maximize a reward signal. | [79] | 
| One-hot encoding | A way of converting one column of categorical features into multiple binary columns (see the example after this table). | Greener et al. [78]; https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html | 
| Overfitting | A phenomenon where a model learns irrelevant information from the training dataset resulting in the degradation of predictive performance on never-before-seen data. This can be caused by having a model that has too many parameters (too complex) or not having enough sample data to train the model. | Chicco [80], Greener et al. [78] | 
| Parameters | Variables within the model that govern how input data are transformed into the output. Machine learning models self-adjust these parameters during training to minimize the error between outputs and labels. | - | 
| Self-supervised (machine learning) | A subset of unsupervised machine learning algorithms that are able to take on tasks which are traditionally tackled by supervised machine learning, without using data labels. | Spathis et al. [81] | 
| Sparsity (of features) | The number (or proportion) of features with zero values. | - | 
| Supervised (machine learning) | A machine learning approach that uses a labeled dataset to train a model to predict outcomes. | - | 
| The curse of dimensionality | The relationship between the predictive performance of machine learning models and the dimensionality of input features. Performance first improves with increased dimensionality but starts to degrade past a certain point if the number of training samples remains the same. This is attributed to the exponential increase in training data needed to prevent overfitting as dimensions increase. | https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality | 
| Training dataset and testing dataset | Subsets of the sample data generated by a train/test split: the training dataset is used to train the model, while the testing dataset is used to evaluate its predictive performance (see the example after this table). | - | 
| Transfer learning | An approach in which knowledge learned for one task is applied to a different but related task to improve sample efficiency. This is often achieved by further training an already pre-trained model instead of building a model from scratch. | Cai et al. [82] | 
| Unsupervised (machine learning) | A machine learning approach that uses algorithms to infer patterns from a dataset without the use of data labels. | - |
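To make the InChIKey entry concrete, the following is a minimal sketch of deriving an InChI and InChIKey from a SMILES string with RDKit, assuming an RDKit build compiled with InChI support; the example molecule is arbitrary.

```python
# Minimal sketch: derive an InChI and InChIKey from a SMILES string with
# RDKit (requires an RDKit build that includes InChI support).
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

inchi = Chem.MolToInchi(mol)        # full, human-readable InChI string
inchikey = Chem.MolToInchiKey(mol)  # fixed-length 27-character hashed key

print(inchi)
print(inchikey)  # BSYNRYMUTXBXSQ-UHFFFAOYSA-N
```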
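A reaction rule can be written as reaction SMARTS and applied with RDKit. The sketch below uses a hand-written ester-hydrolysis rule purely for illustration; rule sets used in practice (see Plehiers et al. [74]) are typically extracted from reaction databases.

```python
# Minimal sketch: a reaction rule expressed as reaction SMARTS and applied
# with RDKit. The hydrolysis rule here is an illustrative toy example.
from rdkit import Chem
from rdkit.Chem import AllChem

# Rule: ester -> carboxylic acid + alcohol
rule = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[O:3][C:4]>>[C:1](=[O:2])[OH].[O:3][C:4]"
)

ester = Chem.MolFromSmiles("CC(=O)OCC")  # ethyl acetate
for products in rule.RunReactants((ester,)):
    for p in products:
        Chem.SanitizeMol(p)
    print(".".join(Chem.MolToSmiles(p) for p in products))  # CC(=O)O.CCO
```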
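The forward pass described in the neural network entry can be sketched in a few lines of NumPy; the layer sizes, random weights and ReLU activation below are illustrative placeholders, not a trained model.

```python
# Minimal sketch: forward pass of a small neural network in NumPy,
# illustrating layers, nodes and weights (parameters).
import numpy as np

rng = np.random.default_rng(0)

# 4 input features -> hidden layer of 8 nodes -> 2 output values
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)  # input-to-hidden weights
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)  # hidden-to-output weights

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU activation
    return h @ W2 + b2                # output layer (e.g. class scores)

x = rng.normal(size=4)  # one sample with 4 input features
print(forward(x))

# During training, W1, b1, W2 and b2 would be adjusted (e.g. by gradient
# descent) to minimize the error between outputs and data labels.
```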
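One-hot encoding can be illustrated with the scikit-learn class linked in the table; the substrate names below are arbitrary example categories.

```python
# Minimal sketch: one-hot encoding a single categorical column with
# scikit-learn's OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder

substrates = [["glucose"], ["fructose"], ["glucose"], ["xylose"]]

enc = OneHotEncoder()
X = enc.fit_transform(substrates).toarray()  # densify for readability

print(enc.categories_)  # [array(['fructose', 'glucose', 'xylose'], ...)]
print(X)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```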
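Finally, a minimal sketch of a train/test split followed by model validation with scikit-learn; the synthetic dataset and logistic regression model are placeholders chosen only to keep the example self-contained.

```python
# Minimal sketch: train/test split, training and validation with
# scikit-learn on a synthetic classification dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 20% of the samples as the testing dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))  # evaluated on unseen data
```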