Chem Rev. 2020 Jun 10;120(16):8066–8129. doi: 10.1021/acs.chemrev.0c00004

Table 1. Common Technical Terms Used in ML and Their Meanings.

technical term: explanation

bagging: acronym for bootstrap aggregating; an ensemble technique in which models are fitted on bootstrapped samples of the data and their predictions are averaged
bias: error that remains even for an infinite number of training examples, e.g., due to limited expressivity of the model
boosting: ensemble technique in which weak learners are iteratively combined to build a stronger learner
bootstrapping: calculating statistics by repeatedly drawing random samples with replacement
classification: process of assigning examples to one of a set of discrete classes
confidence interval: interval of confidence around the predicted mean response
feature: vector with a numeric encoding of the description of a material that the ML model uses for learning
fidelity: measure of how closely a model represents the real case
fitting: estimating the parameters of a model with high accuracy
gradient descent: optimization by following the negative gradient; stochastic gradient descent approximates the gradient using a mini-batch of the available data
hyperparameters: tuning parameters of the learner (such as learning rate or regularization strength) which, in contrast to model parameters, are not learned during training and have to be specified before training
instance-based learning: learning "by heart"; query data are compared to the training examples to make a prediction
irreducible error: error that cannot be reduced (e.g., due to noise in the data), i.e., that remains even for a perfect model; also known as the Bayes error rate
label (target): the property one wants to predict
objective function (cost function): the function that an ML algorithm tries to minimize
one-hot encoding: method to represent categorical variables by creating a feature column for each category, using a value of one to encode the presence of that category and zero to encode its absence
overfitting: the gap between training and test error is large, i.e., the model merely "remembers" the training data but fails to predict on unseen examples
predicting: making predictions for future samples with high accuracy
prediction interval: interval of confidence around the predicted sample response; always wider than the confidence interval
regression: process of estimating the continuous relationship between a dependent variable and one or more independent variables
regularization: techniques that add terms or information to the model to avoid overfitting
stratification: data are divided into homogeneous subgroups (strata) such that sampling does not disturb the class distributions
structured data: data that is organized in tables with rows and columns, i.e., data that resides in relational databases
test set: collection of labels and feature vectors used for model evaluation, which must not overlap with the training set
training set: collection of labels and feature vectors used for training
transfer: using knowledge gained on one distribution to perform inference on another distribution
unstructured data: data that is not organized in tabular form, e.g., images, video, audio, or text
validation set (development set): collection of labels and feature vectors used for hyperparameter tuning, which must not overlap with the training and test sets
variance: part of the error that is due to finite-size effects (e.g., fluctuations due to the random split into training and test sets)
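Bootstrapping, the resampling step that underlies bagging, can be illustrated with a minimal Python sketch (the function name and data here are hypothetical examples, not from the review):

```python
import random

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Estimate the sampling distribution of the mean by repeatedly
    drawing samples of the same size with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # draw len(data) points with replacement, i.e., one bootstrap sample
        resample = [rng.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return means

means = bootstrap_means([1.0, 2.0, 3.0, 4.0, 5.0])
```

In bagging, each such bootstrap sample would instead be used to fit one model of the ensemble, and the models' predictions would be averaged.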
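Gradient descent fits in a few lines; the sketch below (with hypothetical names) minimizes a one-dimensional quadratic by repeatedly stepping along the negative gradient, with the learning rate as a typical hyperparameter:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Stochastic gradient descent would replace `grad(x)` with an estimate computed on a random mini-batch of the training data.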
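One-hot encoding is equally compact; a minimal sketch in plain Python (the category labels are made up for illustration):

```python
def one_hot_encode(values):
    """Map each categorical value to a binary vector with a 1 in the
    column of its category and 0 elsewhere."""
    categories = sorted(set(values))  # one feature column per category
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = [
        [1 if index[v] == i else 0 for i in range(len(categories))]
        for v in values
    ]
    return categories, encoded

columns, encoded = one_hot_encode(["Cu", "Zn", "Cu"])
# columns == ["Cu", "Zn"]; encoded == [[1, 0], [0, 1], [1, 0]]
```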
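Stratification can be sketched as splitting the indices of each class separately, so the class distribution carries over into the test set. This is a deterministic toy version under assumed simplifications; practical implementations also shuffle within each stratum:

```python
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25):
    """Split example indices so that each class contributes roughly the
    same fraction of its members to the test set."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for indices in by_class.values():
        n_test = int(len(indices) * test_fraction)  # per-class test share
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# With 8 examples of class "A" and 4 of class "B", the 2:1 ratio is preserved.
train, test = stratified_split(["A"] * 8 + ["B"] * 4)
```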