bagging |
acronym for bootstrap aggregating;
ensemble technique in which models are fitted to bootstrapped samples
of the data and their predictions are then averaged |
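As an illustrative sketch (toy data and settings chosen for this example, not taken from the source): the same model class is fitted to several bootstrap samples and the predictions are averaged, which reduces the variance of a single noisy fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=50)  # noisy toy target

# Fit the same model class on several bootstrap samples ...
preds = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))   # bootstrap indices (with replacement)
    coeffs = np.polyfit(x[idx], y[idx], deg=5)   # one individual model
    preds.append(np.polyval(coeffs, x))

# ... and average their predictions to reduce variance.
bagged = np.mean(preds, axis=0)
```

The number of bootstrap models (here 25) and the model class (a degree-5 polynomial) are arbitrary choices for illustration.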
bias |
error that remains even for an infinite
number of training examples, e.g., due to the limited expressivity of the model |
boosting |
ensemble technique in which
weak learners are iteratively combined to build a stronger learner |
bootstrapping |
estimating statistics by repeatedly
drawing random samples with replacement from the data |
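A minimal sketch (toy data, arbitrary sample counts): the statistic of interest, here the mean, is recomputed on many resampled data sets, and the spread of these estimates quantifies its uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # toy data set

# Draw 1000 bootstrap samples (same size as the data, with replacement)
# and compute the statistic of interest (here: the mean) on each.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(1000)
])

# The standard deviation of the bootstrap means estimates
# the uncertainty of the sample mean.
uncertainty = boot_means.std()
```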
classification |
process of assigning examples
to a particular class |
confidence interval |
interval of confidence around the
predicted mean response |
feature |
vector with a numeric encoding
of a description of a material that the ML model uses for learning |
fidelity |
measure of how closely a model
represents the real system |
fitting |
estimating the parameters of
a model with high accuracy |
gradient descent |
optimization by following the
negative gradient of the objective function; stochastic gradient descent
approximates the gradient using a mini-batch of the available data |
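A minimal sketch of mini-batch stochastic gradient descent on a least-squares problem (toy data; the learning rate, batch size, and epoch count are illustrative values, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)  # toy linear data

w = np.zeros(3)            # model parameters
lr, batch_size = 0.1, 32   # hyperparameters (illustrative values)

for epoch in range(50):
    idx = rng.permutation(len(X))           # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # Gradient of the mean-squared error, estimated on the mini-batch only
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad                      # step against the gradient
```

After training, `w` should be close to `true_w`.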
hyperparameters |
tuning parameters of the
learner (such as the learning rate or regularization strength) that, in contrast
to model parameters, are not learned during training and must be
specified before training |
instance-based learning |
learning by heart; query
data are compared to training examples to make a prediction |
irreducible error |
error that cannot be reduced
(e.g., due to noise in the data), i.e., that persists even for a perfect
model; also known as the Bayes error rate |
label (target) |
the property one wants to
predict |
objective function (cost function) |
the function that an ML algorithm
tries to minimize |
one-hot encoding |
method to represent categorical
variables by creating a feature column for each category and using
a value of one to encode presence and zero to encode absence |
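A minimal sketch without external libraries (the category names are made up for illustration):

```python
# Hypothetical categories of a material property, for illustration only.
categories = ["metal", "insulator", "semiconductor"]

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 marking the matching category."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("insulator", categories))  # -> [0, 1, 0]
```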
overfitting |
the gap between training
and test error is large, i.e., the model merely “remembers”
the training data but fails to predict on unseen examples |
predicting |
making predictions for future
samples with high accuracy |
prediction interval |
interval of confidence around the
predicted sample response; always wider than the confidence interval |
regression |
process of estimating the
continuous relationship between a dependent variable and one or more
independent variables |
regularization |
describes techniques that
add terms or information to the model to avoid overfitting |
stratification |
data is divided into homogeneous
subgroups (strata) such that sampling does not disturb the class distributions |
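A minimal sketch of a stratified train/test split using only the standard library (the class labels and split fraction are toy choices; `stratified_split` is a hypothetical helper written for this example):

```python
from collections import defaultdict
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices so each class keeps (roughly) its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():   # sample within each stratum separately
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

# Toy labels with an 80/20 class imbalance; both splits preserve this ratio.
labels = ["A"] * 80 + ["B"] * 20
train, test = stratified_split(labels)
```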
structured data |
data that is organized in
tables with rows and columns, e.g., data that resides in relational
databases |
test set |
collection
of labels and
feature vectors that is used for model evaluation and which must not
overlap with the training set |
training set |
collection of labels and
feature vectors that is used for training |
transfer |
use knowledge gained on
one distribution to perform inference on another distribution |
unstructured data |
e.g., images, video, audio,
or text; data that is not organized in tabular form |
validation set |
also known as development
set, collection of labels and feature vectors that is used for hyperparameter
tuning and which must not overlap with the test and training sets |
variance |
part of the error that is
due to finite-size effects (e.g., fluctuations due to the random split
into training and test sets) |