
Table 2. Glossary of statistical/machine learning terms used in this paper

Feature/attribute/predictor: A numerical (e.g. drawn from a subset of the real numbers) or categorical (i.e. taking one of a finite number of discrete values) value used as input to a learning algorithm
Outcome/response/label: A numerical or categorical value to predict from features
Labelled data: A set of observations for which both the features and the label are available
Training set: A collection of data used to train a learning algorithm
Test set: A collection of labelled data, not used during training, on which the accuracy of a trained model is evaluated
Supervised learning: Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label)
Unsupervised learning: Learning techniques that group observations without a pre-specified dependent attribute. Clustering algorithms are usually unsupervised
Model: A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most learning algorithms generate models that can then be used in a decision-making process
Accuracy: The rate of correct predictions made by the model over a dataset. Accuracy is usually estimated by using an independent test set that was not used at any time during the learning process. More complex accuracy estimation techniques, such as cross-validation and the bootstrap, are commonly used, especially with datasets containing a small number of observations
High-dimensional problem: Problems in which the number of features p is much larger than the number of observations N, often written p ≫ N. Such problems have become of increasing importance, especially in genomics and other areas of computational biology
Overfitting: A modelling error that occurs when the model is fit too closely to a limited set of data points. Because the data being studied often contain some degree of error or random noise, an overfitted model predicts new cases poorly
Multicollinearity: Correlation between features, i.e. the situation in which, if the value of one feature changes, the values of the other features also change to some degree. When there is multicollinearity between the variables in a regression model, the model's coefficients can become poorly determined and exhibit high variance
K-fold cross-validation: A method for estimating the accuracy (or error) of a learning algorithm by dividing the data into K mutually exclusive subsets (the ‘folds’) of approximately equal size. K models are trained and tested: each model is trained on the data set minus one fold and tested on that held-out fold. The accuracy estimate is the average accuracy over the K folds (see the code sketch below this glossary)
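
To make the ‘Accuracy’ and ‘K-fold cross-validation’ entries concrete, the following is a minimal sketch of the procedure described above: the data are split into K folds, K models are trained, each is tested on its held-out fold, and the fold accuracies are averaged. The synthetic data, the simple nearest-centroid classifier and all function names are illustrative assumptions introduced here, not material from the paper.

```python
# Minimal, self-contained sketch of K-fold cross-validation (illustrative only;
# the classifier and data below are assumptions, not taken from the paper).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labelled data: 100 observations, 5 features, binary label.
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)


def nearest_centroid_fit(X_train, y_train):
    """Training step: compute one centroid (mean feature vector) per class."""
    return {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}


def nearest_centroid_predict(centroids, X_test):
    """Prediction step: assign each observation to the closest class centroid."""
    classes = np.array(sorted(centroids))
    dists = np.stack([np.linalg.norm(X_test - centroids[c], axis=1) for c in classes])
    return classes[dists.argmin(axis=0)]


def k_fold_accuracy(X, y, K=5):
    """Estimate accuracy by K-fold cross-validation: split the data into K folds,
    train on K-1 folds, test on the held-out fold, and average the K accuracies."""
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, K)
    accuracies = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = nearest_centroid_fit(X[train_idx], y[train_idx])
        predictions = nearest_centroid_predict(model, X[test_idx])
        accuracies.append((predictions == y[test_idx]).mean())
    return float(np.mean(accuracies))


print(f"5-fold cross-validated accuracy: {k_fold_accuracy(X, y, K=5):.2f}")
```

In practice one would typically rely on an established implementation of the splitting logic (e.g. KFold or cross_val_score in scikit-learn) rather than the hand-rolled version above, which is written out only to mirror the definition in the glossary.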