Feature/attribute/predictor | A numerical (e.g. a value from a subset of the real numbers) or categorical (i.e. one of a finite number of discrete values) value used as input to a learning algorithm
Outcome/response/label | A numerical or categorical value to be predicted from the features
Labelled data | Observations for which both the features and the corresponding label are known
Training set | A collection of data used to train a learning algorithm
Test set | A collection of labelled data, held out from training, that is used to evaluate the predictions of a trained model
Supervised learning | Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label)
Unsupervised learning | Learning techniques that group observations without a pre-specified dependent attribute. Clustering algorithms are usually unsupervised
Model | A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most learning algorithms generate models that can then be used in a decision-making process
Accuracy | The rate of correct predictions made by a model over a dataset. Accuracy is usually estimated on an independent test set that was not used at any time during the learning process. More complex accuracy estimation techniques, such as cross-validation and the bootstrap, are commonly used, especially with datasets containing a small number of observations
High-dimensional problem | Problems in which the number of features p is much larger than the number of observations N, often written p ≫ N. Such problems have become increasingly important, especially in genomics and other areas of computational biology
Overfitting | A modelling error that occurs when a model fits a limited set of data points too closely. Because the data being studied often contain some degree of error or random noise, an overfitted model captures that noise and therefore predicts new cases poorly
Multicollinearity | Correlation between features, i.e. the situation in which, when the value of one feature changes, the values of other features also change to some degree. When there is multicollinearity between the variables in a regression model, its coefficients can become poorly determined and exhibit high variance
K-fold cross-validation | A method for estimating the accuracy (or error) of a learning algorithm by dividing the data into K mutually exclusive subsets (the ‘folds’) of approximately equal size. K models are trained and tested: each model is trained on the data minus one fold and tested on that held-out fold. The accuracy estimate is the average accuracy over the K folds (see the sketch below)
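To make the K-fold cross-validation entry concrete, the following is a minimal, self-contained Python sketch of the procedure described above. It is illustrative only: the nearest-centroid classifier, the synthetic two-cluster data and all function names are hypothetical choices made for this example, not part of the original text; any learning algorithm could be substituted for the toy learner.

```python
import numpy as np

def k_fold_accuracy(X, y, train_fn, predict_fn, k=5, seed=0):
    """Estimate accuracy by K-fold cross-validation: split the data into K
    mutually exclusive folds, train on K-1 folds, test on the held-out fold,
    and average the K fold accuracies."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))          # shuffle observation indices
    folds = np.array_split(indices, k)         # K folds of roughly equal size
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        predictions = predict_fn(model, X[test_idx])
        accuracies.append(np.mean(predictions == y[test_idx]))
    return float(np.mean(accuracies))

# Toy learner: a nearest-centroid classifier (for illustration only).
def train_centroids(X, y):
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_centroids(model, X):
    labels = np.array(list(model.keys()))
    centroids = np.stack(list(model.values()))
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(distances, axis=1)]

# Synthetic labelled data: two Gaussian clusters with 50 observations each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(k_fold_accuracy(X, y, train_centroids, predict_centroids, k=5))
```

With K = 5, each model is trained on roughly 80% of the observations and tested on the remaining 20%, and the printed value is the average accuracy over the five held-out folds.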