Feature selection: Feature selection is a type of dimensionality reduction that involves selecting a subset of features from the original feature set, which can potentially improve a model’s performance. Every feature added to a machine learning model increases the model’s complexity and the risk of overfitting (when the model performs well on training data but fails on new data), thereby complicating inference. Feature selection aims to reduce redundancy while retaining the most relevant features.
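As an illustration, the sketch below uses scikit-learn's SelectKBest with an ANOVA F-test to keep the ten highest-scoring features; the synthetic data, the score function, and k = 10 are assumptions chosen for the example rather than recommendations.

```python
# Minimal feature-selection sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 200 samples, 50 features, only 5 of them informative.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# Keep the 10 features with the highest ANOVA F-scores against the labels.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (200, 50) -> (200, 10)
```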
Training: Training a model involves passing the processed data to a machine learning algorithm to learn general rules and patterns in the data. Usually, the goal is to optimize the model parameters such that the model generalizes (performs well on unseen test data) while maintaining accuracy.
Supervised learning: Supervised learning is a group of machine learning techniques that use labeled data in the form of prior knowledge (gold standard) as input to train the model. The model learns patterns that characterize samples with known labels, and these patterns can then be used to predict the labels of new data. Regression (continuous value prediction) and classification (discrete value prediction) are two types of supervised learning.
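For instance, a minimal supervised classification sketch, assuming scikit-learn: a logistic regression model is trained on labeled samples and then predicts labels for held-out data (the synthetic dataset and the choice of model are illustrative assumptions).

```python
# Supervised learning sketch: train on labeled data, predict on unseen data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hold out 30% of the samples as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)      # learn patterns from labeled samples
y_pred = model.predict(X_test)   # predict labels of new data
print("test accuracy:", model.score(X_test, y_test))
```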
Unsupervised learning: Unsupervised learning is a branch of machine learning that involves training a model on unlabeled data (input without a labeled output) by discovering structure inherent in the data. Clustering is a popular example of unsupervised learning.
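A minimal clustering sketch, assuming scikit-learn: k-means groups unlabeled samples into clusters based only on the structure of the features (the synthetic data and the number of clusters are assumptions of the example).

```python
# Unsupervised learning sketch: k-means clustering of unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with three underlying groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)  # cluster index per sample; no labels used
print(cluster_labels[:10])
```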
Performance metrics: Performance metrics are used to assess how well machine learning/deep learning models perform. Some of the common metrics are listed below; a code sketch computing several of them follows the list:
a. Confusion matrix—A tabulation of predictions against ground truth. False positives (FP) are the number of negative samples wrongly predicted as positive; false negatives (FN) are the number of positive samples wrongly predicted as negative. Accurate predictions are true positives (TP: number of truly positive samples correctly predicted) and true negatives (TN: number of truly negative samples correctly predicted).
b. Accuracy (ACC)—This is mostly used for classification tasks. It gives the proportion of correctly predicted samples among all samples. It ranges between 0 and 1, where 1 means all samples are correctly predicted and 0 means none are.
c. Area under the curve (AUC)—Also used in classification tasks, typically referring to the area under the receiver operating characteristic (ROC) curve. It tells us how well the model can differentiate among classes at various thresholds. Higher AUCs correspond to models that can better distinguish between disease (usually class 1) and healthy (usually class 0) patients. The values range from 0 to 1 and are usually compared with random guessing (AUC of 0.5).
d. Mean squared error (MSE)—It is mostly used for regression tasks. It measures the average of the squared differences between the predicted values and the respective ground truth values. Intuitively, it quantifies the typical squared size of the residuals.
e. Mean absolute error (MAE)—Widely used in regression tasks, it measures the average absolute difference between the predicted values and the ground truth values.
f. Purity—This metric is used in clustering (unsupervised learning) approaches. It measures the extent to which each cluster contains samples from a single class.
g. F1 score (F1)—The harmonic mean of precision and recall. Values range between 0 and 1, with 1 indicating perfect precision and recall; predictive models aim for F1 scores close to 1.
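To make these definitions concrete, the sketch below computes several of the listed metrics with scikit-learn on toy predictions; the example arrays are assumptions chosen purely for illustration.

```python
# Metric computation sketch (assumes scikit-learn; toy values for illustration).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             roc_auc_score)

# Classification: ground-truth labels, hard predictions, and predicted scores.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("ACC:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
print("F1 :", f1_score(y_true, y_pred))

# Regression: ground-truth values and predictions.
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
```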
Model evaluation: Model evaluation involves assessing the generalizability of the model, that is, determining whether the trained model will generalize to unseen data. A popular technique to evaluate models is k-fold cross-validation, sketched below. Cross-validation splits the training data into k distinct folds; the model is trained on k-1 folds and evaluated on the fold not used for training. The procedure is repeated k times, ensuring that each fold is used as a test set exactly once.
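A minimal k-fold cross-validation sketch, assuming scikit-learn: the data are split into five folds, and the model is trained and evaluated five times so that each fold serves as the test set exactly once (the dataset, the model, and k = 5 are illustrative assumptions).

```python
# k-fold cross-validation sketch (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# k = 5: each fold is used as the test set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```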