Front Genet. 2022 Nov 24;13:1017340. doi: 10.3389/fgene.2022.1017340

TABLE 3.

Pros and cons of ML algorithms and applicability within the field of metabolomics.

Linear Regression
Pros:
- Excellent for linearly separable data
- Easy implementation
Cons:
- Assumes a linear relationship between dependent and independent variables
- Outliers have a significant impact
- Prone to overfitting
Metabolomic application:
- Unknown relationship between dependent and independent variables
- Forecasting tasks
Logistic Regression
Pros:
- Simple implementation
- No feature scaling needed
- No hyper-parameter tuning needed
Cons:
- Easily outperformed by other algorithms
- Heavily reliant on proper identification of data
Metabolomic application:
- Binary classification, i.e., when the output class has only two possible outcomes, e.g., cancer detection (yes or no)
- Linear relationship between dependent and independent variables
Naive Bayes
Pros:
- Fast predictions of dataset classes
- Good for datasets with categorical variables
Cons:
- Assumes all features are independent
Metabolomic application:
- Datasets with highly independent features
- Multi-class predictions
Support Vector Machines (SVMs)
Pros:
- Works well for data that can be easily separated with a clear margin of separation
- Effective for high-dimensional data
Cons:
- Requires more training time for large datasets
- Does not perform well when the dataset has a high level of noise, i.e., overlapping target classes
Metabolomic application:
- Medium-sized datasets
- Large number of features
- Linear relationship between dependent and independent variables
k-Nearest Neighbors (k-NN)
Pros:
- Easy implementation
- Can solve multi-class problems
- No data assumption needed
Cons:
- Slow performance on large datasets
- Data scaling required
- Not suited to high-dimensional data, i.e., a large number of features
- Sensitive to missing values, outliers, and imbalanced data
Metabolomic application:
- Small datasets with a small number of features
- Unknown relationship between dependent and independent variables
- Useful for targeted metabolomics approaches
Decision Trees
Pros:
- Scaling or normalization of data not needed
- Able to handle missing values
- Easy to visualize
- Automatic feature selection
Cons:
- Sensitive to small changes in the data
- Might need more time to train trees
- Known to suffer from a high chance of overfitting
Random Forest (RF)
Pros:
- Good performance on imbalanced or missing data
- Able to handle huge amounts of data
- Feature importance extraction
- Low chance of overfitting
Cons:
- Relies on the individual trees' predictions being uncorrelated
- Influence of the independent variables on the dependent variable is unknown, i.e., a black box
- Data sensitive
Metabolomic application:
- Identification of variables with high importance
- Useful for datasets with small sample populations
- Useful for metabolic fingerprinting approaches
Neural Networks (NN)
Pros:
- Flexible network architecture, i.e., can be used for regression and classification
- Good with nonlinear data
- Can handle a large number of inputs
- Fast predictions once trained
Cons:
- Influence of the independent variables on the dependent variable is unknown, i.e., a black box
- Highly dependent on training data
- Prone to overfitting and poor generalization
- Extremely hardware dependent, i.e., the larger the dataset, the more expensive and time-consuming the modeling process
Metabolomic application:
- Data with a non-linear relationship between dependent and independent variables
- Large datasets where time and cost are a consideration
- Can be applied to raw metabolomic data, combining feature extraction and multivariate classification in a single model
- Integration of multi-omics data, i.e., data collected over different time points, multiple analytical platforms, biofluids, or omics layers
- Useful for metabolic profiling
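
To make the table concrete, the sketches that follow show how each listed algorithm might be applied to a metabolomics-style feature matrix (samples × metabolite intensities). All of them are minimal scikit-learn examples on synthetic data; the sample sizes, feature counts, and variable names are illustrative assumptions, not taken from the article. First, linear regression for a forecasting-style task, predicting a continuous outcome from metabolite features:

```python
# Minimal sketch (assumption: a synthetic 100-sample x 20-metabolite matrix stands in
# for real intensity data). Ordinary least squares predicts a continuous outcome.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                                  # synthetic metabolite intensities
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)   # continuous clinical outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Held-out R^2:", round(model.score(X_test, y_test), 3))
```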
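
Logistic regression for a binary outcome such as case vs. control; the cross-validation fold count and ROC-AUC metric are arbitrary choices for the sketch:

```python
# Minimal sketch: binary classification (e.g., cancer vs. control) from synthetic
# metabolite features; make_classification stands in for a real study matrix.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=120, n_features=30, n_informative=8, random_state=0)
clf = LogisticRegression(max_iter=1000)
print("5-fold ROC AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```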
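
Naive Bayes, here the Gaussian variant suited to continuous intensities (MultinomialNB or CategoricalNB would suit count or categorical features); a three-class problem is assumed purely for illustration:

```python
# Minimal sketch: Gaussian Naive Bayes, which assumes features are conditionally
# independent -- the same assumption flagged in the table.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=25, n_classes=3,
                           n_informative=10, random_state=0)
print("Accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())
```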
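
An SVM for a moderate number of samples with many features; the RBF kernel and C value are illustrative defaults rather than recommendations:

```python
# Minimal sketch: SVM with feature scaling, suited to medium-sized datasets with
# a large number of features.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=500, n_informative=20, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print("Accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```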
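
k-NN on a small, low-dimensional dataset; scaling is included because the method is distance based, and k = 5 is an arbitrary illustrative choice:

```python
# Minimal sketch: k-nearest neighbors with feature scaling on a small synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=80, n_features=10, n_informative=5, random_state=0)
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print("Accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```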
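
A depth-limited decision tree; the max_depth cap is one simple guard against the overfitting noted in the table, and the printed rules illustrate why trees are easy to visualize:

```python
# Minimal sketch: a shallow decision tree with its learned splits printed as text.
# The feature names are hypothetical placeholders.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=120, n_features=15, n_informative=6, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"metabolite_{i}" for i in range(15)]))
```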
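
A random forest with feature-importance extraction, mirroring the "identification of variables with high importance" use case; the number of trees is an illustrative default:

```python
# Minimal sketch: random forest ranking synthetic metabolite features by importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=40, n_informative=8, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("Top 5 feature indices by importance:", top)
```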
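
Finally, a small multilayer perceptron as a stand-in for the neural-network row; scaling plus early stopping are simple guards against the overfitting risk noted above, and the hidden-layer sizes are assumptions for the sketch only:

```python
# Minimal sketch: a small MLP classifier on synthetic metabolite features.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=60, n_informative=15, random_state=0)
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=2000,
                                  early_stopping=True, random_state=0))
print("Accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```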