TABLE 3.
Pros and cons of ML algorithms and applicability within the field of metabolomics.
| Algorithm | Pros | Cons | Metabolomic application |
| --- | --- | --- | --- |
| Linear Regression | - Excellent for linearly separable data<br>- Easy implementation | - Assumes a linear relationship between dependent and independent variables<br>- Outliers have a significant impact<br>- Prone to overfitting | - Linear relationship between dependent and independent variables<br>- Forecasting tasks |
| Logistic Regression | - Simple implementation<br>- No feature scaling needed<br>- No hyper-parameter tuning needed | - Easily outperformed by other algorithms<br>- Heavily reliant on proper identification of data | - Binary classification, i.e., when the output class has only two possible outcomes, e.g., cancer detection (yes or no)<br>- Linear relationship between dependent and independent variables |
| Naive Bayes | - Fast predictions of dataset classes<br>- Good for datasets with categorical variables | - Assumes all features are independent | - Datasets with highly independent features<br>- Multi-class predictions |
| Support Vector Machines (SVMs) | - Works well for data that can be easily separated with a clear margin of separation<br>- Effective for high-dimensional data | - Requires more training time for large datasets<br>- Does not perform well when the dataset has a high level of noise, i.e., overlapping target classes | - Medium-sized datasets<br>- Large number of features<br>- Linear relationship between dependent and independent variables |
| k-Nearest Neighbors (k-NN) | - Easy implementation<br>- Can solve multi-class problems<br>- No data assumptions needed | - Slow performance on large datasets<br>- Data scaling required<br>- Not for data with high dimensionality, i.e., a large number of features<br>- Sensitive to missing values, outliers, and imbalanced data | - Small datasets with a small number of features<br>- Unknown relationship between dependent and independent variables<br>- Useful for targeted metabolomics approaches |
| Decision Trees | - Scaling or normalization of data not needed<br>- Able to handle missing values<br>- Easy to visualize<br>- Automatic feature selection | - Data sensitive<br>- Might need more time to train trees<br>- Known to suffer from a high chance of overfitting | |
| Random Forest (RF) | - Good performance on imbalanced or missing data<br>- Able to handle huge amounts of data<br>- Feature importance extraction<br>- Low chance of overfitting | - Predictions are uncorrelated<br>- Influence of the independent variables on the dependent variable is unknown, i.e., a black box<br>- Data sensitive | - Identification of variables with high importance<br>- Useful for datasets with a small sample population<br>- Useful for metabolic fingerprinting approaches |
| Neural Networks (NN) | - Flexible network architecture, i.e., can be used for regression and classification<br>- Good with nonlinear data<br>- Can handle a large number of inputs<br>- Fast predictions once trained | - Influence of the independent variables on the dependent variable is unknown, i.e., a black box<br>- Highly dependent on training data<br>- Prone to overfitting and poor generalization<br>- Extremely hardware dependent, i.e., the larger the dataset, the more expensive and time-consuming the modeling process | - Data with a non-linear relationship between dependent and independent variables<br>- Large datasets, with a stipulation on time and cost<br>- Can be applied to raw metabolomic data for feature extraction and multivariate classification combined into a single model<br>- Integration of multi-omics data, i.e., data collected over different times, multiple analytical platforms, biofluids, or omic platforms<br>- Useful for metabolic profiling |
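As a minimal sketch of how several of the tabled algorithms might be compared in practice, the snippet below fits logistic regression, naive Bayes, an SVM, k-NN, and a random forest on a synthetic binary-classification dataset standing in for a metabolomics feature matrix (samples × metabolite intensities). This assumes scikit-learn; the sample counts, feature counts, and default model settings are illustrative choices, not drawn from the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 200 hypothetical samples x 50 "metabolite" features, two outcome
# classes (e.g., case vs. control); purely synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # Per the table, SVM and k-NN require data scaling, so they are
    # wrapped in a StandardScaler pipeline; the tree-based random
    # forest needs no scaling or normalization.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")
```

In a real study the held-out accuracies would be replaced by cross-validated scores, and the trade-offs in the table (training time, scaling needs, interpretability) would guide which of these models to pursue.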