TABLE 3.
Pros and cons of ML algorithms and applicability within the field of metabolomics.
| Algorithm | Pros | Cons | Metabolomic application |
| --- | --- | --- | --- |
| Linear Regression | - Excellent for linearly separable data<br>- Easy implementation | - Assumes a linear relationship between dependent and independent variables<br>- Outliers have a significant impact<br>- Prone to overfitting | - Linear relationship between dependent and independent variables<br>- Forecasting tasks |
| Logistic Regression | - Simple implementation<br>- No feature scaling needed<br>- No hyper-parameter tuning needed | - Easily outperformed by other algorithms<br>- Heavily reliant on proper identification of data | - Binary classification, i.e., when the output class has only two possible outcomes, e.g., cancer detection (yes or no)<br>- Linear relationship between dependent and independent variables |
| Naive Bayes | - Fast predictions of dataset classes<br>- Good for datasets with categorical variables | - Assumes all features are independent | - Datasets with highly independent features<br>- Multi-class predictions |
| Support Vector Machines (SVMs) | - Works well for data that can be easily separated with a clear margin of separation<br>- Effective for high-dimensional data | - Requires more training time for large datasets<br>- Does not perform well when the dataset has a high level of noise, i.e., overlapping target classes | - Medium-sized datasets<br>- Large number of features<br>- Linear relationship between dependent and independent variables |
| k-Nearest Neighbors (k-NN) | - Easy implementation<br>- Can solve multi-class problems<br>- No data assumptions needed | - Slow performance on large datasets<br>- Data scaling required<br>- Not for data with high dimensionality, i.e., a large number of features<br>- Sensitive to missing values, outliers, and imbalanced data | - Small datasets with a small number of features<br>- Unknown relationship between dependent and independent variables<br>- Useful for targeted metabolomics approaches |
| Decision Trees | - Scaling or normalization of data not needed<br>- Able to handle missing values<br>- Easy to visualize<br>- Automatic feature selection | - Data sensitive<br>- Might need more time to train trees<br>- Known to suffer from a high chance of overfitting | |
| Random Forest (RF) | - Good performance on imbalanced or missing data<br>- Able to handle huge amounts of data<br>- Feature importance extraction<br>- Low chance of overfitting | - Predictions are uncorrelated<br>- Influence of the independent variables on the dependent variable is unknown, i.e., a black box<br>- Data sensitive | - Identification of variables with high importance<br>- Useful for datasets with a small sample population<br>- Useful for metabolic fingerprinting approaches |
| Neural Networks (NN) | - Flexible network architecture, i.e., can be used for regression and classification<br>- Good with nonlinear data<br>- Can handle a large number of inputs<br>- Fast predictions once trained | - Influence of the independent variables on the dependent variable is unknown, i.e., a black box<br>- Highly dependent on training data<br>- Prone to overfitting and poor generalization<br>- Extremely hardware dependent, i.e., the larger the dataset, the more expensive and time-consuming the modeling process | - Data with a non-linear relationship between dependent and independent variables<br>- Large datasets, with a stipulation on time and cost<br>- Can be applied to raw metabolomic data for feature extraction and multivariate classification combined into a single model<br>- Integration of multi-omics data, i.e., data collected over different times, multiple analytical platforms, biofluids, or omic platforms<br>- Useful for metabolic profiling |
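As a minimal sketch of how several of the tabled algorithms might be compared in practice, the snippet below fits logistic regression, naive Bayes, an SVM, k-NN, and a random forest on a synthetic binary-classification dataset standing in for a metabolomics feature matrix (samples × metabolite intensities). This assumes scikit-learn; the sample counts, feature counts, and default model settings are illustrative choices, not drawn from the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 200 hypothetical samples x 50 "metabolite" features, two outcome
# classes (e.g., case vs. control); purely synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # Per the table, SVM and k-NN require data scaling, so they are
    # wrapped in a StandardScaler pipeline; the tree-based random
    # forest needs no scaling or normalization.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")
```

In a real study the held-out accuracies would be replaced by cross-validated scores, and the trade-offs in the table (training time, scaling needs, interpretability) would guide which of these models to pursue.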