Table 2.
Performance metrics for machine learning models to predict drinking water quality
| Performance metric | Purpose | Definition | Limitation | Score range | Papers reported |
|---|---|---|---|---|---|
| **Classification** | | | | | |
| Accuracy | Determines proportion of total correct classifications | (TP + TN) / (TP + TN + FP + FN); ranges between 0 and 1 | Gives an overoptimistic estimate of the classifier's ability on the majority class | Multi-class: 0.367–0.826; binomial: 0.67–0.94 | [4•, 5, 7, 8••, 11, 12•, 72–74, 77–84] |
| Sensitivity | Determines model's ability to recall true positives | TP / (TP + FN); ranges between 0 and 1 | Sensitive to the classification threshold; a lower threshold yields higher sensitivity | 0.07–0.84 | [4•, 5–7, 8••, 12•, 73, 74, 77–79, 81–84] |
| Specificity | Determines model's ability to correctly classify true negatives | TN / (TN + FP); ranges between 0 and 1 | Sensitive to the classification threshold; a higher threshold yields higher specificity | 0.43–0.98 | [4•, 5, 7, 8••, 12•, 22, 23, 27, 28, 30–33] |
| Area under the receiver operating characteristic curve (AUC-ROC), or C-statistic | Determines probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example | Area under the curve of true positive rate vs. false positive rate across classification thresholds; ranges between 0 and 1 | Applies only to binary classification problems | 0.72–0.92 | [4•, 7, 8••, 9•, 10•, 12•, 73, 74, 78, 79, 81] |
| Matthews correlation coefficient (MCC) | Measures association between observed and predicted values | (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)); ranges between −1 and 1 | Applies to only one classification threshold | 0.31–0.72 | [80] |
| F1 score | Finds the balance between precision and recall | 2TP / (2TP + FP + FN); ranges between 0 and 1 | Applies to only one classification threshold | 0.46–0.74 | [79, 80] |
| Cohen's kappa statistic | Determines how well machine learning classifier predictions match observations | (p0 − pe) / (1 − pe); ranges between −1 and 1 | Not easy to interpret | 0.46–0.62 | [7, 8••] |
| **Regression** | | | | | |
| Coefficient of determination (R²) | Determines proportion of variance explained by the predictors | Ranges between 0 and 1 | Increases with the number of predictors | 0.12–0.85 | [4•, 5, 6, 16, 71•, 82, 83, 85••, 86, 87] |
| Mean square error (MSE) | Measures how spread out the data are around the line of best fit | Ranges from 0 to ∞ | Depends on the scale of the response variable | 0.05–5.18 | [16, 76, 80, 83, 85••] |
| Mean absolute error (MAE) | Measures average error between paired observations and predictions | Ranges from 0 to ∞ | Does not penalize large prediction errors | 0.13–3.06 | [80] |
TP, true positives; TN, true negatives; FP, false positives; FN, false negatives; p0, observed overall accuracy of the model; pe, expected agreement between the model predictions and the actual class values by chance
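As an illustration, the classification metrics in Table 2 can all be computed from the four confusion-matrix counts. The sketch below uses standard definitions of these metrics; the example counts are hypothetical, not drawn from any of the cited studies.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Table 2's classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)      # recall of the positive class
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: p0 = observed accuracy, pe = chance agreement
    p0 = accuracy
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (p0 - pe) / (1 - pe)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }

# Hypothetical confusion matrix: 80 TP, 90 TN, 10 FP, 20 FN
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```

Note that accuracy, F1, MCC, and kappa all apply at a single classification threshold, whereas AUC-ROC (omitted here) integrates over all thresholds.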
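The regression metrics in Table 2 can likewise be computed directly from paired observations and predictions. A minimal sketch using standard definitions (the sample values are hypothetical):

```python
def regression_metrics(y_true, y_pred):
    """Compute R-squared, MSE, and MAE for observed vs. predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot   # proportion of variance explained
    mse = ss_res / n           # scale-dependent, in squared response units
    mae = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n
    return {"r2": r2, "mse": mse, "mae": mae}

# Hypothetical observed and predicted water-quality values
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(regression_metrics(y_true, y_pred))
```

Because MSE and MAE are in the units of the response variable, they are only comparable across studies that model the same quantity on the same scale, which is the limitation Table 2 notes for MSE.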