Table 2.
Performance metrics for machine learning models to predict drinking water quality
| Performance metric | Purpose | Definition | Limitation | Score range | Papers reported |
|---|---|---|---|---|---|
| **Classification** | | | | | |
| Accuracy | Determines proportion of total correct classifications | (TP + TN) / (TP + TN + FP + FN); ranges between 0 and 1 | Gives an overoptimistic estimate of the classifier's ability on the majority class | Multi-class: 0.367–0.826; binomial: 0.67–0.94 | [4•, 5, 7, 8••, 11, 12•, 72–74, 77–84] |
| Sensitivity | Determines model's ability to recall true positives | TP / (TP + FN); ranges between 0 and 1 | Sensitive to the classification threshold; a lower threshold yields higher sensitivity | 0.07–0.84 | [4•, 5–7, 8••, 12•, 73, 74, 77–79, 81–84] |
| Specificity | Determines model's ability to correctly classify true negatives | TN / (TN + FP); ranges between 0 and 1 | Sensitive to the classification threshold; a higher threshold yields higher specificity | 0.43–0.98 | [4•, 5, 7, 8••, 12•, 22, 23, 27, 28, 30–33] |
| Area under the receiver operating characteristic curve (AUC-ROC), or C-statistic | Determines probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example | Area under the curve of true positive rate vs. false positive rate across classification thresholds; ranges between 0 and 1 | Applies only to binary classification problems | 0.72–0.92 | [4•, 7, 8••, 9•, 10•, 12•, 73, 74, 78, 79, 81] |
| Matthews correlation coefficient (MCC) | Measures association between observed and predicted values | (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)); ranges between −1 and 1 | Applies to only one classification threshold | 0.31–0.72 | [80] |
| F1 score | Finds the balance between precision and recall | 2TP / (2TP + FP + FN); ranges between 0 and 1 | Applies to only one classification threshold | 0.46–0.74 | [79, 80] |
| Cohen's kappa statistic | Determines how well machine learning classifier predictions match observations | (p0 − pe) / (1 − pe); ranges between −1 and 1 | Not easy to interpret | 0.46–0.62 | [7, 8••] |
| **Regression** | | | | | |
| Coefficient of determination (R²) | Determines proportion of variance explained by the predictors | Ranges between 0 and 1 | Increases with the number of predictors | 0.12–0.85 | [4•, 5, 6, 16, 71•, 82, 83, 85••, 86, 87] |
| Mean square error (MSE) | Measures how spread out the data are around the line of best fit | Ranges from 0 to ∞ | Depends on the scale of the response variable | 0.05–5.18 | [16, 76, 80, 83, 85••] |
| Mean absolute error (MAE) | Measures average error between paired observations and predictions | Ranges from 0 to ∞ | Does not penalize large prediction errors | 0.13–3.06 | [80] |
TP, true positives; TN, true negatives; FP, false positives; FN, false negatives; p0, observed overall accuracy of the model; pe, expected agreement between the model predictions and the actual class values by chance
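As an illustration, the classification metrics in Table 2 can all be computed from the four confusion-matrix counts. The sketch below uses standard definitions of these metrics; the example counts are hypothetical, not drawn from any of the cited studies.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Table 2's classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)      # recall of the positive class
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: p0 = observed accuracy, pe = chance agreement
    p0 = accuracy
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (p0 - pe) / (1 - pe)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }

# Hypothetical confusion matrix: 80 TP, 90 TN, 10 FP, 20 FN
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```

Note that accuracy, F1, MCC, and kappa all apply at a single classification threshold, whereas AUC-ROC (omitted here) integrates over all thresholds.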
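The regression metrics in Table 2 can likewise be computed directly from paired observations and predictions. A minimal sketch using standard definitions (the sample values are hypothetical):

```python
def regression_metrics(y_true, y_pred):
    """Compute R-squared, MSE, and MAE for observed vs. predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot   # proportion of variance explained
    mse = ss_res / n           # scale-dependent, in squared response units
    mae = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n
    return {"r2": r2, "mse": mse, "mae": mae}

# Hypothetical observed and predicted water-quality values
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(regression_metrics(y_true, y_pred))
```

Because MSE and MAE are in the units of the response variable, they are only comparable across studies that model the same quantity on the same scale, which is the limitation Table 2 notes for MSE.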