Table 2.
Comparative analysis of data quality improvement strategies in healthcare.
| Study | Methods used | Accuracy/AUC | Completeness | Reproducibility/strategy |
|---|---|---|---|---|
| This study | KNN imputation, anomaly detection (IF, IQR, LOF), PCA, RF | Accuracy 75.3%, AUC 0.83 | ~100% | MLflow tracking, reproducible pipeline |
| Chen et al. (2021) | Transfer learning, data quality evaluation, CNN/RNN | ~+11% accuracy improvement after cleaning | Implicitly improved | Structured pipeline with medical concept normalization |
| Kale and Pandey (2024) | KNN, SMOTE, clustering-based anomaly detection, multiple classifiers | Accuracy 70% → 91%, F1-score 0.65 → 0.89 | Substantial post-imputation | Clear before/after metric reporting |
| Emery et al. (2024) | Imputation: MICE, hot-deck, log-linear, MICT-timing | 65–74% (simulated data) | Trajectory coverage improved | Reproducible in R, used on longitudinal data |
| Azimi and Pahl (2024) | Anomaly analytics, ML quality metadata exploration | Not quantified; improved ML output reliability | Qualitative evaluation | Anomaly visualization and root cause strategy |
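
For readers unfamiliar with the IQR rule listed among the anomaly detection methods in the first row, the following is a minimal sketch of how such a filter is commonly written. It is an illustration, not the implementation used in any of the studies above: the function name `iqr_outlier_mask`, the fence multiplier `k = 1.5` (the conventional Tukey value), and the sample readings are all assumptions, since the table does not report a threshold or data.

```python
from statistics import quantiles


def iqr_outlier_mask(values, k=1.5):
    """Mark values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    `k` is an assumed multiplier (1.5 is the conventional choice);
    the studies in Table 2 do not state which value they used.
    """
    # Quartiles with linear interpolation over the observed range.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical readings, not data from any study in Table 2.
readings = [72, 75, 71, 74, 73, 76, 180, 70]
mask = iqr_outlier_mask(readings)
flagged = [v for v, is_outlier in zip(readings, mask) if is_outlier]
print(flagged)
```

The IQR rule treats each feature independently, which is presumably why it appears in the table alongside multivariate detectors such as IF and LOF.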