TABLE 1.
Diagnostic performance for all seven doctors and the Babylon Triage and Diagnostic System (Babylon AI).
Participant | Average recall (%) (95% CI) | Average precision (%) (95% CI) | F1-score (%) (95% CI) | Number of vignettes |
---|---|---|---|---|
Doctor A | 80.9 | 42.9 | 56.1 | 47 |
Doctor B | 64.1 | 36.8 | 46.7 | 78 |
Doctor C | 93.8 | 53.5 | 68.1 | 48 |
Doctor D | 84.3 | 38.1 | 52.5 | 51 |
Doctor E | 90.0 | 33.9 | 49.2 | 70 |
Doctor F | 90.2 | 43.3 | 58.5 | 51 |
Doctor G | 84.3 | 56.5 | 67.7 | 51 |
Doctor average | 83.9 (75.6–92.3) | 43.6 (36.3–50.9) | 57.0 (49.7–64.2) | 56.6 |
Babylon AI | 80.0 | 44.4 | 57.1 | 100 |
The diagnostic performance of the Babylon Triage and Diagnostic System is comparable to that of doctors in terms of recall, precision (positive predictive value), and F1-score (harmonic mean of precision and recall), each measured against the disease modeled by the clinical vignette.
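
As a rough consistency check on the F1-score definition given above, the following sketch recomputes the harmonic mean from the rounded recall and precision values in Table 1. The table does not state whether F1 was derived per vignette or from the averaged figures, so treating it as a function of the averaged, rounded values is an assumption; the names `ROWS`, `f1_score`, and `max_f1_discrepancy` are illustrative and not from the original study.

```python
from statistics import mean

# (recall %, precision %, reported F1 %) copied from Table 1.
ROWS = {
    "Doctor A": (80.9, 42.9, 56.1),
    "Doctor B": (64.1, 36.8, 46.7),
    "Doctor C": (93.8, 53.5, 68.1),
    "Doctor D": (84.3, 38.1, 52.5),
    "Doctor E": (90.0, 33.9, 49.2),
    "Doctor F": (90.2, 43.3, 58.5),
    "Doctor G": (84.3, 56.5, 67.7),
    "Babylon AI": (80.0, 44.4, 57.1),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def max_f1_discrepancy(rows: dict) -> float:
    """Largest absolute gap between recomputed and reported F1 (assumed check)."""
    return max(
        abs(f1_score(precision, recall) - reported)
        for recall, precision, reported in rows.values()
    )


# Mean of the seven per-doctor F1 values, for comparison with the
# "Doctor average" row.
doctor_avg_f1 = mean(
    reported for name, (_, _, reported) in ROWS.items() if name.startswith("Doctor")
)

if __name__ == "__main__":
    for name, (recall, precision, reported) in ROWS.items():
        print(f"{name}: recomputed {f1_score(precision, recall):.2f}, reported {reported}")
    print(f"Largest gap: {max_f1_discrepancy(ROWS):.3f}")
    print(f"Mean of per-doctor F1: {doctor_avg_f1:.2f}")
```

Under this assumption every recomputed value lies within 0.1 percentage points of the reported F1-score, which is compatible with rounding of the inputs. The "Doctor average" F1 of 57.0 appears to be the mean of the seven per-doctor F1-scores rather than the harmonic mean of the averaged recall and precision, which would give roughly 57.4.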