Table 1.
Diagnostic accuracy and triage accuracy of GPT-4 and physicians.
Accuracy | GPT-4, n (%; 95% CI^a) | Consensus of 3 physicians, n (%; 95% CI) | P value^b
Diagnosis | | |
Overall (n=45) | 44 (97.8; 88.2-99.9) | 41 (91.1; 79-98) | .38
Self-care (n=15) | 15 (100; 78.2-100) | 14 (93.3; 68.1-99.8) | .99
Nonemergent care (n=15) | 15 (100; 78.2-100) | 15 (100; 78.2-100) | .99
Emergent care (n=15) | 14 (93.3; 68.1-99.8) | 12 (80.0; 51.9-95.7) | .13
Triage | | |
Overall (n=45) | 30 (66.7; 51.0-80.0) | 30 (66.7; 51.0-80.0) | .99
Self-care (n=15) | 2 (13.3; 1.7-40.5) | 6 (40.0; 16.3-67.7) | .22
Nonemergent care (n=15) | 15 (100; 78.2-100) | 11 (73.3; 44.9-92.2) | .13
Emergent care (n=15) | 13 (86.7; 59.5-98.3) | 13 (86.7; 59.5-98.3) | .99
^a CIs were calculated using the Clopper-Pearson method and are reported as percentages.
^b The performance of GPT-4 was compared with that of the physicians using the McNemar test.
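
The two methods named in the footnotes can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names (`binom_cdf`, `clopper_pearson`, `mcnemar_exact`) are hypothetical, and the use of the exact (binomial) form of the McNemar test is an assumption, since the footnote does not state which variant was used. The worked example uses the nonemergent triage row: GPT-4 was correct in all 15 cases and the physicians in 11, so the discordant pairs must be 4 (GPT-4 correct, physicians incorrect) and 0 (the reverse).

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that increases from negative to positive on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided CI for a proportion k/n, returned as fractions in [0, 1]."""
    # Lower bound: the p at which P(X >= k) equals alpha/2 (0 when k == 0).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha/2 (1 when k == n).
    upper = 1.0 if k == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the two discordant-pair counts.

    b: pairs where only the first rater was correct.
    c: pairs where only the second rater was correct.
    Assumption: exact binomial variant, Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Nonemergent triage row: GPT-4 15/15 correct, physicians 11/15 correct.
low, high = clopper_pearson(15, 15)
print(f"{100 * low:.1f}-{100 * high:.0f}")  # 78.2-100
print(mcnemar_exact(4, 0))                  # 0.125
```

Under these assumptions, the interval matches the 78.2-100 reported for 15 of 15 correct, and the P value of 0.125 is consistent with the .13 reported for nonemergent triage. Rows in which both raters made errors cannot be checked this way, because the table does not report the discordant-pair counts.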