Table 2.
Accuracy and Completeness Scores for AI-Generated Answers to Medical Questions
| Dataset | Statistic | Accuracy, Descriptive Questions | Accuracy, Binary Questions | Both Question Types, Accuracy | Both Question Types, Completeness | Easy, Accuracy | Easy, Completeness | Medium, Accuracy | Medium, Completeness | Hard, Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Multispecialty, n=180 | Median | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | 3.0 | 5.0 | 3.0 | 5.0 |
| | Mean | 4.3 | 4.5 | 4.4 | 2.4 | 4.6 | 2.6 | 4.3 | 2.4 | 4.2 |
| | SD | 1.7 | 1.7 | 1.7 | 0.7 | 1.7 | 0.7 | 1.7 | 0.7 | 1.8 |
| | IQR | 3.0 | 3.0 | 5.0 | 1.0 | 3.0 | 1.0 | 3.0 | 1.0 | 3.8 |
| Melanoma and Immunotherapy, n=44 | Median | 6.0 | 6.0 | 6.0 | 3.0 | 6.0 | 3.0 | 5.5 | 2.8 | 5.8 |
| | Mean | 5.1 | 5.4 | 5.2 | 2.6 | 5.9 | 3.0 | 4.8 | 2.2 | 5.3 |
| | SD | 1.5 | 1.2 | 1.3 | 0.8 | 0.3 | 0.1 | 1.7 | 0.2 | 1.1 |
| | IQR | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 0.0 | 2.1 | 1.6 | 1.0 |
| Common Conditions, n=60 | Median | 6.0 | 6.0 | 6.0 | 3.0 | 6.0 | 3.0 | 6.0 | 3.0 | 5.8 |
| | Mean | 5.6 | 5.8 | 5.7 | 2.8 | 5.9 | 2.9 | 5.6 | 2.7 | 5.6 |
| | SD | 0.6 | 0.8 | 0.7 | 0.5 | 0.4 | 0.2 | 1.0 | 0.6 | 0.1 |
| | IQR | 0.5 | 0.1 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 |
| All, n=284 | Median | 5.0 | 6.0 | 5.5 | 3.0 | 6.0 | 3.0 | 5.5 | 3.0 | 5.0 |
| | Mean | 4.7 | 4.9 | 4.8 | 2.5 | 5.0 | 2.7 | 4.7 | 2.4 | 4.6 |
| | SD | 1.6 | 1.6 | 1.6 | 0.7 | 1.5 | 0.1 | 1.7 | 0.8 | 1.6 |
| | IQR | 2.6 | 2.0 | 2.0 | 1.0 | 1.0 | 0.5 | 2.6 | 1.0 | 2.0 |
Abbreviations: SD, standard deviation; IQR, interquartile range.

The accuracy scale was a six-point Likert scale (1 – completely incorrect, 2 – more incorrect than correct, 3 – approximately equal correct and incorrect, 4 – more correct than incorrect, 5 – nearly all correct, 6 – correct). The completeness scale was a three-point Likert scale (1 – incomplete, addresses some aspects of the question, but significant parts are missing or incomplete; 2 – adequate, addresses all aspects of the question and provides the minimum amount of information required to be considered complete; 3 – comprehensive, addresses all aspects of the question and provides additional information or context beyond what was expected). Answers that were completely incorrect on the accuracy scale (score of 1) were not graded on completeness.
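As an illustration of how the four summary statistics in the table can be derived from graded answers, the sketch below applies the exclusion rule described above (answers with an accuracy score of 1 receive no completeness score) and then computes the median, mean, SD, and IQR. The score pairs are invented for illustration and are not study data. The use of the sample standard deviation and of inclusive quartiles is an assumption; the source does not state which conventions were used.

```python
import statistics


def summarize(scores):
    """Return median, mean, sample SD, and IQR for a list of Likert scores.

    Assumption: sample SD (n - 1 denominator) and inclusive quartiles;
    the source table does not say which conventions it used.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
        "iqr": q3 - q1,
    }


def completeness_eligible(graded):
    """Keep completeness scores only for answers with accuracy above 1."""
    return [c for a, c in graded if a > 1]


# Hypothetical (accuracy, completeness) pairs, invented for illustration.
# The answer scored 1 for accuracy has no completeness grade.
graded = [(6, 3), (5, 3), (1, None), (4, 2), (6, 3), (2, 1)]

accuracy = [a for a, _ in graded]
completeness = completeness_eligible(graded)

print("accuracy:", summarize(accuracy))
print("completeness:", summarize(completeness))
```

Because of the exclusion rule, the completeness statistics are computed over fewer answers than the accuracy statistics whenever any answer is scored 1 for accuracy.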