[Preprint]. 2023 Feb 28:rs.3.rs-2566942. [Version 1] doi: 10.21203/rs.3.rs-2566942/v1

Table 2.

Accuracy and Completeness Scores for AI-Generated Answers to Medical Questions

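Columns are grouped as Accuracy by Question Type (Descriptive, Binary), Both Question Types, and Question Difficulty (Easy, Medium, Hard).

| | Statistic | Descriptive: Accuracy | Binary: Accuracy | Both Question Types: Accuracy | Both Question Types: Completeness | Easy: Accuracy | Easy: Completeness | Medium: Accuracy | Medium: Completeness | Hard: Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Multispecialty, n=180 | Median | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | 3.0 | 5.0 | 3.0 | 5.0 |
| | Mean | 4.3 | 4.5 | 4.4 | 2.4 | 4.6 | 2.6 | 4.3 | 2.4 | 4.2 |
| | SD | 1.7 | 1.7 | 1.7 | 0.7 | 1.7 | 0.7 | 1.7 | 0.7 | 1.8 |
| | IQR | 3.0 | 3.0 | 5.0 | 1.0 | 3.0 | 1.0 | 3.0 | 1.0 | 3.8 |
| Melanoma and Immunotherapy, n=44 | Median | 6.0 | 6.0 | 6.0 | 3.0 | 6.0 | 3.0 | 5.5 | 2.8 | 5.8 |
| | Mean | 5.1 | 5.4 | 5.2 | 2.6 | 5.9 | 3.0 | 4.8 | 2.2 | 5.3 |
| | SD | 1.5 | 1.2 | 1.3 | 0.8 | 0.3 | 0.1 | 1.7 | 0.2 | 1.1 |
| | IQR | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 0.0 | 2.1 | 1.6 | 1.0 |
| Common Conditions, n=60 | Median | 6.0 | 6.0 | 6.0 | 3.0 | 6.0 | 3.0 | 6.0 | 3.0 | 5.8 |
| | Mean | 5.6 | 5.8 | 5.7 | 2.8 | 5.9 | 2.9 | 5.6 | 2.7 | 5.6 |
| | SD | 0.6 | 0.8 | 0.7 | 0.5 | 0.4 | 0.2 | 1.0 | 0.6 | 0.1 |
| | IQR | 0.5 | 0.1 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 |
| All, n=284 | Median | 5.0 | 6.0 | 5.5 | 3.0 | 6.0 | 3.0 | 5.5 | 3.0 | 5.0 |
| | Mean | 4.7 | 4.9 | 4.8 | 2.5 | 5.0 | 2.7 | 4.7 | 2.4 | 4.6 |
| | SD | 1.6 | 1.6 | 1.6 | 0.7 | 1.5 | 0.1 | 1.7 | 0.8 | 1.6 |
| | IQR | 2.6 | 2.0 | 2.0 | 1.0 | 1.0 | 0.5 | 2.6 | 1.0 | 2.0 |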

Abbreviations: SD, standard deviation; IQR, interquartile range.

The accuracy scale was a six-point Likert scale (1 – completely incorrect, 2 – more incorrect than correct, 3 – approximately equal correct and incorrect, 4 – more correct than incorrect, 5 – nearly all correct, 6 – correct), and the completeness scale was a three-point Likert scale (1 – incomplete, addresses some aspects of the question, but significant parts are missing or incomplete, 2 – adequate, addresses all aspects of the question and provides the minimum amount of information required to be considered complete, 3 – comprehensive, addresses all aspects of the question and provides additional information or context beyond what was expected). Answers that were completely incorrect on the accuracy scale (score of 1) were not graded on completeness.
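As a rough illustration of how the summary statistics in a table like this could be computed, the sketch below takes a small set of graded answers, drops completeness scores for answers with an accuracy score of 1 (per the grading rule above), and reports median, mean, SD, and IQR. The scores are made-up placeholders, not study data, and the function name `summarize` is hypothetical. The source does not say whether the SD is the sample or population form or which quantile convention defines the IQR; the sample SD and Python's "inclusive" quartile method are assumed here.

```python
import statistics


def summarize(scores):
    """Return median, mean, sample SD, and IQR (Q3 - Q1) for a list of scores.

    Assumptions (not stated in the source): sample standard deviation and
    the "inclusive" quartile method of statistics.quantiles.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
        "iqr": q3 - q1,
    }


# Hypothetical (accuracy, completeness) pairs; NOT data from the study.
# Accuracy is on the 1-6 scale, completeness on the 1-3 scale.
# An answer with accuracy 1 has no completeness grade (None).
graded = [(6, 3), (5, 2), (1, None), (4, 2), (6, 3), (3, 1)]

accuracy = [a for a, _ in graded]
# Completeness is summarized only over answers that were graded on it.
completeness = [c for a, c in graded if a != 1]

accuracy_summary = summarize(accuracy)
completeness_summary = summarize(completeness)

print("accuracy:", accuracy_summary)
print("completeness:", completeness_summary)
```

Because completely incorrect answers are excluded from completeness grading, the completeness statistics in such a calculation rest on fewer answers than the accuracy statistics for the same question set.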