[Preprint]. 2023 Feb 28:rs.3.rs-2566942. [Version 1] doi: 10.21203/rs.3.rs-2566942/v1

Table 2.

Accuracy and Completeness Scores for AI-Generated Answers to Medical Questions

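Question Set	Statistic	Descriptive Questions: Accuracy	Binary Questions: Accuracy	Both Question Types: Accuracy	Both Question Types: Completeness	Easy Difficulty: Accuracy	Easy Difficulty: Completeness	Medium Difficulty: Accuracy	Medium Difficulty: Completeness	Hard Difficulty: Accuracy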
Multispecialty, n=180	Median	5.0	5.0	5.0	3.0	5.0	3.0	5.0	3.0	5.0
	Mean	4.3	4.5	4.4	2.4	4.6	2.6	4.3	2.4	4.2
	SD	1.7	1.7	1.7	0.7	1.7	0.7	1.7	0.7	1.8
	IQR	3.0	3.0	5.0	1.0	3.0	1.0	3.0	1.0	3.8
Melanoma and Immunotherapy, n=44	Median	6.0	6.0	6.0	3.0	6.0	3.0	5.5	2.8	5.8
	Mean	5.1	5.4	5.2	2.6	5.9	3.0	4.8	2.2	5.3
	SD	1.5	1.2	1.3	0.8	0.3	0.1	1.7	0.2	1.1
	IQR	1.0	1.0	1.0	0.5	0.0	0.0	2.1	1.6	1.0
Common Conditions, n=60	Median	6.0	6.0	6.0	3.0	6.0	3.0	6.0	3.0	5.8
	Mean	5.6	5.8	5.7	2.8	5.9	2.9	5.6	2.7	5.6
	SD	0.6	0.8	0.7	0.5	0.4	0.2	1.0	0.6	0.1
	IQR	0.5	0.1	0.5	0.0	0.0	0.0	0.5	0.5	0.5
All, n=284	Median	5.0	6.0	5.5	3.0	6.0	3.0	5.5	3.0	5.0
	Mean	4.7	4.9	4.8	2.5	5.0	2.7	4.7	2.4	4.6
	SD	1.6	1.6	1.6	0.7	1.5	0.1	1.7	0.8	1.6
	IQR	2.6	2.0	2.0	1.0	1.0	0.5	2.6	1.0	2.0

Abbreviations: SD, standard deviation; IQR, interquartile range.

Accuracy was rated on a six-point Likert scale (1 – completely incorrect; 2 – more incorrect than correct; 3 – approximately equal correct and incorrect; 4 – more correct than incorrect; 5 – nearly all correct; 6 – correct). Completeness was rated on a three-point Likert scale (1 – incomplete: addresses some aspects of the question, but significant parts are missing or incomplete; 2 – adequate: addresses all aspects of the question and provides the minimum amount of information required to be considered complete; 3 – comprehensive: addresses all aspects of the question and provides additional information or context beyond what was expected). Answers rated completely incorrect on the accuracy scale (score of 1) were not graded on completeness.
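
The four statistics reported in Table 2 (median, mean, SD, IQR) and the rule that answers with an accuracy score of 1 receive no completeness grade can be illustrated with a minimal Python sketch. The scores below are hypothetical and are not data from the study. The function names `summarize` and `completeness_scores` are illustrative, and the choices of sample SD and the "inclusive" quartile method are assumptions, since the source does not state which conventions were used.

```python
import statistics


def summarize(scores):
    """Return the median, mean, SD, and IQR of a list of Likert scores.

    Assumptions (not stated in the source): SD is the sample standard
    deviation, and quartiles use the "inclusive" interpolation method.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
        "iqr": q3 - q1,
    }


def completeness_scores(graded):
    """Collect completeness scores, skipping answers with accuracy of 1.

    `graded` is a list of (accuracy, completeness) pairs; completeness is
    None when the answer was not graded on that scale.
    """
    return [c for a, c in graded if a > 1 and c is not None]


# Hypothetical (accuracy, completeness) ratings, for illustration only.
graded = [(6, 3), (5, 3), (1, None), (4, 2), (6, 3), (2, 1)]

accuracy = [a for a, _ in graded]
completeness = completeness_scores(graded)

acc_summary = summarize(accuracy)
comp_summary = summarize(completeness)

print("Accuracy:", acc_summary)
print("Completeness:", comp_summary)
```

Because completely incorrect answers are excluded from the completeness scale, the completeness statistics in this sketch are computed over fewer answers than the accuracy statistics.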