[Preprint]. 2023 May 2:rs.3.rs-2883198. [Version 1] doi: 10.21203/rs.3.rs-2883198/v1

Fig. 2. ClinicalQA Performance.

Comparison of Almanac and ChatGPT performance on the ClinicalQA dataset, as evaluated by physicians. Almanac outperforms ChatGPT with significant gains in factuality and marginal improvements in completeness. Although Almanac is more robust to adversarial prompts, both Almanac and ChatGPT exhibit hallucinations with omission. Despite these results, ChatGPT answers are preferred 57% of the time. Error bars show standard error (SE).
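For readers unfamiliar with the error bars, the following is a minimal sketch of how a standard error of the mean is commonly computed from a set of per-question ratings. The caption does not state which estimator the authors used, so the sample-standard-deviation form here is an assumption, and the function name `standard_error` and the ratings list are hypothetical illustrations, not data from the study.

```python
import math


def standard_error(scores):
    """Standard error of the mean: sample standard deviation / sqrt(n).

    Assumption: uses the unbiased (n - 1) sample variance. The caption
    does not say which variant the authors used.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = sum(scores) / n
    sample_variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(sample_variance / n)


# Hypothetical physician ratings (illustrative only, not from the paper).
hypothetical_ratings = [1, 2, 3, 4]
se = standard_error(hypothetical_ratings)
print(f"mean = {sum(hypothetical_ratings) / len(hypothetical_ratings):.2f}, SE = {se:.4f}")
```

An error bar drawn at the mean plus or minus this value reflects uncertainty in the estimated mean rating, not the spread of individual ratings.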