2025 Jan 8;31(3):943–950. doi: 10.1038/s41591-024-03423-7

Fig. 2. Independent long-form evaluation with physician raters.

Values are the proportion of ratings across answers where each axis was rated in the highest-quality bin. (For instance, ‘Possible harm extent = no harm’ reflects the proportion of answers where the extent of possible harm was rated ‘No harm.’) Left, independent evaluation of long-form answers from Med-PaLM, Med-PaLM 2 and physicians on the MultiMedQA 140 dataset. Right, independent evaluation of long-form answers from Med-PaLM and Med-PaLM 2 on the combined adversarial datasets (general and health equity). Detailed breakdowns are presented in Supplementary Tables 3 and 4. Error bars reflect 95% confidence intervals as determined by bootstrapping, centered on the mean proportions.
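
The caption does not say how the bootstrap was run, so the following is only a minimal sketch of one common variant, a percentile bootstrap over per-answer ratings. The function name `bootstrap_proportion_ci`, the 0/1 encoding of ratings, the number of resamples and the example counts are all assumptions made for illustration, not details from the paper. A percentile interval of this kind is also not necessarily symmetric about the mean proportion.

```python
import random


def bootstrap_proportion_ci(ratings, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a proportion.

    ratings: list of 0/1 flags, where 1 means the answer was rated in the
             highest-quality bin for one axis (hypothetical encoding).
    Returns (observed_proportion, lower_bound, upper_bound).
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(ratings)
    observed = sum(ratings) / n

    # Resample the ratings with replacement and record each proportion.
    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(ratings, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()

    # Take the alpha/2 and 1 - alpha/2 empirical quantiles as the bounds.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return observed, resampled[lo_idx], resampled[hi_idx]


if __name__ == "__main__":
    # Made-up example: 126 of 140 answers rated in the top bin.
    example_ratings = [1] * 126 + [0] * 14
    proportion, lower, upper = bootstrap_proportion_ci(example_ratings)
    print(f"proportion={proportion:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

In this sketch the resampling unit is the individual rating. If several raters scored the same answer, resampling whole answers instead might be more appropriate, but the caption does not say which unit was used.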