Fig. 2. Independent long-form evaluation with physician raters.
Values are the proportion of ratings across answers where each axis was rated in the highest-quality bin. (For instance, ‘Possible harm extent = no harm’ reflects the proportion of answers where the extent of possible harm was rated ‘No harm.’) Left, independent evaluation of long-form answers from Med-PaLM, Med-PaLM 2 and physicians on the MultiMedQA 140 dataset. Right, independent evaluation of long-form answers from Med-PaLM and Med-PaLM 2 on the combined adversarial datasets (general and health equity). Detailed breakdowns are presented in Supplementary Tables 3 and 4. Error bars reflect 95% confidence intervals as determined by bootstrapping, centered on the mean proportions.
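As an illustration of how such error bars can be obtained, the following minimal sketch computes a bootstrap 95% confidence interval for a proportion of top-bin ratings. The caption does not state the bootstrap variant or the resampling unit, so this sketch assumes a percentile bootstrap over individual ratings. The function name `bootstrap_proportion_ci`, its parameters, and the example data are hypothetical and are not taken from the study.

```python
import random


def bootstrap_proportion_ci(ratings, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the proportion of top-bin ratings.

    ratings: list of 0/1 flags (1 = rated in the highest-quality bin).
    Returns (proportion, lower, upper).
    Assumption: ratings are resampled individually; the source does not
    say whether resampling was done per rating, per answer, or per rater.
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    rng = random.Random(seed)
    n = len(ratings)
    point = sum(ratings) / n
    # Resample with replacement and recompute the proportion each time.
    stats = sorted(
        sum(rng.choices(ratings, k=n)) / n for _ in range(n_resamples)
    )
    # Empirical alpha/2 and 1 - alpha/2 quantiles of the resampled proportions.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, stats[lo_idx], stats[hi_idx]


# Hypothetical data: 126 of 140 ratings fall in the highest-quality bin.
ratings = [1] * 126 + [0] * 14
point, lower, upper = bootstrap_proportion_ci(ratings)
print(f"{point:.3f} [{lower:.3f}, {upper:.3f}]")
```

A percentile interval of this kind need not be symmetric about the point estimate, particularly for proportions near 0 or 1, so it may differ from whichever bootstrap procedure produced the intervals in the figure.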
