Table 2.
Rubric for the evaluation of chain-of-thought (CoT) reliabilityᵃ.
| Metric | Definition |
| --- | --- |
| Logical coherence and clarity | Assesses whether the reasoning process is internally consistent, logically structured, and expressed clearly and understandably. |
| Use and coverage of key information | Evaluates the extent to which the reasoning incorporates and addresses relevant clinical data points presented in the input. |
| Plausibility and clinical accuracy of reasoning | Measures whether the reasoning is clinically sound, aligns with standard medical knowledge, and leads to a reasonable interpretation or decision. Deduct points as appropriate across the 4 parts in the analysis. |
ᵃThe table defines the 3 dimensions (logical coherence and clarity, use and coverage of key information, and plausibility and clinical accuracy of reasoning) used by both human experts and the artificial intelligence evaluator to assess the quality of generated CoTs on a 5-point Likert scale.
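
As an illustration only, one rating under this rubric could be recorded as in the minimal Python sketch below. The class name `CoTRubricScore`, the field names, the 1-to-5 coding of the 5-point Likert scale, and the unweighted mean are all assumptions made for the example. The source does not specify how scores are coded or whether they are aggregated.

```python
from dataclasses import dataclass, fields

# Assumption: the 5-point Likert scale is coded as the integers 1 through 5.
LIKERT_MIN, LIKERT_MAX = 1, 5


@dataclass(frozen=True)
class CoTRubricScore:
    """One rater's scores for a single generated CoT (hypothetical structure)."""

    logical_coherence_and_clarity: int
    use_and_coverage_of_key_information: int
    plausibility_and_clinical_accuracy: int

    def __post_init__(self) -> None:
        # Reject anything that is not an integer on the assumed 1-5 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if type(value) is not int or not LIKERT_MIN <= value <= LIKERT_MAX:
                raise ValueError(
                    f"{f.name} must be an integer from {LIKERT_MIN} to {LIKERT_MAX}, got {value!r}"
                )

    def mean_score(self) -> float:
        # Assumption: an unweighted mean across the 3 dimensions;
        # the rubric itself does not prescribe any aggregation.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    rating = CoTRubricScore(
        logical_coherence_and_clarity=5,
        use_and_coverage_of_key_information=4,
        plausibility_and_clinical_accuracy=3,
    )
    print(rating.mean_score())
```

Keeping the 3 dimensions as separate fields allows human and artificial intelligence ratings to be compared per dimension, not only through a single summary number.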